# AgentChangeBench: A Multi-Dimensional Evaluation Framework for Goal-Shift Robustness in Conversational AI
- Motivation
- What's in the Benchmark
- Evaluation Metrics
- Key Findings
- Domains
- Installation
- Quick Start
- CLI Reference
- Evaluate Your Own Agent
- Simulation Architecture
- Citation
- Authors
- License
## Motivation

Most LLM agent benchmarks assume user goals stay fixed throughout a conversation. This oversimplifies real-world deployments, where users frequently re-prioritize tasks, introduce new constraints, or shift objectives mid-dialogue. For example, a banking customer might authenticate their identity, pivot to reviewing transactions, and then escalate to disputing a fraudulent charge, all in one interaction.
AgentChangeBench is the first benchmark explicitly designed to test how tool-augmented agents detect, adapt to, and recover from mid-conversation goal shifts, while also measuring how well they tailor communication to users with different levels of expertise, cooperation, and trust.
Accepted to the NeurIPS 2025 Workshop on Multi-Turn Interactions in Large Language Models.
## What's in the Benchmark

- 315 systematically validated tasks across 3 enterprise domains, each annotated with explicit goal sequences
- 2,835 task sequences total, generated across trials and personas
- 5 user personas (`EASY_1`, `EASY_2`, `MEDIUM_1`, `MEDIUM_2`, `HARD_1`) varying in expertise, cooperation, and trust, each designed to trigger realistic shift points
- 3 evaluation domains: banking, retail, and airline
## Evaluation Metrics

AgentChangeBench goes beyond binary pass@k scores with four complementary metrics:
### Task Success Rate (TSR)

Measures whether the agent completed the intended task, via a weighted average across three evaluation channels:

```
TSR = 0.25 × communicate_info_rate + 0.45 × action_rate + 0.30 × nl_assertion_rate
```
### Tool Usage Efficiency (TUE)

Combines tool correctness `T` (fraction of tool calls that execute successfully) and parameter validity `P` (fraction of calls whose arguments satisfy the schema):

```
TUE = 0.6 × T + 0.4 × P
```
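Both formulas can be written as one-liners; in this sketch the inputs are assumed to be rates in `[0, 1]` (the benchmark's own scoring pipeline computes them from simulation logs):

```python
def tsr(communicate_info_rate: float, action_rate: float, nl_assertion_rate: float) -> float:
    """Task Success Rate: weighted average over the three evaluation channels."""
    return 0.25 * communicate_info_rate + 0.45 * action_rate + 0.30 * nl_assertion_rate

def tue(t: float, p: float) -> float:
    """Tool Usage Efficiency: combines tool correctness T and parameter validity P."""
    return 0.6 * t + 0.4 * p

# An agent that nails every action but communicates imperfectly:
print(round(tsr(0.8, 1.0, 0.9), 3))  # 0.92
print(round(tue(0.95, 1.0), 3))      # 0.97
```

The action channel carries the largest weight (0.45), so tool-side failures drag TSR down faster than communication slips.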
### Tool Call Redundancy Rate (TCRR)

Measures wasted effort: how many tool calls were redundant after a goal shift occurred.
### Goal Shift Recovery Time (GSRT)

Measures adaptation latency: the number of turns from a user goal shift to acknowledgment, to the first relevant tool call, and to task completion. Lower is better.
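As a rough illustration of these two metrics (the exact definitions live in the paper; the duplicate-call heuristic and turn indexing below are assumptions for the sketch, not the benchmark's implementation):

```python
def tcrr(post_shift_calls: list) -> float:
    """Fraction of post-goal-shift tool calls that exactly repeat an earlier one.
    Each call is modeled as a (tool_name, args_tuple) pair."""
    seen, redundant = set(), 0
    for call in post_shift_calls:
        if call in seen:
            redundant += 1
        seen.add(call)
    return redundant / len(post_shift_calls) if post_shift_calls else 0.0

def gsrt(shift_turn: int, first_relevant_turn: int) -> int:
    """Turns elapsed from the goal shift to the first relevant tool call."""
    return first_relevant_turn - shift_turn

# One of three calls repeats an earlier lookup:
calls = [("get_balance", ()), ("get_balance", ()), ("dispute_charge", ("tx42",))]
print(tcrr(calls))
print(gsrt(4, 6))  # shift at turn 4, first relevant call at turn 6
```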
| Feature | τ²-bench | AgentChangeBench |
|---|---|---|
| Goal dynamics | Static | Mid-dialogue shifts |
| Persona coverage | Limited | 5 distinct personas |
| Primary metric | pass@k | TSR, TUE, TCRR, GSRT |
| Tool evaluation | Basic | Correctness + validity + redundancy |
| Recovery measurement | ❌ | ✅ |
## Key Findings

Experiments across GPT-4o, Claude-3.7-Sonnet, and Gemini-2.5-Flash reveal sharp contrasts hidden by traditional accuracy metrics:
- Claude-3.7-Sonnet recovers fastest from goal shifts across all domains
- GPT-4o delivers the most balanced cross-domain performance, reaching 92.2% recovery on airline booking goal shifts
- Gemini-2.5-Flash drops to 48.6% recovery in banking but remains competitive in retail
- Retail tasks show near-perfect parameter validity yet redundancy rates above 80%: agents kept making unnecessary tool calls even after achieving what they needed
High raw accuracy does not imply robustness under dynamic goals. Measuring recovery time and redundancy is essential.
## Domains

Each domain defines a policy the agent must follow, a set of available tools, and a set of goal-shifted task sequences. The `mock` domain is a sandbox for development.
| Domain | Description |
|---|---|
| `banking` | Identity auth → transaction review → fraud dispute workflows |
| `retail` | Order management and product inquiries with mid-session pivots |
| `airline` | Flight booking, changes, and cancellations |
| `mock` | Minimal sandbox for development and testing |
## Installation

Requires Python 3.10+.
```shell
# 1. Clone the repository
git clone https://github.com/Maniktherana/AgentChangeBench
cd AgentChangeBench

# 2. Install with uv
uv sync
```
This installs all dependencies and enables the `tau2` CLI.
> **Note:** If you use `uv pip install .` instead of `uv sync`, set the data directory manually:
>
> ```shell
> export TAU2_DATA_DIR=/path/to/your/tau2-bench/data
> ```
Verify your setup after installation:

```shell
tau2 check-data
```
Clean generated files and the virtual environment:

```shell
make clean
```
## Quick Start

AgentChangeBench uses LiteLLM, so any LiteLLM-compatible LLM provider works.
```shell
cp .env.example .env
# Edit .env with your API keys
```
Run a quick evaluation on 5 tasks with 1 trial each:
```shell
tau2 run \
  --domain airline \
  --agent-llm gpt-4.1 \
  --user-llm gpt-4.1 \
  --agent llm_agent \
  --user banking_user_simulator \
  --num-trials 1 \
  --num-tasks 5
```
Results are saved to `data/tau2/simulations/`.
## CLI Reference

Run a benchmark:

```shell
tau2 run \
  --domain <domain> \
  --agent-llm <llm_name> \
  --user-llm <llm_name> \
  --num-trials <trial_count> \
  --task-ids <task_ids> \
  --max-concurrency <concurrent_sims>
```
```shell
tau2 view
```
Browse simulation files, view per-metric agent performance, inspect individual simulations, and explore task details.
```shell
tau2 check-data
```
## Evaluate Your Own Agent

To plug in a local or remote custom agent, see the agent developer guide. All domain-specific policy and API documentation available to agent developers can be viewed with `tau2 domain <domain>`.
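For orientation, a toy agent might look like the sketch below. The class shape and method names mirror the message flow in the architecture diagram rather than the actual tau2 interface, so treat every name here as a placeholder; the developer guide defines the real contract:

```python
# Hypothetical agent sketch: method names follow the orchestrator's message
# flow (get_init_state_info / generate_next_message), not the real tau2 API.
class EchoAgent:
    def get_init_state_info(self, message_history: list) -> dict:
        # Whatever per-conversation state the agent carries between turns.
        return {"turns": 0}

    def generate_next_message(self, msg: str, state_info: dict):
        state_info["turns"] += 1
        reply = f"[turn {state_info['turns']}] You said: {msg}"
        return reply, state_info

agent = EchoAgent()
state = agent.get_init_state_info([])
reply, state = agent.generate_next_message("I want to dispute a charge.", state)
print(reply)
```

The key design point is that the agent is stateless between calls except for the `state_info` object the orchestrator threads back to it each turn.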
## Simulation Architecture

The orchestrator passes messages between a user simulator, an agent, and a domain environment. On each turn, one of three transitions occurs:
```mermaid
sequenceDiagram
    participant O as Orchestrator
    participant A as Agent
    participant U as UserSimulator
    participant E as Environment
    Note over O: Initialize(task)
    O->>A: get_init_state_info(message_history)
    A->>O: agent_state_info
    O->>U: get_init_state_info(message_history)
    U->>O: user_state_info
    O->>E: set_state(init_data, init_actions, history)
    loop Each turn
        alt Agent/Env → User
            O->>U: generate_next_message(msg, user_state_info)
            U-->>O: (user_msg, user_state_info)
            Note over O: Check STOP signal
        else User/Env → Agent
            O->>A: generate_next_message(msg, agent_state_info)
            A-->>O: (assistant_msg, agent_state_info)
        else Tool call → Environment
            O->>E: get_response(tool_call)
            E-->>O: tool_message
        end
        Note over O: Check max turns
    end
    Note over O: Return simulation run
```
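The turn loop can be sketched in plain Python. This is a simplified stand-in, not tau2's orchestrator: the tool-call detection (a `TOOL:` string prefix) and stop handling are assumptions made for illustration:

```python
def run_simulation(agent, user, env, max_turns: int = 20):
    """Route messages between agent, user simulator, and environment
    until the user signals STOP or the turn budget runs out."""
    msg, target = "<start>", "agent"  # the first message goes to the agent
    transcript = []
    for _ in range(max_turns):
        if target == "agent":
            msg = agent(msg)
            transcript.append(("assistant", msg))
            # Tool calls go to the environment; plain text goes to the user.
            target = "env" if msg.startswith("TOOL:") else "user"
        elif target == "env":
            msg = env(msg)
            transcript.append(("tool", msg))
            target = "agent"
        else:
            msg = user(msg)
            transcript.append(("user", msg))
            if msg == "STOP":  # user simulator ends the conversation
                break
            target = "agent"
    return transcript
```

With stub callables (an agent that makes one tool call, then answers; a user who immediately stops), the transcript comes out as assistant → tool → assistant → user, matching the three transition branches in the diagram.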
## Citation

If you use AgentChangeBench in your research, please cite:
```bibtex
@misc{rana2025agentchangebench,
  title         = {AgentChangeBench: A Multi-Dimensional Evaluation Framework for Goal-Shift Robustness in Conversational AI},
  author        = {Manik Rana and Calissa Man and Anotida Expected Msiiwa and Jeffrey Paine and Kevin Zhu and Sunishchal Dev and Vasu Sharma and Ahan M R},
  year          = {2025},
  eprint        = {2510.18170},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI},
  url           = {https://arxiv.org/abs/2510.18170}
}
```
## Authors

Manik Rana · Calissa Man · Anotida Expected Msiiwa · Jeffrey Paine · Kevin Zhu · Sunishchal Dev · Vasu Sharma · Ahan M R