MAESTRO is a framework-agnostic evaluation suite for LLM-based multi-agent systems (MAS), developed to benchmark, observe, and analyze MAS behavior from a systems perspective.
The stochastic and dynamic execution of LLM-powered MAS makes it difficult to debug and analyze system performance, understand resource usage per component, and determine whether agents should be split or optimized for efficiency. MAESTRO addresses this by providing a representative set of ready-to-run scenarios, a unified evaluation interface, and framework-agnostic metrics to measure performance, reliability, and system-level behavior across heterogeneous MAS setups.
What MAESTRO provides:
- A repository of representative MAS instances
- A unified configuration interface for consistent cross-framework MAS evaluation
- Execution traces and system-level signals including latency, network utilization, and failures
The MAS examples live under the `examples` directory.
You can run multiple benchmarks with the `run_benchmarks.py` script in that folder.
For example-specific instructions, follow the links in the table below to each example's README.
| Example | Application Field | Framework | Interaction Type | # Agents | # Tools | Data In | Data Out |
|---|---|---|---|---|---|---|---|
| Fin. Analyzer | Finance | MCP-Agent | Correct | 6 | 1 | Artifacts | Opn-End |
| Img. Scr. | Creativity | ADK | Debate | 4 | 2 | Artifacts | Cls-Form |
| Marketing | Marketing | ADK | Coord. | 4 | 1 | Artifacts | Opn-End |
| Brand SEO | Marketing | ADK | Coord. | 4 | 10 | Artifacts | Opn-End |
| Content Creat. | Creativity | ADK | Plan | 4 | 1 | Artifacts | Opn-End |
| Mag.-One | Cross-domain | Autogen | Plan | 4 | 0 | Artifacts | Opn-End |
| Stock Res. | Finance | Autogen | Coord. | 4 | 2 | Artifacts | Opn-End |
| Travel Plan. | Travel | Autogen | Coord. | 4 | 0 | Artifacts | Opn-End |
| ToT | Cross-domain | LangGraph | Debate | 3 | 0 | Artifacts | Cls-Form |
| CRAG | Cross-domain | LangGraph | Coord. | 5 | 2 | Datasets | Opn-End |
| Plan & Exec. | Cross-domain | LangGraph | Plan | 3 | 1 | Datasets | Opn-End |
| LATS | Cross-domain | LangGraph | Plan | 3 | 1 | Datasets | Opn-End |
We open-source our collected datasets for MAS execution and resource usage, which are available at https://huggingface.co/datasets/kaust-generative-ai/maestro-mas-benchmark
- Trace file: `data/example/run_20251223_020529.otel.jsonl`
- Summary: 76 spans, 25 `call_llm` spans, 51 `invoke_agent` spans, 70,197 input tokens, 2,974 output tokens, ~138.7 s wall time, max replanner retry attempt 24, final status ERROR (`GraphRecursionError`).
```json
{
  "trace_id": "01a0b8e663cdb64112fe4c44291b0e33",
  "span_id": "000c8e48d640b5e7",
  "name": "plan_execute.call_llm.planner",
  "agent_name": "plan_and_execute_benchmark.llm.planner",
  "start_time": 1766455529802170196,
  "end_time": 1766455532579976422,
  "duration_ns": 2777806226,
  "attributes": {
    "gen_ai.operation.name": "call_llm",
    "langgraph.phase": "planner",
    "gen_ai.request.model": "gpt-4o-mini",
    "gen_ai.usage.total_tokens": 251,
    "gen_ai.llm.call.count": 1,
    "agent.output.useless": false
  }
}
```
```json
{
  "trace_id": "01a0b8e663cdb64112fe4c44291b0e33",
  "span_id": "093c5ef419cb8831",
  "name": "plan_execute.node.planner",
  "agent_name": "plan_and_execute_benchmark.node.planner",
  "start_time": 1766455529802089867,
  "end_time": 1766455532583673105,
  "duration_ns": 2781583238,
  "attributes": {
    "gen_ai.operation.name": "invoke_agent",
    "plan_execute.node": "planner",
    "plan_execute.plan.step_count": 4,
    "plan_execute.plan.preview": [
      "Identify the time period of the Russian Civil War...",
      "Determine the key events leading to the defeat...",
      "Research the date when the Russian Civil War effectively ended..."
    ]
  }
}
```
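In these spans, `duration_ns` is simply `end_time - start_time` in nanoseconds, so span latencies can be cross-checked directly from the timestamps. A quick sanity check using the `plan_execute.node.planner` span above:

```python
# Timestamps copied from the plan_execute.node.planner span above.
start_ns = 1766455529802089867
end_ns = 1766455532583673105

duration_ns = end_ns - start_ns
assert duration_ns == 2781583238  # matches the span's duration_ns field

# Convert to seconds for human-readable latency reporting.
duration_s = duration_ns / 1e9
print(f"planner node took {duration_s:.3f} s")
```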
```json
{
  "trace_id": "01a0b8e663cdb64112fe4c44291b0e33",
  "span_id": "7def16b387d7b677",
  "name": "plan_execute.run",
  "agent_name": "plan_and_execute_benchmark.run",
  "start_time": 1766455529800856493,
  "end_time": 1766455668502872238,
  "duration_ns": 138702015745,
  "status": {
    "status_code": "ERROR",
    "description": "GraphRecursionError: Recursion limit of 50 reached without hitting a stop condition..."
  },
  "attributes": {
    "gen_ai.operation.name": "invoke_agent",
    "run.outcome": "failure",
    "run.judgement": "wrong"
  }
}
```

If you find MAESTRO or its dataset useful in your research, please consider citing the following paper:
```bibtex
@misc{maestro,
  title={MAESTRO: Multi-Agent Evaluation Suite for Testing, Reliability, and Observability},
  author={Tie Ma and Yixi Chen and Vaastav Anand and Alessandro Cornacchia and Amândio R. Faustino and Guanheng Liu and Shan Zhang and Hongbin Luo and Suhaib A. Fahmy and Zafar A. Qazi and Marco Canini},
  year={2026},
  eprint={2601.00481},
  archivePrefix={arXiv},
  primaryClass={cs.NI},
  url={https://arxiv.org/abs/2601.00481},
}
```
```bash
git clone git@github.com:sands-lab/maestro.git
cd maestro
uv sync

# Install pre-commit hooks
uv run -- pre-commit install
```
