MAESTRO is a framework-agnostic evaluation suite for LLM-based multi-agent systems (MAS), developed to benchmark, observe, and analyze MAS behavior from a systems perspective.
The stochastic and dynamic execution of LLM-powered MAS makes it difficult to debug and analyze system performance, understand resource usage per component, and determine whether agents should be split or optimized for efficiency. MAESTRO addresses this by providing a representative set of ready-to-run scenarios, a unified evaluation interface, and framework-agnostic metrics to measure performance, reliability, and system-level behavior across heterogeneous MAS setups.
What MAESTRO provides:
- A repository of representative MAS instances
- A unified configuration interface for consistent cross-framework MAS evaluation
- Execution traces and system-level signals including latency, network utilization, and failures
The MAS examples live under the `examples` directory.
You can run multiple benchmarks with the `run_benchmarks.py` script in that folder.
For example-specific instructions, follow the links in the table below to each example's README.
| Example | Application Field | Framework | Interaction Type | # Agents | # Tools | Data In | Data Out |
|---|---|---|---|---|---|---|---|
| Fin. Analyzer | Finance | MCP-Agent | Correct | 6 | 1 | Artifacts | Opn-End |
| Img. Scr. | Creativity | ADK | Debate | 4 | 2 | Artifacts | Cls-Form |
| Marketing | Marketing | ADK | Coord. | 4 | 1 | Artifacts | Opn-End |
| Brand SEO | Marketing | ADK | Coord. | 4 | 10 | Artifacts | Opn-End |
| Content Creat. | Creativity | ADK | Plan | 4 | 1 | Artifacts | Opn-End |
| Mag.-One | Cross-domain | Autogen | Plan | 4 | 0 | Artifacts | Opn-End |
| Stock Res. | Finance | Autogen | Coord. | 4 | 2 | Artifacts | Opn-End |
| Travel Plan. | Travel | Autogen | Coord. | 4 | 0 | Artifacts | Opn-End |
| ToT | Cross-domain | LangGraph | Debate | 3 | 0 | Artifacts | Cls-Form |
| CRAG | Cross-domain | LangGraph | Coord. | 5 | 2 | Datasets | Opn-End |
| Plan & Exec. | Cross-domain | LangGraph | Plan | 3 | 1 | Datasets | Opn-End |
| LATS | Cross-domain | LangGraph | Plan | 3 | 1 | Datasets | Opn-End |
We open-source our collected datasets for MAS execution and resource usage, which are available at https://huggingface.co/datasets/kaust-generative-ai/maestro-mas-benchmark
- Trace file: `data/example/run_20251223_020529.otel.jsonl`
- Summary: 76 spans, 25 `call_llm` spans, 51 `invoke_agent` spans, 70,197 input tokens, 2,974 output tokens, ~138.7 s wall time, max replanner retry attempt 24, final status ERROR (`GraphRecursionError`).
```json
{
  "trace_id": "01a0b8e663cdb64112fe4c44291b0e33",
  "span_id": "000c8e48d640b5e7",
  "name": "plan_execute.call_llm.planner",
  "agent_name": "plan_and_execute_benchmark.llm.planner",
  "start_time": 1766455529802170196,
  "end_time": 1766455532579976422,
  "duration_ns": 2777806226,
  "attributes": {
    "gen_ai.operation.name": "call_llm",
    "langgraph.phase": "planner",
    "gen_ai.request.model": "gpt-4o-mini",
    "gen_ai.usage.total_tokens": 251,
    "gen_ai.llm.call.count": 1,
    "agent.output.useless": false
  }
}
```
```json
{
  "trace_id": "01a0b8e663cdb64112fe4c44291b0e33",
  "span_id": "093c5ef419cb8831",
  "name": "plan_execute.node.planner",
  "agent_name": "plan_and_execute_benchmark.node.planner",
  "start_time": 1766455529802089867,
  "end_time": 1766455532583673105,
  "duration_ns": 2781583238,
  "attributes": {
    "gen_ai.operation.name": "invoke_agent",
    "plan_execute.node": "planner",
    "plan_execute.plan.step_count": 4,
    "plan_execute.plan.preview": [
      "Identify the time period of the Russian Civil War...",
      "Determine the key events leading to the defeat...",
      "Research the date when the Russian Civil War effectively ended..."
    ]
  }
}
```
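In these spans, `duration_ns` is simply `end_time - start_time` in nanoseconds, so span latencies can be cross-checked directly from the timestamps. A quick sanity check using the `plan_execute.node.planner` span above:

```python
# Timestamps copied from the plan_execute.node.planner span above.
start_ns = 1766455529802089867
end_ns = 1766455532583673105

duration_ns = end_ns - start_ns
assert duration_ns == 2781583238  # matches the span's duration_ns field

# Convert to seconds for human-readable latency reporting.
duration_s = duration_ns / 1e9
print(f"planner node took {duration_s:.3f} s")
```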
```json
{
  "trace_id": "01a0b8e663cdb64112fe4c44291b0e33",
  "span_id": "7def16b387d7b677",
  "name": "plan_execute.run",
  "agent_name": "plan_and_execute_benchmark.run",
  "start_time": 1766455529800856493,
  "end_time": 1766455668502872238,
  "duration_ns": 138702015745,
  "status": {
    "status_code": "ERROR",
    "description": "GraphRecursionError: Recursion limit of 50 reached without hitting a stop condition..."
  },
  "attributes": {
    "gen_ai.operation.name": "invoke_agent",
    "run.outcome": "failure",
    "run.judgement": "wrong"
  }
}
```

If you find MAESTRO or its dataset useful in your research, please consider citing the following paper:
```bibtex
@misc{maestro,
  title={MAESTRO: Multi-Agent Evaluation Suite for Testing, Reliability, and Observability},
  author={Tie Ma and Yixi Chen and Vaastav Anand and Alessandro Cornacchia and Amândio R. Faustino and Guanheng Liu and Shan Zhang and Hongbin Luo and Suhaib A. Fahmy and Zafar A. Qazi and Marco Canini},
  year={2026},
  eprint={2601.00481},
  archivePrefix={arXiv},
  primaryClass={cs.NI},
  url={https://arxiv.org/abs/2601.00481},
}
```
```bash
git clone git@github.com:sands-lab/maestro.git
cd maestro
uv sync

# Install pre-commit hooks
uv run -- pre-commit install
```
