
MAESTRO logo

MAESTRO: Multi-Agent Evaluation Suite for Testing, Reliability, and Observability

MAESTRO is a framework-agnostic evaluation suite for LLM-based multi-agent systems (MAS), developed to benchmark, observe, and analyze MAS behavior from a systems perspective.

The stochastic and dynamic execution of LLM-powered MAS makes it difficult to debug and analyze system performance, understand resource usage per component, and determine whether agents should be split or optimized for efficiency. MAESTRO addresses this by providing a representative set of ready-to-run scenarios, a unified evaluation interface, and framework-agnostic metrics to measure performance, reliability, and system-level behavior across heterogeneous MAS setups.

What MAESTRO provides:

  • A repository of representative MAS instances
  • A unified configuration interface for consistent cross-framework MAS evaluation
  • Execution traces and system-level signals including latency, network utilization, and failures

MAESTRO architecture overview

Available MAS examples

MAS examples can be found under the examples directory. You can run multiple benchmarks using the run_benchmarks.py script inside that folder. For example-specific instructions, click the links in the table below to open each example's README.

| Example | App. Field | Framework | Int. Type | #Agents | #Tools | Data In | Data Out |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Fin. Analyzer | Finance | MCP-Agent | Correct | 6 | 1 | Artifacts | Opn-End |
| Img. Scr. | Creativity | ADK | Debate | 4 | 2 | Artifacts | Cls-Form |
| Marketing | Marketing | ADK | Coord. | 4 | 1 | Artifacts | Opn-End |
| Brand SEO | Marketing | ADK | Coord. | 4 | 10 | Artifacts | Opn-End |
| Content Creat. | Creativity | ADK | Plan | 4 | 1 | Artifacts | Opn-End |
| Mag.-One | Cross-domain | Autogen | Plan | 4 | 0 | Artifacts | Opn-End |
| Stock Res. | Finance | Autogen | Coord. | 4 | 2 | Artifacts | Opn-End |
| Travel Plan. | Travel | Autogen | Coord. | 4 | 0 | Artifacts | Opn-End |
| ToT | Cross-domain | LangGraph | Debate | 3 | 0 | Artifacts | Cls-Form |
| CRAG | Cross-domain | LangGraph | Coord. | 5 | 2 | Datasets | Opn-End |
| Plan & Exec. | Cross-domain | LangGraph | Plan | 3 | 1 | Datasets | Opn-End |
| LATS | Cross-domain | LangGraph | Plan | 3 | 1 | Datasets | Opn-End |

Datasets

We open-source our collected datasets of MAS execution and resource usage, which are available at https://huggingface.co/datasets/kaust-generative-ai/maestro-mas-benchmark
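To inspect the dataset programmatically without any extra dependencies, one option is the Hugging Face Hub's public REST API. The sketch below only assumes the Hub's standard `/api/datasets/{repo_id}` route; the fields printed at the end (`id`, `downloads`) are common Hub metadata keys, not something this README documents.

```python
# Sketch: fetch MAESTRO dataset metadata from the Hugging Face Hub
# REST API using only the standard library. Network access is needed
# only when fetch_metadata() actually runs.
import json
import urllib.request

REPO_ID = "kaust-generative-ai/maestro-mas-benchmark"

def metadata_url(repo_id: str) -> str:
    """Build the Hub's public dataset-metadata endpoint URL."""
    return f"https://huggingface.co/api/datasets/{repo_id}"

def fetch_metadata(repo_id: str = REPO_ID) -> dict:
    """Download and decode the JSON metadata for a dataset repo."""
    with urllib.request.urlopen(metadata_url(repo_id)) as resp:
        return json.load(resp)

if __name__ == "__main__":
    meta = fetch_metadata()
    print(meta.get("id"), meta.get("downloads"))
```

For bulk use (e.g. training or replaying traces), the `datasets` or `huggingface_hub` Python packages are the more convenient route; the stdlib sketch above is just a dependency-free way to check that the repo exists and see its card metadata.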

Trace Examples

Plan-and-Execute (LangGraph, gpt-4o-mini + Tavily)

  • Trace file: data/example/run_20251223_020529.otel.jsonl
  • Summary: 76 spans, 25 call_llm spans, 51 invoke_agent spans, 70,197 input tokens, 2,974 output tokens, ~138.7s wall time, max replanner retry attempt 24, final status ERROR (GraphRecursionError).
{
  "trace_id": "01a0b8e663cdb64112fe4c44291b0e33",
  "span_id": "000c8e48d640b5e7",
  "name": "plan_execute.call_llm.planner",
  "agent_name": "plan_and_execute_benchmark.llm.planner",
  "start_time": 1766455529802170196,
  "end_time": 1766455532579976422,
  "duration_ns": 2777806226,
  "attributes": {
    "gen_ai.operation.name": "call_llm",
    "langgraph.phase": "planner",
    "gen_ai.request.model": "gpt-4o-mini",
    "gen_ai.usage.total_tokens": 251,
    "gen_ai.llm.call.count": 1,
    "agent.output.useless": false
  }
}
{
  "trace_id": "01a0b8e663cdb64112fe4c44291b0e33",
  "span_id": "093c5ef419cb8831",
  "name": "plan_execute.node.planner",
  "agent_name": "plan_and_execute_benchmark.node.planner",
  "start_time": 1766455529802089867,
  "end_time": 1766455532583673105,
  "duration_ns": 2781583238,
  "attributes": {
    "gen_ai.operation.name": "invoke_agent",
    "plan_execute.node": "planner",
    "plan_execute.plan.step_count": 4,
    "plan_execute.plan.preview": [
      "Identify the time period of the Russian Civil War...",
      "Determine the key events leading to the defeat...",
      "Research the date when the Russian Civil War effectively ended..."
    ]
  }
}
{
  "trace_id": "01a0b8e663cdb64112fe4c44291b0e33",
  "span_id": "7def16b387d7b677",
  "name": "plan_execute.run",
  "agent_name": "plan_and_execute_benchmark.run",
  "start_time": 1766455529800856493,
  "end_time": 1766455668502872238,
  "duration_ns": 138702015745,
  "status": {
    "status_code": "ERROR",
    "description": "GraphRecursionError: Recursion limit of 50 reached without hitting a stop condition..."
  },
  "attributes": {
    "gen_ai.operation.name": "invoke_agent",
    "run.outcome": "failure",
    "run.judgement": "wrong"
  }
}
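Summaries like the one above (span counts per operation, token totals, wall time) can be recomputed directly from the JSONL trace. The sketch below embeds abridged copies of the three spans shown here; a real run would instead read `data/example/run_*.otel.jsonl` line by line. Treating the longest span as the whole-run wall time is a simplification that holds when the root `plan_execute.run` span encloses everything else.

```python
import json

# Abridged copies of the spans shown above, one JSON object per line,
# mirroring the .otel.jsonl on-disk format.
SPANS_JSONL = """\
{"name": "plan_execute.call_llm.planner", "duration_ns": 2777806226, "attributes": {"gen_ai.operation.name": "call_llm", "gen_ai.usage.total_tokens": 251}}
{"name": "plan_execute.node.planner", "duration_ns": 2781583238, "attributes": {"gen_ai.operation.name": "invoke_agent"}}
{"name": "plan_execute.run", "duration_ns": 138702015745, "attributes": {"gen_ai.operation.name": "invoke_agent"}}
"""

def summarize(lines):
    """Count spans per gen_ai.operation.name, sum token usage, and
    estimate wall time from the longest (root) span."""
    ops, tokens, wall_ns = {}, 0, 0
    for line in lines:
        span = json.loads(line)
        attrs = span.get("attributes", {})
        op = attrs.get("gen_ai.operation.name", "unknown")
        ops[op] = ops.get(op, 0) + 1
        tokens += attrs.get("gen_ai.usage.total_tokens", 0)
        wall_ns = max(wall_ns, span.get("duration_ns", 0))
    return {"ops": ops, "total_tokens": tokens, "wall_s": wall_ns / 1e9}

if __name__ == "__main__":
    print(summarize(SPANS_JSONL.splitlines()))
```

Run against the full 76-span trace file, the same loop reproduces the per-operation counts and the ~138.7 s wall time quoted in the summary above.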

Citation

If you find MAESTRO or its dataset useful in your research, please consider citing the following paper:

@misc{maestro,
      title={MAESTRO: Multi-Agent Evaluation Suite for Testing, Reliability, and Observability},
      author={Tie Ma and Yixi Chen and Vaastav Anand and Alessandro Cornacchia and Amândio R. Faustino and Guanheng Liu and Shan Zhang and Hongbin Luo and Suhaib A. Fahmy and Zafar A. Qazi and Marco Canini},
      year={2026},
      eprint={2601.00481},
      archivePrefix={arXiv},
      primaryClass={cs.NI},
      url={https://arxiv.org/abs/2601.00481},
}

Developing MAESTRO

git clone git@github.com:sands-lab/maestro.git
cd maestro
uv sync
# Install pre-commit hooks
uv run -- pre-commit install
