A protocol and evaluation framework for multi-agent systems.
Building agentic AI is easy. Knowing if it works is hard.
Without systematic evaluation:
- You ship based on vibes ("it seems to work")
- Regressions sneak in with every prompt change
- You can't compare approaches objectively
- Production issues surprise you
This repo implements evaluation-first development for LLM agents.
```
┌───────────────────────────────────────────────────────────────┐
│                      Evaluation Pipeline                      │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│   TestCase         Dataset          EvalRun       Comparison  │
│  ┌─────────┐     ┌─────────┐     ┌─────────┐     ┌─────────┐  │
│  │ input   │     │ cases[] │     │ results │     │ baseline│  │
│  │ expected│ ──▶ │ name    │ ──▶ │ metrics │ ──▶ │ vs exp  │  │
│  │ eval_fn │     │ tags    │     │pass/fail│     │ verdict │  │
│  └─────────┘     └─────────┘     └─────────┘     └─────────┘  │
│                                                               │
│  YAML-defined    Batch runner    Aggregates      Regression   │
│  test cases      with tracing    + p95 latency   detection    │
│                                                               │
└───────────────────────────────────────────────────────────────┘
```
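The four structures in the diagram can be pictured as plain dataclasses. The following is a hypothetical sketch of their shapes, including how `EvalRun` might aggregate pass rate and p95 latency — the real definitions live in `eval/runner.py` and may differ:

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical shapes mirroring the diagram; not the actual framework classes.
@dataclass
class TestCase:
    id: str
    input: dict[str, Any]
    expected: dict[str, Any]
    eval_type: str = "keys"  # how the output is judged

@dataclass
class Dataset:
    name: str
    cases: list[TestCase] = field(default_factory=list)
    tags: list[str] = field(default_factory=list)

@dataclass
class CaseResult:
    case_id: str
    passed: bool
    latency_ms: float

@dataclass
class EvalRun:
    experiment: str
    results: list[CaseResult] = field(default_factory=list)

    @property
    def pass_rate(self) -> float:
        return sum(r.passed for r in self.results) / len(self.results)

    @property
    def p95_latency_ms(self) -> float:
        # Nearest-rank 95th percentile over observed latencies.
        latencies = sorted(r.latency_ms for r in self.results)
        return latencies[int(0.95 * (len(latencies) - 1))]
```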
Define expectations declaratively:

```yaml
cases:
  - id: get-nuclearn-context
    description: Get full context for opportunity
    agent: pipeline
    capability: get_opportunity_context
    input_data:
      company: nuclearn
    expected_keys:
      - summary
      - compensation
      - tech_stack
      - fit_assessment
    eval_type: keys
```

Run test suites and get metrics:
```python
runner = EvalRunner()
baseline = await runner.run(
    dataset=Dataset.from_yaml("datasets/pipeline-basic.yaml"),
    agent_fn=call_agent,
    experiment="baseline",
)
# Results:
# - Pass rate: 87.5%
# - Avg latency: 234ms
# - p95 latency: 512ms
```

Compare experiments objectively:
```python
comparison = runner.compare("baseline", "v2-with-cache")
print(comparison.summary)
# Pass rate: 87.5% -> 93.8% (+6.3%)
# Latency: 234ms -> 189ms (-45ms)
# Verdict: BETTER
# Regressions: []
# Improvements: [check-fit-unknown]
```

Catch regressions before they ship:
```python
if comparison.regressions:
    print(f"BLOCKED: {len(comparison.regressions)} regressions")
    for case_id in comparison.regressions:
        print(f"  - {case_id} was passing, now failing")
```

Integrates with Arize Phoenix for production tracing:
```python
from eval.phoenix_integration import setup_tracing, trace_agent_call

setup_tracing(project_name="my-agents")

with trace_agent_call("pipeline", "get_opportunity") as span:
    span.set_attribute("input.company", "nuclearn")
    result = await agent.call(...)
    span.set_attribute("output.found", result["found"])
```

Traces flow to Phoenix for:
- Latency analysis
- Error debugging
- LLM-as-Judge evaluations
- Production monitoring
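For runs that don't need Phoenix, the repo also ships a simple file-based tracer (`eval/tracer.py`). Its core idea can be sketched as a context manager that appends one JSON line per agent call — hypothetical code, not the module's actual API:

```python
import json
import time
from contextlib import contextmanager
from pathlib import Path

TRACE_DIR = Path("traces")  # matches the traces/ directory in the repo layout

@contextmanager
def trace_call(agent: str, capability: str, trace_file: str = "trace.jsonl"):
    """Append one JSON line per call: agent, capability, duration, outcome."""
    record = {"agent": agent, "capability": capability}
    start = time.perf_counter()
    try:
        yield record  # caller may attach extra fields, e.g. record["input"] = {...}
        record["status"] = "ok"
    except Exception as exc:
        record["status"] = "error"
        record["error"] = repr(exc)
        raise
    finally:
        record["duration_ms"] = round((time.perf_counter() - start) * 1000, 2)
        TRACE_DIR.mkdir(exist_ok=True)
        with open(TRACE_DIR / trace_file, "a") as f:
            f.write(json.dumps(record) + "\n")
```

JSON-lines output keeps traces greppable and trivially appendable, at the cost of the dashboards Phoenix provides.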
```
agent-comms/
├── agents/                     # Agent implementations
│   └── pipeline/               # Pipeline management agent
├── protocol/                   # A2A communication protocol
│   ├── client.py               # Agent client
│   ├── server.py               # Agent server
│   └── types.py                # Protocol types
├── eval/                       # Evaluation framework
│   ├── runner.py               # Test runner + comparison
│   ├── tracer.py               # Simple file-based tracing
│   └── phoenix_integration.py  # Arize Phoenix integration
├── datasets/                   # YAML test datasets
├── eval_results/               # Saved evaluation runs
├── traces/                     # Execution traces
└── examples/                   # Usage examples
    ├── 01_minimal_agent/
    └── 02_eval_driven/
```
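The regression detection in `eval/runner.py` boils down to set logic over per-case outcomes: a regression is a case that passed in the baseline but fails in the candidate. A hypothetical sketch (`compare_runs` is illustrative, not the real implementation):

```python
def compare_runs(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict:
    """Each dict maps case_id -> passed. Returns regressions, improvements, verdict."""
    # Passed before, fails now -> regression (a ship blocker).
    regressions = sorted(c for c in baseline if baseline[c] and not candidate.get(c, False))
    # Failed before, passes now -> improvement.
    improvements = sorted(c for c in baseline if not baseline[c] and candidate.get(c, False))
    if regressions:
        verdict = "WORSE"
    elif improvements:
        verdict = "BETTER"
    else:
        verdict = "SAME"
    return {"regressions": regressions, "improvements": improvements, "verdict": verdict}
```

Note the asymmetry: any regression makes the verdict WORSE even if other cases improved, which is what lets the gate block a ship unconditionally.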
```shell
# Core
pip install -e .

# With evaluation/observability
pip install -e ".[eval]"

# Development
pip install -e ".[dev]"
```

Run the eval-driven example:

```shell
cd examples/02_eval_driven
python run_eval.py
```

Start Phoenix to browse traces:

```shell
phoenix serve
# Open http://localhost:6006
```

The problem with spot-checking: You run a few examples, they work, you ship. Then edge cases break in production.
The problem with traditional tests: LLM outputs are non-deterministic. Exact string matching doesn't work.
This approach:
- Define expected behaviors (not exact outputs)
- Run against representative datasets
- Compare experiments to baselines
- Block regressions automatically
- Trace everything for debugging
It's TDD for AI systems.
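"Expected behaviors, not exact outputs" is what the `eval_type: keys` check above encodes: an output passes if every expected key is present, regardless of the exact wording inside. A minimal sketch (`eval_keys` is a hypothetical helper, not the framework's actual function):

```python
def eval_keys(output: dict, expected_keys: list[str]) -> tuple[bool, list[str]]:
    """Pass if every expected key is present. Values are not string-matched,
    so non-deterministic LLM wording doesn't cause flaky failures."""
    missing = [k for k in expected_keys if k not in output]
    return (not missing, missing)

# Two differently-worded outputs both pass the same behavioral check:
output = {"summary": "Strong fit overall...", "compensation": "$180k",
          "tech_stack": ["python"], "fit_assessment": "high"}
passed, missing = eval_keys(
    output, ["summary", "compensation", "tech_stack", "fit_assessment"]
)  # passed is True, missing is []
```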
MIT