A protocol and evaluation framework for multi-agent systems.
Building agentic AI is easy. Knowing if it works is hard.
Without systematic evaluation:
- You ship based on vibes ("it seems to work")
- Regressions sneak in with every prompt change
- You can't compare approaches objectively
- Production issues surprise you
This repo implements evaluation-first development for LLM agents.
```
┌───────────────────────────────────────────────────────────────┐
│                      Evaluation Pipeline                      │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│   TestCase         Dataset          EvalRun       Comparison  │
│  ┌─────────┐     ┌─────────┐     ┌─────────┐     ┌─────────┐  │
│  │ input   │     │ cases[] │     │ results │     │ baseline│  │
│  │ expected│ ──▶ │ name    │ ──▶ │ metrics │ ──▶ │ vs exp  │  │
│  │ eval_fn │     │ tags    │     │pass/fail│     │ verdict │  │
│  └─────────┘     └─────────┘     └─────────┘     └─────────┘  │
│                                                               │
│  YAML-defined    Batch runner    Aggregates      Regression   │
│  test cases      with tracing    + p95 latency   detection    │
│                                                               │
└───────────────────────────────────────────────────────────────┘
```
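The four structures in the diagram can be pictured as plain dataclasses. The following is a hypothetical sketch of their shapes, including how `EvalRun` might aggregate pass rate and p95 latency — the real definitions live in `eval/runner.py` and may differ:

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical shapes mirroring the diagram; not the actual framework classes.
@dataclass
class TestCase:
    id: str
    input: dict[str, Any]
    expected: dict[str, Any]
    eval_type: str = "keys"  # how the output is judged

@dataclass
class Dataset:
    name: str
    cases: list[TestCase] = field(default_factory=list)
    tags: list[str] = field(default_factory=list)

@dataclass
class CaseResult:
    case_id: str
    passed: bool
    latency_ms: float

@dataclass
class EvalRun:
    experiment: str
    results: list[CaseResult] = field(default_factory=list)

    @property
    def pass_rate(self) -> float:
        return sum(r.passed for r in self.results) / len(self.results)

    @property
    def p95_latency_ms(self) -> float:
        # Nearest-rank 95th percentile over observed latencies.
        latencies = sorted(r.latency_ms for r in self.results)
        return latencies[int(0.95 * (len(latencies) - 1))]
```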
Define expectations declaratively:

```yaml
cases:
  - id: get-nuclearn-context
    description: Get full context for opportunity
    agent: pipeline
    capability: get_opportunity_context
    input_data:
      company: nuclearn
    expected_keys:
      - summary
      - compensation
      - tech_stack
      - fit_assessment
    eval_type: keys
```

Run test suites and get metrics:
```python
runner = EvalRunner()
baseline = await runner.run(
    dataset=Dataset.from_yaml("datasets/pipeline-basic.yaml"),
    agent_fn=call_agent,
    experiment="baseline",
)
# Results:
# - Pass rate: 87.5%
# - Avg latency: 234ms
# - p95 latency: 512ms
```

Compare experiments objectively:
```python
comparison = runner.compare("baseline", "v2-with-cache")
print(comparison.summary)
# Pass rate: 87.5% -> 93.8% (+6.3%)
# Latency: 234ms -> 189ms (-45ms)
# Verdict: BETTER
# Regressions: []
# Improvements: [check-fit-unknown]
```

Catch regressions before they ship:
```python
if comparison.regressions:
    print(f"BLOCKED: {len(comparison.regressions)} regressions")
    for case_id in comparison.regressions:
        print(f"  - {case_id} was passing, now failing")
```

Integrates with Arize Phoenix for production tracing:
```python
from eval.phoenix_integration import setup_tracing, trace_agent_call

setup_tracing(project_name="my-agents")

with trace_agent_call("pipeline", "get_opportunity") as span:
    span.set_attribute("input.company", "nuclearn")
    result = await agent.call(...)
    span.set_attribute("output.found", result["found"])
```

Traces flow to Phoenix for:
- Latency analysis
- Error debugging
- LLM-as-Judge evaluations
- Production monitoring
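For runs that don't need Phoenix, the repo also ships a simple file-based tracer (`eval/tracer.py`). Its core idea can be sketched as a context manager that appends one JSON line per agent call — hypothetical code, not the module's actual API:

```python
import json
import time
from contextlib import contextmanager
from pathlib import Path

TRACE_DIR = Path("traces")  # matches the traces/ directory in the repo layout

@contextmanager
def trace_call(agent: str, capability: str, trace_file: str = "trace.jsonl"):
    """Append one JSON line per call: agent, capability, duration, outcome."""
    record = {"agent": agent, "capability": capability}
    start = time.perf_counter()
    try:
        yield record  # caller may attach extra fields, e.g. record["input"] = {...}
        record["status"] = "ok"
    except Exception as exc:
        record["status"] = "error"
        record["error"] = repr(exc)
        raise
    finally:
        record["duration_ms"] = round((time.perf_counter() - start) * 1000, 2)
        TRACE_DIR.mkdir(exist_ok=True)
        with open(TRACE_DIR / trace_file, "a") as f:
            f.write(json.dumps(record) + "\n")
```

JSON-lines output keeps traces greppable and trivially appendable, at the cost of the dashboards Phoenix provides.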
```
agent-comms/
├── agents/                     # Agent implementations
│   └── pipeline/               # Pipeline management agent
├── protocol/                   # A2A communication protocol
│   ├── client.py               # Agent client
│   ├── server.py               # Agent server
│   └── types.py                # Protocol types
├── eval/                       # Evaluation framework
│   ├── runner.py               # Test runner + comparison
│   ├── tracer.py               # Simple file-based tracing
│   └── phoenix_integration.py  # Arize Phoenix integration
├── datasets/                   # YAML test datasets
├── eval_results/               # Saved evaluation runs
├── traces/                     # Execution traces
└── examples/                   # Usage examples
    ├── 01_minimal_agent/
    └── 02_eval_driven/
```
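The regression detection in `eval/runner.py` boils down to set logic over per-case outcomes: a regression is a case that passed in the baseline but fails in the candidate. A hypothetical sketch (`compare_runs` is illustrative, not the real implementation):

```python
def compare_runs(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict:
    """Each dict maps case_id -> passed. Returns regressions, improvements, verdict."""
    # Passed before, fails now -> regression (a ship blocker).
    regressions = sorted(c for c in baseline if baseline[c] and not candidate.get(c, False))
    # Failed before, passes now -> improvement.
    improvements = sorted(c for c in baseline if not baseline[c] and candidate.get(c, False))
    if regressions:
        verdict = "WORSE"
    elif improvements:
        verdict = "BETTER"
    else:
        verdict = "SAME"
    return {"regressions": regressions, "improvements": improvements, "verdict": verdict}
```

Note the asymmetry: any regression makes the verdict WORSE even if other cases improved, which is what lets the gate block a ship unconditionally.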
```shell
# Core
pip install -e .

# With evaluation/observability
pip install -e ".[eval]"

# Development
pip install -e ".[dev]"
```

Run the eval-driven example:

```shell
cd examples/02_eval_driven
python run_eval.py
```

Start Phoenix to browse traces:

```shell
phoenix serve
# Open http://localhost:6006
```

The problem with spot-checking: You run a few examples, they work, you ship. Then edge cases break in production.
The problem with traditional tests: LLM outputs are non-deterministic. Exact string matching doesn't work.
This approach:
- Define expected behaviors (not exact outputs)
- Run against representative datasets
- Compare experiments to baselines
- Block regressions automatically
- Trace everything for debugging
It's TDD for AI systems.
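"Expected behaviors, not exact outputs" is what the `eval_type: keys` check above encodes: an output passes if every expected key is present, regardless of the exact wording inside. A minimal sketch (`eval_keys` is a hypothetical helper, not the framework's actual function):

```python
def eval_keys(output: dict, expected_keys: list[str]) -> tuple[bool, list[str]]:
    """Pass if every expected key is present. Values are not string-matched,
    so non-deterministic LLM wording doesn't cause flaky failures."""
    missing = [k for k in expected_keys if k not in output]
    return (not missing, missing)

# Two differently-worded outputs both pass the same behavioral check:
output = {"summary": "Strong fit overall...", "compensation": "$180k",
          "tech_stack": ["python"], "fit_assessment": "high"}
passed, missing = eval_keys(
    output, ["summary", "compensation", "tech_stack", "fit_assessment"]
)  # passed is True, missing is []
```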
MIT