
Agent Comms

A protocol and evaluation framework for multi-agent systems.

The Problem

Building agentic AI is easy. Knowing if it works is hard.

Without systematic evaluation:

  • You ship based on vibes ("it seems to work")
  • Regressions sneak in with every prompt change
  • You can't compare approaches objectively
  • Production issues surprise you

This repo implements evaluation-first development for LLM agents.

Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Evaluation Pipeline                       │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  TestCase         Dataset          EvalRun        Comparison │
│  ┌─────────┐     ┌─────────┐     ┌─────────┐    ┌─────────┐ │
│  │ input   │     │ cases[] │     │ results │    │ baseline│ │
│  │ expected│ ──▶ │ name    │ ──▶ │ metrics │ ──▶│ vs exp  │ │
│  │ eval_fn │     │ tags    │     │pass/fail│    │ verdict │ │
│  └─────────┘     └─────────┘     └─────────┘    └─────────┘ │
│                                                              │
│  YAML-defined    Batch runner    Aggregates      Regression  │
│  test cases      with tracing    + p95 latency   detection   │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Core Concepts

1. Test Cases as YAML

Define expectations declaratively:

cases:
  - id: get-nuclearn-context
    description: Get full context for opportunity
    agent: pipeline
    capability: get_opportunity_context
    input_data:
      company: nuclearn
    expected_keys:
      - summary
      - compensation
      - tech_stack
      - fit_assessment
    eval_type: keys
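A minimal sketch of how a YAML case like this could map onto the TestCase and Dataset shapes from the architecture diagram. Field names mirror the YAML above; the framework's real classes may differ, and `from_dict` stands in for the actual YAML loader.

```python
# Illustrative dataclasses matching the YAML fields above.
# The real Dataset.from_yaml would parse the file first (e.g. with PyYAML).
from dataclasses import dataclass, field

@dataclass
class TestCase:
    id: str
    agent: str
    capability: str
    input_data: dict
    expected_keys: list
    eval_type: str = "keys"
    description: str = ""

@dataclass
class Dataset:
    name: str
    cases: list = field(default_factory=list)

    @classmethod
    def from_dict(cls, name: str, raw: dict) -> "Dataset":
        # Build one TestCase per entry under the top-level `cases:` key.
        return cls(name=name, cases=[TestCase(**c) for c in raw["cases"]])
```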

2. Evaluation Runner

Run test suites and get metrics:

import asyncio

runner = EvalRunner()

async def main():
    baseline = await runner.run(
        dataset=Dataset.from_yaml("datasets/pipeline-basic.yaml"),
        agent_fn=call_agent,
        experiment="baseline",
    )

asyncio.run(main())

# Results:
# - Pass rate: 87.5%
# - Avg latency: 234ms
# - p95 latency: 512ms
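A case with `eval_type: keys` presumably passes when every expected key is present in the agent's output. A hedged sketch of that check; `check_keys` is an illustrative name, not the framework's actual API:

```python
# "keys" evaluation sketch: pass iff all expected keys appear in the output dict.
def check_keys(output: dict, expected_keys: list) -> tuple:
    """Return (passed, missing_keys)."""
    missing = [k for k in expected_keys if k not in output]
    return (len(missing) == 0, missing)
```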

3. Baseline Comparison

Compare experiments objectively:

comparison = runner.compare("baseline", "v2-with-cache")

print(comparison.summary)
# Pass rate: 87.5% -> 93.8% (+6.3%)
# Latency: 234ms -> 189ms (-45ms)
# Verdict: BETTER
# Regressions: []
# Improvements: [check-fit-unknown]
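The verdict logic can be sketched as follows, assuming each run reduces to a `{case_id: passed}` mapping: a regression is a case that passed in the baseline but fails in the candidate, and an improvement is the reverse. The function name and shapes here are assumptions, not the framework's API.

```python
# Illustrative comparison: classify per-case outcomes, then derive a verdict.
def compare_runs(baseline: dict, candidate: dict):
    regressions = [cid for cid, ok in baseline.items()
                   if ok and not candidate.get(cid, False)]
    improvements = [cid for cid, ok in candidate.items()
                    if ok and not baseline.get(cid, False)]
    if regressions:
        verdict = "WORSE"
    elif improvements:
        verdict = "BETTER"
    else:
        verdict = "SAME"
    return verdict, regressions, improvements
```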

4. Regression Detection

Catch regressions before they ship:

if comparison.regressions:
    print(f"BLOCKED: {len(comparison.regressions)} regressions")
    for case_id in comparison.regressions:
        print(f"  - {case_id} was passing, now failing")
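In CI, this check becomes a gate: exit nonzero when regressions exist so the build fails before shipping. A minimal sketch; the function name `gate` is illustrative.

```python
# CI gate sketch: nonzero exit code blocks the pipeline on any regression.
import sys

def gate(regressions: list) -> int:
    if regressions:
        print(f"BLOCKED: {len(regressions)} regressions")
        return 1
    print("OK: no regressions")
    return 0

if __name__ == "__main__":
    sys.exit(gate([]))  # pass comparison.regressions here in a real script
```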

Observability

Integrates with Arize Phoenix for production tracing:

from eval.phoenix_integration import setup_tracing, trace_agent_call

setup_tracing(project_name="my-agents")

with trace_agent_call("pipeline", "get_opportunity") as span:
    span.set_attribute("input.company", "nuclearn")
    result = await agent.call(...)
    span.set_attribute("output.found", result["found"])

Traces flow to Phoenix for:

  • Latency analysis
  • Error debugging
  • LLM-as-Judge evaluations
  • Production monitoring
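The context-manager shape used above can be illustrated with a simplified stand-in: it records a span name, attributes, and duration. The real `phoenix_integration` module would emit OpenTelemetry spans to Phoenix instead; this sketch only shows the calling convention.

```python
# Simplified stand-in for trace_agent_call: records name, attributes, duration.
import time
from contextlib import contextmanager

class Span:
    def __init__(self, name):
        self.name = name
        self.attributes = {}
        self.duration_ms = None

    def set_attribute(self, key, value):
        self.attributes[key] = value

@contextmanager
def trace_agent_call(agent: str, capability: str):
    span = Span(f"{agent}.{capability}")
    start = time.perf_counter()
    try:
        yield span
    finally:
        # Duration is recorded even if the agent call raises.
        span.duration_ms = (time.perf_counter() - start) * 1000
```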

Project Structure

agent-comms/
├── agents/              # Agent implementations
│   └── pipeline/        # Pipeline management agent
├── protocol/            # A2A communication protocol
│   ├── client.py        # Agent client
│   ├── server.py        # Agent server
│   └── types.py         # Protocol types
├── eval/                # Evaluation framework
│   ├── runner.py        # Test runner + comparison
│   ├── tracer.py        # Simple file-based tracing
│   └── phoenix_integration.py  # Arize Phoenix integration
├── datasets/            # YAML test datasets
├── eval_results/        # Saved evaluation runs
├── traces/              # Execution traces
└── examples/            # Usage examples
    ├── 01_minimal_agent/
    └── 02_eval_driven/

Installation

# Core
pip install -e .

# With evaluation/observability
pip install -e ".[eval]"

# Development
pip install -e ".[dev]"

Usage

Run evaluations

cd examples/02_eval_driven
python run_eval.py

Start Phoenix UI (optional)

phoenix serve
# Open http://localhost:6006

Why This Approach

The problem with spot-checking: You run a few examples, they work, you ship. Then edge cases break in production.

The problem with traditional tests: LLM outputs are non-deterministic. Exact string matching doesn't work.

This approach:

  • Define expected behaviors (not exact outputs)
  • Run against representative datasets
  • Compare experiments to baselines
  • Block regressions automatically
  • Trace everything for debugging

It's TDD for AI systems.

License

MIT
