[eval] Routing & context benchmark harness with gold-standard dataset #119

@dgenio

Description

Problem

There is currently no way to measure routing quality or context pipeline efficiency. When the router, tree builder, scoring logic, or context pipeline changes, there is no automated way to detect regressions or validate improvements. Most agent libraries lack this as well, so building one would be a genuine differentiator.

Without this:

  • No data for README claims ("500→2K tokens saved")
  • No regression detection when pipeline logic changes
  • No comparison baseline for evaluating config changes (beam width, dedup threshold, budget)

Proposal

Create a benchmark harness covering both routing accuracy and context pipeline efficiency with proxy metrics (no LLM calls needed):

1. Gold-standard dataset (benchmarks/routing_gold.json)

100 queries mapped to correct tool IDs, built from the existing 83-item sample catalog:

[
  {"query": "send a reminder email", "expected": ["email_send", "email_draft"], "tags": ["email"]},
  {"query": "query the database", "expected": ["db_query", "db_schema"], "tags": ["database"]},
  ...
]
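
A minimal sketch of loading and sanity-checking the gold dataset; `load_gold` is an illustrative helper name, not part of the proposal, and the schema checks mirror the example records above:

```python
import json

def load_gold(path="benchmarks/routing_gold.json"):
    """Load the gold dataset and validate the expected record shape."""
    with open(path) as f:
        records = json.load(f)
    for rec in records:
        # Each record: a non-empty query, at least one expected tool ID,
        # and an optional list of tags.
        assert isinstance(rec["query"], str) and rec["query"]
        assert isinstance(rec["expected"], list) and rec["expected"]
        assert isinstance(rec.get("tags", []), list)
    return records
```

Validating at load time means a malformed entry fails fast rather than silently skewing the metrics.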

2. Benchmark script (benchmarks/benchmark.py)

Routing metrics:

  • Precision@k — fraction of top-k results that are correct
  • Recall@k — fraction of correct results found in top-k
  • MRR (Mean Reciprocal Rank) — how high the first correct result ranks
  • Latency — p50/p95/p99 route times
  • tools_exposed (top-k count), beam_paths_explored
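
The routing metrics above reduce to a few lines each; this is a sketch of how `benchmark.py` might compute them (function names are illustrative), where `ranked` is the router's ordered list of tool IDs and `expected` comes from the gold dataset:

```python
def precision_at_k(ranked, expected, k):
    """Fraction of the top-k ranked tool IDs that are correct."""
    correct = set(expected)
    return sum(1 for t in ranked[:k] if t in correct) / k

def recall_at_k(ranked, expected, k):
    """Fraction of correct tool IDs that appear in the top-k."""
    top = set(ranked[:k])
    return sum(1 for t in expected if t in top) / len(expected)

def reciprocal_rank(ranked, expected):
    """1/rank of the first correct result; 0.0 if none is found.
    MRR is the mean of this value over all gold queries."""
    correct = set(expected)
    for rank, t in enumerate(ranked, start=1):
        if t in correct:
            return 1.0 / rank
    return 0.0
```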

Context pipeline metrics:

  • prompt_tokens and budget_utilization_pct
  • items_included / dropped / dedup_removed
  • artifacts_created and avg_compaction_ratio (raw_size / summary_size)
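
The two derived context metrics can be sketched directly from their definitions above (helper names are illustrative; `artifacts` is assumed to carry `raw_size`/`summary_size` fields):

```python
def budget_utilization_pct(prompt_tokens, budget):
    """Percentage of the token budget consumed by the assembled prompt."""
    return 100.0 * prompt_tokens / budget

def avg_compaction_ratio(artifacts):
    """Mean raw_size / summary_size across artifacts; higher = more compaction."""
    ratios = [a["raw_size"] / a["summary_size"]
              for a in artifacts if a["summary_size"]]
    return sum(ratios) / len(ratios) if ratios else 0.0
```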

The script tests 3 catalog sizes (50, 200, and 1000 items, generated via generate_sample_catalog).

3. Scenario files (benchmarks/scenarios/)

  • short_conversation.jsonl (5 turns, 3 tools)
  • long_conversation.jsonl (50 turns, 10 tools, 2 large results)
  • large_catalog.jsonl (100-tool catalog, routing-heavy)
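
For illustration, a single scenario turn in one of these JSONL files might look like the following (field names are an assumption, not the finalized schema):

```json
{"turn": 1, "role": "user", "content": "find last week's invoices", "tools_available": ["db_query", "email_search", "file_read"]}
```

One JSON object per line keeps scenarios streamable and easy to diff.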

4. CI integration (non-gating)

  • Add make benchmark target
  • Run on CI as an informational step (does not gate merges)
  • Output results as JSON for tracking over time
  • Deterministic: same input → same output (seeded)
  • Runs in <10s without LLM calls
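
The `make benchmark` target could be as simple as the fragment below; the `--seed` and `--output` flags are assumptions about `benchmark.py`'s CLI, not an existing interface:

```make
# Illustrative target; seeding keeps runs deterministic, JSON output feeds CI tracking.
benchmark:
	python benchmarks/benchmark.py --seed 42 --output benchmarks/results/latest.json
```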

Acceptance Criteria

  • benchmarks/routing_gold.json with ≥50 queries and expected results
  • benchmarks/benchmark.py computes routing metrics (precision@k, recall@k, MRR, latency)
  • benchmarks/benchmark.py computes context metrics (prompt_tokens, budget_utilization, dedup, compaction)
  • Tests 3 catalog sizes (50, 200, 1000 items)
  • 3 scenario files in benchmarks/scenarios/
  • make benchmark runs the benchmark and prints results
  • Outputs JSON results to benchmarks/results/latest.json
  • Outputs human-readable summary table to stdout
  • Baseline metrics established and documented in benchmarks/README.md
  • CI workflow runs benchmark as non-gating informational step
  • Deterministic (seeded generation, sorted output)
  • No new runtime dependencies
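
The determinism criterion is mostly about serialization: one way to satisfy it (a sketch; `write_results` is a hypothetical helper) is to sort keys so identical metrics always produce byte-identical `latest.json`:

```python
import json

def write_results(metrics, path="benchmarks/results/latest.json"):
    """Serialize with sorted keys so identical runs are byte-identical."""
    with open(path, "w") as f:
        json.dump(metrics, f, indent=2, sort_keys=True)
        f.write("\n")
```

Byte-identical output makes regressions show up as a plain diff against the previous run.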

File Paths

  • benchmarks/routing_gold.json (new)
  • benchmarks/benchmark.py (new)
  • benchmarks/scenarios/short_conversation.jsonl (new)
  • benchmarks/scenarios/long_conversation.jsonl (new)
  • benchmarks/scenarios/large_catalog.jsonl (new)
  • benchmarks/README.md (new)
  • benchmarks/results/.gitkeep (new, results git-ignored)
  • Makefile (edit — add benchmark target)
  • .github/workflows/ci.yml (edit — add non-gating benchmark step)
  • .gitignore (edit — add benchmarks/results/*.json)

Labels

  • area/context (Context engine: manager, pipeline, firewall)
  • area/eval (Evaluation, benchmarking, quality measurement)
  • area/routing (Routing engine: catalog, graph, router, cards)
  • complexity/average (Standard effort, moderate familiarity needed)
  • enhancement (New feature or request)
  • milestone/v0.5 (v0.5 Advanced routing & scale)
  • priority/high (High priority, closes a critical gap)
