Labels
- `area/context` (Context engine: manager, pipeline, firewall)
- `area/eval` (Evaluation, benchmarking, quality measurement)
- `area/routing` (Routing engine: catalog, graph, router, cards)
- `complexity/average` (Standard effort, moderate familiarity needed)
- `enhancement` (New feature or request)
- `milestone/v0.5` (v0.5 Advanced routing & scale)
- `priority/high` (High priority — closes a critical gap)
Description
Problem
There is no way to measure routing quality or context pipeline efficiency. When changes are made to the router, tree builder, scoring logic, or context pipeline, there is no automated way to detect regressions or validate improvements. Most agent libraries similarly lack this — building one would be a genuine differentiator.
Without this:
- No data for README claims ("500→2K tokens saved")
- No regression detection when pipeline logic changes
- No comparison baseline for evaluating config changes (beam width, dedup threshold, budget)
Proposal
Create a benchmark harness covering both routing accuracy and context pipeline efficiency with proxy metrics (no LLM calls needed):
1. Gold-standard dataset (benchmarks/routing_gold.json)
100 queries mapped to correct tool IDs, built from the existing 83-item sample catalog:
```json
[
  {"query": "send a reminder email", "expected": ["email_send", "email_draft"], "tags": ["email"]},
  {"query": "query the database", "expected": ["db_query", "db_schema"], "tags": ["database"]},
  ...
]
```
2. Benchmark script (benchmarks/benchmark.py)
Routing metrics:
- Precision@k — fraction of top-k results that are correct
- Recall@k — fraction of correct results found in top-k
- MRR (Mean Reciprocal Rank) — how high the first correct result ranks
- Latency — p50/p95/p99 route times
- tools_exposed (top-k count), beam_paths_explored
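The routing metrics above are straightforward to compute from ranked results. A minimal sketch, assuming each benchmark case is a pair of (ranked tool IDs returned by the router, expected tool IDs from the gold file) — function names here are illustrative, not a committed API:

```python
def precision_at_k(ranked, expected, k):
    # Fraction of the top-k ranked IDs that are in the expected set.
    top = ranked[:k]
    return sum(1 for t in top if t in expected) / k

def recall_at_k(ranked, expected, k):
    # Fraction of expected IDs that appear in the top-k.
    top = ranked[:k]
    return sum(1 for e in expected if e in top) / len(expected)

def reciprocal_rank(ranked, expected):
    # 1/rank of the first correct result; 0 if none found.
    for rank, t in enumerate(ranked, start=1):
        if t in expected:
            return 1.0 / rank
    return 0.0

def mrr(cases):
    # Mean reciprocal rank over all (ranked, expected) cases.
    return sum(reciprocal_rank(r, e) for r, e in cases) / len(cases)
```

For example, if the router returns `["email_send", "db_query", "email_draft"]` for a gold entry expecting `["email_send", "email_draft"]`, precision@2 and recall@2 are both 0.5 and the reciprocal rank is 1.0.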
Context pipeline metrics:
- prompt_tokens and budget_utilization_pct
- items_included / dropped / dedup_removed
- artifacts_created and avg_compaction_ratio (raw_size / summary_size)
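The context-pipeline proxies are simple ratios over counters the pipeline already tracks. A sketch, where the inputs (`prompt_tokens`, `budget`, per-artifact `raw_size`/`summary_size` pairs) are assumed names, not the pipeline's actual fields:

```python
def budget_utilization_pct(prompt_tokens, budget):
    # Percentage of the token budget consumed by the assembled prompt.
    return 100.0 * prompt_tokens / budget

def avg_compaction_ratio(artifacts):
    # artifacts: list of (raw_size, summary_size) pairs for created artifacts.
    # Ratio > 1 means summarization shrank the content.
    ratios = [raw / summary for raw, summary in artifacts if summary]
    return sum(ratios) / len(ratios) if ratios else 0.0
```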
Tests 3 catalog sizes (50, 200, 1000 items via generate_sample_catalog).
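The p50/p95/p99 latency figures can be derived from per-query route timings collected during these runs using only the standard library (keeping the no-new-dependencies constraint). A sketch, assuming timings are gathered as a flat list in milliseconds:

```python
import statistics

def latency_percentiles(latencies_ms):
    # statistics.quantiles with n=100 yields the 99 percentile cut points;
    # index i-1 is the i-th percentile.
    qs = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}
```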
3. Scenario files (benchmarks/scenarios/)
- short_conversation.jsonl (5 turns, 3 tools)
- long_conversation.jsonl (50 turns, 10 tools, 2 large results)
- large_catalog.jsonl (100-tool catalog, routing-heavy)
4. CI integration (non-gating)
- Add make benchmark target
- Run on CI as an informational step (does not gate merges)
- Output results as JSON for tracking over time
- Deterministic: same input → same output (seeded)
- Runs in <10s without LLM calls
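For the deterministic JSON output, seeded generation plus stable serialization is enough to make identical runs produce byte-identical results files, which keeps diffs over time meaningful. A sketch (the results dict shape is illustrative, not a fixed schema):

```python
import json

def write_results(results, path):
    # sort_keys makes the output independent of dict insertion order,
    # so reruns with the same seed produce byte-identical files.
    with open(path, "w") as f:
        json.dump(results, f, indent=2, sort_keys=True)
        f.write("\n")
```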
Acceptance Criteria
- [ ] `benchmarks/routing_gold.json` with ≥50 queries and expected results
- [ ] `benchmarks/benchmark.py` computes routing metrics (precision@k, recall@k, MRR, latency)
- [ ] `benchmarks/benchmark.py` computes context metrics (prompt_tokens, budget_utilization, dedup, compaction)
- [ ] Tests 3 catalog sizes (50, 200, 1000 items)
- [ ] 3 scenario files in `benchmarks/scenarios/`
- [ ] `make benchmark` runs the benchmark and prints results
- [ ] Outputs JSON results to `benchmarks/results/latest.json`
- [ ] Outputs human-readable summary table to stdout
- [ ] Baseline metrics established and documented in `benchmarks/README.md`
- [ ] CI workflow runs benchmark as non-gating informational step
- [ ] Deterministic (seeded generation, sorted output)
- [ ] No new runtime dependencies
File Paths
- `benchmarks/routing_gold.json` (new)
- `benchmarks/benchmark.py` (new)
- `benchmarks/scenarios/short_conversation.jsonl` (new)
- `benchmarks/scenarios/long_conversation.jsonl` (new)
- `benchmarks/scenarios/large_catalog.jsonl` (new)
- `benchmarks/README.md` (new)
- `benchmarks/results/.gitkeep` (new, results git-ignored)
- `Makefile` (edit — add benchmark target)
- `.github/workflows/ci.yml` (edit — add non-gating benchmark step)
- `.gitignore` (edit — add `benchmarks/results/*.json`)
Cross-references
- Related to [eval] Add routing and context evaluation harness with gold-standard datasets #12 (full eval harness with EvalDataset API, CLI subcommand — builds on this)
- Consolidates [eval] Proxy-metric benchmark harness for routing + context pipeline #88 (proxy-metric benchmark harness)
- Related to [epic] Interop / Policy-Engine positioning — direction + milestone mapping #86 (Interop/Policy-Engine epic)