Labels
- `area/context` (Context engine: manager, pipeline, firewall)
- `area/eval` (Evaluation, benchmarking, quality measurement)
- `area/routing` (Routing engine: catalog, graph, router, cards)
- `complexity/average` (Standard effort, moderate familiarity needed)
- `enhancement` (New feature or request)
- `milestone/v0.5` (v0.5 Advanced routing & scale)
- `priority/high` (High priority — closes a critical gap)
Description
Problem
There is no way to measure routing quality or context pipeline efficiency. When changes are made to the router, tree builder, scoring logic, or context pipeline, there is no automated way to detect regressions or validate improvements. Most agent libraries similarly lack this — building one would be a genuine differentiator.
Without this:
- No data for README claims ("500→2K tokens saved")
- No regression detection when pipeline logic changes
- No comparison baseline for evaluating config changes (beam width, dedup threshold, budget)
Proposal
Create a benchmark harness covering both routing accuracy and context pipeline efficiency with proxy metrics (no LLM calls needed):
1. Gold-standard dataset (benchmarks/routing_gold.json)
100 queries mapped to correct tool IDs, built from the existing 83-item sample catalog:
```json
[
  {"query": "send a reminder email", "expected": ["email_send", "email_draft"], "tags": ["email"]},
  {"query": "query the database", "expected": ["db_query", "db_schema"], "tags": ["database"]},
  ...
]
```
2. Benchmark script (benchmarks/benchmark.py)
Routing metrics:
- Precision@k — fraction of top-k results that are correct
- Recall@k — fraction of correct results found in top-k
- MRR (Mean Reciprocal Rank) — how high the first correct result ranks
- Latency — p50/p95/p99 route times
- tools_exposed (top-k count), beam_paths_explored
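The routing metrics above are straightforward to compute from ranked results. A minimal sketch, assuming each benchmark case is a pair of (ranked tool IDs returned by the router, expected tool IDs from the gold file) — function names here are illustrative, not a committed API:

```python
def precision_at_k(ranked, expected, k):
    # Fraction of the top-k ranked IDs that are in the expected set.
    top = ranked[:k]
    return sum(1 for t in top if t in expected) / k

def recall_at_k(ranked, expected, k):
    # Fraction of expected IDs that appear in the top-k.
    top = ranked[:k]
    return sum(1 for e in expected if e in top) / len(expected)

def reciprocal_rank(ranked, expected):
    # 1/rank of the first correct result; 0 if none found.
    for rank, t in enumerate(ranked, start=1):
        if t in expected:
            return 1.0 / rank
    return 0.0

def mrr(cases):
    # Mean reciprocal rank over all (ranked, expected) cases.
    return sum(reciprocal_rank(r, e) for r, e in cases) / len(cases)
```

For example, if the router returns `["email_send", "db_query", "email_draft"]` for a gold entry expecting `["email_send", "email_draft"]`, precision@2 and recall@2 are both 0.5 and the reciprocal rank is 1.0.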
Context pipeline metrics:
- prompt_tokens and budget_utilization_pct
- items_included / dropped / dedup_removed
- artifacts_created and avg_compaction_ratio (raw_size / summary_size)
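The context-pipeline proxies are simple ratios over counters the pipeline already tracks. A sketch, where the inputs (`prompt_tokens`, `budget`, per-artifact `raw_size`/`summary_size` pairs) are assumed names, not the pipeline's actual fields:

```python
def budget_utilization_pct(prompt_tokens, budget):
    # Percentage of the token budget consumed by the assembled prompt.
    return 100.0 * prompt_tokens / budget

def avg_compaction_ratio(artifacts):
    # artifacts: list of (raw_size, summary_size) pairs for created artifacts.
    # Ratio > 1 means summarization shrank the content.
    ratios = [raw / summary for raw, summary in artifacts if summary]
    return sum(ratios) / len(ratios) if ratios else 0.0
```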
Tests 3 catalog sizes (50, 200, 1000 items via generate_sample_catalog).
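The p50/p95/p99 latency figures can be derived from per-query route timings collected during these runs using only the standard library (keeping the no-new-dependencies constraint). A sketch, assuming timings are gathered as a flat list in milliseconds:

```python
import statistics

def latency_percentiles(latencies_ms):
    # statistics.quantiles with n=100 yields the 99 percentile cut points;
    # index i-1 is the i-th percentile.
    qs = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}
```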
3. Scenario files (benchmarks/scenarios/)
- short_conversation.jsonl (5 turns, 3 tools)
- long_conversation.jsonl (50 turns, 10 tools, 2 large results)
- large_catalog.jsonl (100-tool catalog, routing-heavy)
4. CI integration (non-gating)
- Add make benchmark target
- Run on CI as an informational step (does not gate merges)
- Output results as JSON for tracking over time
- Deterministic: same input → same output (seeded)
- Runs in <10s without LLM calls
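For the deterministic JSON output, seeded generation plus stable serialization is enough to make identical runs produce byte-identical results files, which keeps diffs over time meaningful. A sketch (the results dict shape is illustrative, not a fixed schema):

```python
import json

def write_results(results, path):
    # sort_keys makes the output independent of dict insertion order,
    # so reruns with the same seed produce byte-identical files.
    with open(path, "w") as f:
        json.dump(results, f, indent=2, sort_keys=True)
        f.write("\n")
```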
Acceptance Criteria
- [ ] `benchmarks/routing_gold.json` with ≥50 queries and expected results
- [ ] `benchmarks/benchmark.py` computes routing metrics (precision@k, recall@k, MRR, latency)
- [ ] `benchmarks/benchmark.py` computes context metrics (prompt_tokens, budget_utilization, dedup, compaction)
- [ ] Tests 3 catalog sizes (50, 200, 1000 items)
- [ ] 3 scenario files in `benchmarks/scenarios/`
- [ ] `make benchmark` runs the benchmark and prints results
- [ ] Outputs JSON results to `benchmarks/results/latest.json`
- [ ] Outputs human-readable summary table to stdout
- [ ] Baseline metrics established and documented in `benchmarks/README.md`
- [ ] CI workflow runs benchmark as non-gating informational step
- [ ] Deterministic (seeded generation, sorted output)
- [ ] No new runtime dependencies
File Paths
- `benchmarks/routing_gold.json` (new)
- `benchmarks/benchmark.py` (new)
- `benchmarks/scenarios/short_conversation.jsonl` (new)
- `benchmarks/scenarios/long_conversation.jsonl` (new)
- `benchmarks/scenarios/large_catalog.jsonl` (new)
- `benchmarks/README.md` (new)
- `benchmarks/results/.gitkeep` (new, results git-ignored)
- `Makefile` (edit — add benchmark target)
- `.github/workflows/ci.yml` (edit — add non-gating benchmark step)
- `.gitignore` (edit — add `benchmarks/results/*.json`)
Cross-references
- Related to [eval] Add routing and context evaluation harness with gold-standard datasets #12 (full eval harness with EvalDataset API, CLI subcommand — builds on this)
- Consolidates [eval] Proxy-metric benchmark harness for routing + context pipeline #88 (proxy-metric benchmark harness)
- Related to [epic] Interop / Policy-Engine positioning — direction + milestone mapping #86 (Interop/Policy-Engine epic)