feat(cli): add qmd bench for search quality benchmarks#470

Open
jmilinovich wants to merge 1 commit into tobi:main from jmilinovich:feat/bench-command

Conversation

@jmilinovich

Summary

  • Adds a qmd bench <fixture.json> command that measures search quality across all four backends (BM25, vector, hybrid no-rerank, full pipeline)
  • Computes precision@k, recall, MRR (Mean Reciprocal Rank), F1, and latency per query per backend
  • Ships with an example fixture against the existing eval-docs test collection
  • Includes 16 unit tests for the scoring functions
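The metrics above are standard IR measures. As a rough sketch of what the scoring does (illustrative names only — not necessarily the actual `src/bench/score.ts` API):

```typescript
// Hypothetical sketch of per-query scoring: precision@k, recall, MRR, F1.
// `returned` is the ranked result list from a backend; `expected` is the
// fixture's expected_files; `k` is the rank cutoff.
interface ScoreResult {
  precision: number;
  recall: number;
  mrr: number;
  f1: number;
}

function scoreResults(returned: string[], expected: string[], k: number): ScoreResult {
  const topK = returned.slice(0, k);
  const hits = topK.filter((p) => expected.includes(p)).length;

  const precision = topK.length > 0 ? hits / topK.length : 0;
  const recall = expected.length > 0 ? hits / expected.length : 0;

  // MRR: reciprocal of the 1-indexed rank of the first expected result, 0 if absent.
  const firstHit = returned.findIndex((p) => expected.includes(p));
  const mrr = firstHit >= 0 ? 1 / (firstHit + 1) : 0;

  // F1: harmonic mean of precision and recall.
  const f1 = precision + recall > 0 ? (2 * precision * recall) / (precision + recall) : 0;

  return { precision, recall, mrr, f1 };
}
```

For example, if a backend returns `["a.md", "b.md", "c.md"]` and the fixture expects `["b.md"]` at k=3, precision@k is 1/3, recall is 1.0, and MRR is 0.5 (first hit at rank 2).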

Motivation

QMD has test/eval-harness.ts for CLI-based eval, but no SDK-based benchmark that compares all backends side-by-side or tracks quality metrics over time. After building a personal benchmark suite for my vault search, Tobi suggested contributing it upstream.

This is primarily a regression testing tool — users create fixture files for their own vaults to catch quality regressions after config changes, reindexing, or model updates. The distinction matters: user-created fixtures measure "does QMD still return what I expect?", not absolute retrieval quality.

Fixture format

{
  "description": "My vault benchmark",
  "version": 1,
  "collection": "my-collection",
  "queries": [
    {
      "id": "exact-keyword",
      "query": "API versioning",
      "type": "exact",
      "description": "Direct keyword match",
      "expected_files": ["api-design-principles.md"],
      "expected_in_top_k": 1
    }
  ]
}
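In TypeScript terms, the fixture shape above corresponds to something like the following (field names taken from the JSON example; the actual `src/bench/types.ts` may differ):

```typescript
// Illustrative types for the fixture format shown above.
interface BenchQuery {
  id: string;
  query: string;
  type: string;               // e.g. "exact"
  description?: string;
  expected_files: string[];   // files that should appear in the results
  expected_in_top_k: number;  // rank cutoff within which they should appear
}

interface BenchFixture {
  description: string;
  version: number;
  collection: string;
  queries: BenchQuery[];
}

// A minimal fixture matching the example:
const fixture: BenchFixture = {
  description: "My vault benchmark",
  version: 1,
  collection: "my-collection",
  queries: [
    {
      id: "exact-keyword",
      query: "API versioning",
      type: "exact",
      description: "Direct keyword match",
      expected_files: ["api-design-principles.md"],
      expected_in_top_k: 1,
    },
  ],
};
```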

Usage

# Human-readable table output
qmd bench src/bench/fixtures/example.json

# JSON output for CI / trend tracking
qmd bench src/bench/fixtures/example.json --json

# Filter to specific collection
qmd bench fixture.json -c my-collection

Output

Query                     Backend  P@k    Recall  MRR    F1       ms
----------------------------------------------------------------------
exact-api                 bm25    1.00    1.00   1.00   1.00       3ms
exact-api                 vector  1.00    1.00   1.00   1.00      24ms
exact-api                 hybrid  1.00    1.00   1.00   1.00      26ms
exact-api                 full    1.00    1.00   1.00   1.00     840ms
...

Summary:
  bm25     P@k= 0.800 Recall= 0.900 MRR= 0.850 F1= 0.844 Avg=4ms
  vector   P@k= 0.700 Recall= 0.800 MRR= 0.750 F1= 0.744 Avg=28ms
  hybrid   P@k= 0.900 Recall= 0.950 MRR= 0.920 F1= 0.924 Avg=30ms
  full     P@k= 0.950 Recall= 0.950 MRR= 0.960 F1= 0.950 Avg=850ms

Test plan

  • npx vitest run test/bench-score.test.ts — 16 unit tests for scoring functions (normalizePath, pathsMatch, scoreResults)
  • Manual: qmd bench src/bench/fixtures/example.json against a local eval-docs collection
  • Manual: qmd bench src/bench/fixtures/example.json --json produces valid JSON
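The path-matching helpers named in the test plan (`normalizePath`, `pathsMatch`) might look roughly like this — a sketch under the assumption that fixtures list bare filenames while backends return full paths:

```typescript
// Illustrative sketch only; the real implementations in src/bench/score.ts may differ.
function normalizePath(p: string): string {
  // Use forward slashes, strip a leading "./", and compare case-insensitively.
  return p.replace(/\\/g, "/").replace(/^\.\//, "").toLowerCase();
}

function pathsMatch(result: string, expected: string): boolean {
  const a = normalizePath(result);
  const b = normalizePath(expected);
  // Exact match, or suffix match so a bare filename in the fixture
  // matches a full path in the results.
  return a === b || a.endsWith("/" + b);
}
```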

🤖 Generated with Claude Code

Adds a benchmark harness that measures search quality across backends.
Given a fixture file with queries and expected results, it runs each
query through BM25, vector, hybrid (no rerank), and full pipeline,
then reports precision@k, recall, MRR, F1, and latency.

This is primarily a regression testing tool — users create fixtures
for their own vaults to catch quality regressions after config or
index changes. Ships with an example fixture against the eval-docs
test collection to demonstrate the format.

New files:
  src/bench/bench.ts       — main runner
  src/bench/score.ts       — precision, recall, MRR, F1, path matching
  src/bench/types.ts       — fixture and result types
  src/bench/fixtures/      — example fixture
  test/bench-score.test.ts — unit tests for scoring (16 tests)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
