feat(cli): add qmd bench for search quality benchmarks#470

Open
jmilinovich wants to merge 1 commit into tobi:main from jmilinovich:feat/bench-command

Conversation

@jmilinovich

Summary

  • Adds a qmd bench <fixture.json> command that measures search quality across all four backends (BM25, vector, hybrid no-rerank, full pipeline)
  • Computes precision@k, recall, MRR (Mean Reciprocal Rank), F1, and latency per query per backend
  • Ships with an example fixture against the existing eval-docs test collection
  • Includes 16 unit tests for the scoring functions
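The metrics above are standard IR measures. As a rough sketch of what the scoring does (illustrative names only — not necessarily the actual `src/bench/score.ts` API):

```typescript
// Hypothetical sketch of per-query scoring: precision@k, recall, MRR, F1.
// `returned` is the ranked result list from a backend; `expected` is the
// fixture's expected_files; `k` is the rank cutoff.
interface ScoreResult {
  precision: number;
  recall: number;
  mrr: number;
  f1: number;
}

function scoreResults(returned: string[], expected: string[], k: number): ScoreResult {
  const topK = returned.slice(0, k);
  const hits = topK.filter((p) => expected.includes(p)).length;

  const precision = topK.length > 0 ? hits / topK.length : 0;
  const recall = expected.length > 0 ? hits / expected.length : 0;

  // MRR: reciprocal of the 1-indexed rank of the first expected result, 0 if absent.
  const firstHit = returned.findIndex((p) => expected.includes(p));
  const mrr = firstHit >= 0 ? 1 / (firstHit + 1) : 0;

  // F1: harmonic mean of precision and recall.
  const f1 = precision + recall > 0 ? (2 * precision * recall) / (precision + recall) : 0;

  return { precision, recall, mrr, f1 };
}
```

For example, if a backend returns `["a.md", "b.md", "c.md"]` and the fixture expects `["b.md"]` at k=3, precision@k is 1/3, recall is 1.0, and MRR is 0.5 (first hit at rank 2).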

Motivation

QMD has test/eval-harness.ts for CLI-based eval, but no SDK-based benchmark that compares all backends side-by-side or tracks quality metrics over time. After building a personal benchmark suite for my vault search, Tobi suggested contributing it upstream.

This is primarily a regression testing tool — users create fixture files for their own vaults to catch quality regressions after config changes, reindexing, or model updates. The distinction matters: user-created fixtures measure "does QMD still return what I expect?", not absolute retrieval quality.

Fixture format

{
  "description": "My vault benchmark",
  "version": 1,
  "collection": "my-collection",
  "queries": [
    {
      "id": "exact-keyword",
      "query": "API versioning",
      "type": "exact",
      "description": "Direct keyword match",
      "expected_files": ["api-design-principles.md"],
      "expected_in_top_k": 1
    }
  ]
}
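In TypeScript terms, the fixture shape above corresponds to something like the following (field names taken from the JSON example; the actual `src/bench/types.ts` may differ):

```typescript
// Illustrative types for the fixture format shown above.
interface BenchQuery {
  id: string;
  query: string;
  type: string;               // e.g. "exact"
  description?: string;
  expected_files: string[];   // files that should appear in the results
  expected_in_top_k: number;  // rank cutoff within which they should appear
}

interface BenchFixture {
  description: string;
  version: number;
  collection: string;
  queries: BenchQuery[];
}

// A minimal fixture matching the example:
const fixture: BenchFixture = {
  description: "My vault benchmark",
  version: 1,
  collection: "my-collection",
  queries: [
    {
      id: "exact-keyword",
      query: "API versioning",
      type: "exact",
      description: "Direct keyword match",
      expected_files: ["api-design-principles.md"],
      expected_in_top_k: 1,
    },
  ],
};
```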

Usage

# Human-readable table output
qmd bench src/bench/fixtures/example.json

# JSON output for CI / trend tracking
qmd bench src/bench/fixtures/example.json --json

# Filter to specific collection
qmd bench fixture.json -c my-collection

Output

Query                     Backend  P@k    Recall  MRR    F1       ms
----------------------------------------------------------------------
exact-api                 bm25    1.00    1.00   1.00   1.00       3ms
exact-api                 vector  1.00    1.00   1.00   1.00      24ms
exact-api                 hybrid  1.00    1.00   1.00   1.00      26ms
exact-api                 full    1.00    1.00   1.00   1.00     840ms
...

Summary:
  bm25     P@k= 0.800 Recall= 0.900 MRR= 0.850 F1= 0.844 Avg=4ms
  vector   P@k= 0.700 Recall= 0.800 MRR= 0.750 F1= 0.744 Avg=28ms
  hybrid   P@k= 0.900 Recall= 0.950 MRR= 0.920 F1= 0.924 Avg=30ms
  full     P@k= 0.950 Recall= 0.950 MRR= 0.960 F1= 0.950 Avg=850ms

Test plan

  • npx vitest run test/bench-score.test.ts — 16 unit tests for scoring functions (normalizePath, pathsMatch, scoreResults)
  • Manual: qmd bench src/bench/fixtures/example.json against a local eval-docs collection
  • Manual: qmd bench src/bench/fixtures/example.json --json produces valid JSON
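The path-matching helpers named in the test plan (`normalizePath`, `pathsMatch`) might look roughly like this — a sketch under the assumption that fixtures list bare filenames while backends return full paths:

```typescript
// Illustrative sketch only; the real implementations in src/bench/score.ts may differ.
function normalizePath(p: string): string {
  // Use forward slashes, strip a leading "./", and compare case-insensitively.
  return p.replace(/\\/g, "/").replace(/^\.\//, "").toLowerCase();
}

function pathsMatch(result: string, expected: string): boolean {
  const a = normalizePath(result);
  const b = normalizePath(expected);
  // Exact match, or suffix match so a bare filename in the fixture
  // matches a full path in the results.
  return a === b || a.endsWith("/" + b);
}
```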

🤖 Generated with Claude Code

Adds a benchmark harness that measures search quality across backends.
Given a fixture file with queries and expected results, it runs each
query through BM25, vector, hybrid (no rerank), and full pipeline,
then reports precision@k, recall, MRR, F1, and latency.

This is primarily a regression testing tool — users create fixtures
for their own vaults to catch quality regressions after config or
index changes. Ships with an example fixture against the eval-docs
test collection to demonstrate the format.

New files:
  src/bench/bench.ts       — main runner
  src/bench/score.ts       — precision, recall, MRR, F1, path matching
  src/bench/types.ts       — fixture and result types
  src/bench/fixtures/      — example fixture
  test/bench-score.test.ts — unit tests for scoring (16 tests)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
