feat(cli): add qmd bench for search quality benchmarks #470
Open
jmilinovich wants to merge 1 commit into tobi:main
Adds a benchmark harness that measures search quality across backends. Given a fixture file with queries and expected results, it runs each query through BM25, vector, hybrid (no rerank), and the full pipeline, then reports precision@k, recall, MRR, F1, and latency.

This is primarily a regression testing tool: users create fixtures for their own vaults to catch quality regressions after config or index changes. Ships with an example fixture against the eval-docs test collection to demonstrate the format.

New files:
- src/bench/bench.ts: main runner
- src/bench/score.ts: precision, recall, MRR, F1, path matching
- src/bench/types.ts: fixture and result types
- src/bench/fixtures/: example fixture
- test/bench-score.test.ts: unit tests for scoring (16 tests)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
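For readers unfamiliar with the metrics named in the commit message, here is a minimal sketch of how per-query precision@k, recall, reciprocal rank, and F1 could be computed from a ranked result list. It is not the code in src/bench/score.ts: the names `looseMatch`, `scoreQuery`, and `mean`, and the suffix-based path matching rule, are assumptions made for illustration.

```typescript
// Hypothetical sketch; names and matching rules are assumptions, not the PR's implementation.

interface QueryScores {
  precisionAtK: number;   // relevant results in the top k, divided by k
  recall: number;         // expected files found in the top k, divided by expected count
  reciprocalRank: number; // 1 / rank of the first relevant result, 0 if none
  f1: number;             // harmonic mean of precisionAtK and recall
}

// Normalize separators, drop a leading "./", and lowercase so paths compare loosely.
function normalize(path: string): string {
  return path.replace(/\\/g, "/").replace(/^\.\//, "").toLowerCase();
}

// Assumed rule: a result matches if it equals the expected path or ends with "/<expected>".
function looseMatch(result: string, expected: string): boolean {
  const r = normalize(result);
  const e = normalize(expected);
  return r === e || r.endsWith("/" + e);
}

function scoreQuery(results: string[], expected: string[], k: number): QueryScores {
  const topK = results.slice(0, k);
  const isRelevant = (r: string) => expected.some((e) => looseMatch(r, e));

  const relevantInTopK = topK.filter(isRelevant).length;
  const expectedFound = expected.filter((e) => topK.some((r) => looseMatch(r, e))).length;

  const precisionAtK = k > 0 ? relevantInTopK / k : 0;
  const recall = expected.length > 0 ? expectedFound / expected.length : 0;

  // Reciprocal rank looks at the whole ranked list, not only the top k.
  const firstHit = results.findIndex(isRelevant);
  const reciprocalRank = firstHit >= 0 ? 1 / (firstHit + 1) : 0;

  const denom = precisionAtK + recall;
  const f1 = denom > 0 ? (2 * precisionAtK * recall) / denom : 0;

  return { precisionAtK, recall, reciprocalRank, f1 };
}

// MRR is the mean of reciprocal ranks over all queries in a fixture.
function mean(values: number[]): number {
  return values.length > 0 ? values.reduce((a, b) => a + b, 0) / values.length : 0;
}
```

Computing these per backend and per query, then averaging with `mean`, gives one row per backend that can be compared across runs. The sketch does not deduplicate results, so a backend that returns the same file twice would be counted twice in precision@k.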
Summary
- `qmd bench <fixture.json>` command that measures search quality across all four backends (BM25, vector, hybrid no-rerank, full pipeline)
- Ships with an example fixture against the `eval-docs` test collection

Motivation
QMD has `test/eval-harness.ts` for CLI-based eval, but no SDK-based benchmark that compares all backends side by side or tracks quality metrics over time. After I built a personal benchmark suite for my vault search, Tobi suggested contributing it upstream.

This is primarily a regression testing tool: users create fixture files for their own vaults to catch quality regressions after config changes, reindexing, or model updates. The distinction matters: user-created fixtures measure "does QMD still return what I expect?", not absolute retrieval quality.
Fixture format
```json
{
  "description": "My vault benchmark",
  "version": 1,
  "collection": "my-collection",
  "queries": [
    {
      "id": "exact-keyword",
      "query": "API versioning",
      "type": "exact",
      "description": "Direct keyword match",
      "expected_files": ["api-design-principles.md"],
      "expected_in_top_k": 1
    }
  ]
}
```

Usage
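Before running a fixture, it can help to check that it has the shape shown in the fixture format above. The following is a sketch under stated assumptions: the `BenchFixture` and `BenchQuery` types are inferred from the example JSON only, `validateFixture` is a hypothetical helper, and the specific rules (non-empty `expected_files`, positive integer `expected_in_top_k`, unique ids) are guesses rather than what src/bench/types.ts or the runner actually enforces.

```typescript
// Hypothetical sketch; field optionality and validation rules are assumptions.

interface BenchQuery {
  id: string;
  query: string;
  type: string;
  description?: string;
  expected_files: string[];
  expected_in_top_k: number;
}

interface BenchFixture {
  description: string;
  version: number;
  collection: string;
  queries: BenchQuery[];
}

// Returns a list of human-readable problems; an empty list means the fixture looks usable.
function validateFixture(raw: unknown): string[] {
  if (typeof raw !== "object" || raw === null) return ["fixture must be a JSON object"];
  const fixture = raw as Record<string, unknown>;
  const problems: string[] = [];

  const collection = fixture["collection"];
  if (typeof collection !== "string" || collection.length === 0) {
    problems.push("collection must be a non-empty string");
  }

  const queries = fixture["queries"];
  if (!Array.isArray(queries) || queries.length === 0) {
    problems.push("queries must be a non-empty array");
    return problems;
  }

  const seenIds = new Set<string>();
  queries.forEach((item: unknown, index: number) => {
    const q = (typeof item === "object" && item !== null ? item : {}) as Record<string, unknown>;
    const id = q["id"];
    const label = typeof id === "string" && id.length > 0 ? id : `queries[${index}]`;

    if (typeof id !== "string" || id.length === 0) problems.push(`${label}: id must be a non-empty string`);
    else if (seenIds.has(id)) problems.push(`${label}: duplicate id`);
    else seenIds.add(id);

    const query = q["query"];
    if (typeof query !== "string" || query.length === 0) problems.push(`${label}: query must be a non-empty string`);

    const files = q["expected_files"];
    if (!Array.isArray(files) || files.length === 0 || !files.every((p) => typeof p === "string")) {
      problems.push(`${label}: expected_files must be a non-empty array of strings`);
    }

    const k = q["expected_in_top_k"];
    if (typeof k !== "number" || !Number.isInteger(k) || k < 1) {
      problems.push(`${label}: expected_in_top_k must be a positive integer`);
    }
  });

  return problems;
}
```

A caller would parse the fixture file with `JSON.parse`, pass the result to `validateFixture`, and only cast it to `BenchFixture` when the returned list is empty. Collecting all problems instead of throwing on the first makes it easier to fix a hand-written fixture in one pass.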
Output
Test plan
- `npx vitest run test/bench-score.test.ts`: 16 unit tests for scoring functions (`normalizePath`, `pathsMatch`, `scoreResults`)
- `qmd bench src/bench/fixtures/example.json` against a local eval-docs collection
- `qmd bench src/bench/fixtures/example.json --json` produces valid JSON

🤖 Generated with Claude Code