Merged
feat(hybrid-search): auto-scale search parameters for large codebases
Automatically adjust RRF and retrieval parameters based on collection
size to maintain search quality at scale (100k-500k+ LOC codebases).
Changes:
- Add _scale_rrf_k(): logarithmic RRF k scaling for better score discrimination
- Add _adaptive_per_query(): sqrt-based candidate retrieval scaling
- Add _normalize_scores(): z-score + sigmoid normalization for compressed distributions
- Add _get_collection_stats(): cached collection size lookup (5-min TTL)
- Apply scaling to both MCP (run_hybrid_search) and CLI paths
- All scaling enabled by default, no configuration required
Scaling behavior:
- Threshold: 10,000 points (configurable via HYBRID_LARGE_THRESHOLD)
- RRF k: 60 → up to 180 (3x max, logarithmic)
- Per-query: 24 → up to 72 (3x max, sqrt scaling)
- Score normalization spreads compressed score ranges
Small codebases (at or below 10k points) are unaffected: their parameters stay unchanged.
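A minimal sketch of the two scaling rules follows. The formulas are inferred from the numbers in the test output of this PR, not copied from the implementation: `1 + log10(points / threshold)` for RRF k and `sqrt(points / threshold)` for per-query, both capped at 3x and truncated to an integer. The `0.7` damping factor for filtered queries is likewise an assumption that happens to match the printed `filtered=` values. Function and constant names are illustrative.

```python
import math

LARGE_COLLECTION_THRESHOLD = 10_000  # HYBRID_LARGE_THRESHOLD default
MAX_SCALE = 3.0                      # 3x cap for both parameters
BASE_RRF_K = 60
BASE_PER_QUERY = 24
FILTERED_FACTOR = 0.7                # assumption: inferred from the test output


def scale_rrf_k(points: int, base_k: int = BASE_RRF_K) -> int:
    """Logarithmic RRF k scaling; unchanged at or below the threshold."""
    if points <= LARGE_COLLECTION_THRESHOLD:
        return base_k
    scale = 1.0 + math.log10(points / LARGE_COLLECTION_THRESHOLD)
    return int(base_k * min(MAX_SCALE, scale))


def adaptive_per_query(
    points: int, base: int = BASE_PER_QUERY, filtered: bool = False
) -> int:
    """Sqrt-based candidate count scaling, damped when query filters are present."""
    if points <= LARGE_COLLECTION_THRESHOLD:
        return base
    scale = math.sqrt(points / LARGE_COLLECTION_THRESHOLD)
    if filtered:
        scale *= FILTERED_FACTOR
    # Never drop below the base value, never exceed the 3x cap.
    return int(base * max(1.0, min(MAX_SCALE, scale)))
```

Under these assumptions, per-query reaches its 72 cap at about 90,000 points (unfiltered), while RRF k would only reach 180 at 1,000,000 points, which is why the 500,000-point row still shows k=161.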
=== Large Codebase Scaling Tests ===
LARGE_COLLECTION_THRESHOLD: 10000
MAX_RRF_K_SCALE: 3.0
SCORE_NORMALIZE_ENABLED: True
Base RRF_K: 60
--- RRF K Scaling ---
5000 points -> k=60
10000 points -> k=60
50000 points -> k=101
100000 points -> k=120
250000 points -> k=143
500000 points -> k=161
--- Per-Query Scaling ---
5000 points -> per_query=24 (filtered=24)
10000 points -> per_query=24 (filtered=24)
50000 points -> per_query=53 (filtered=37)
100000 points -> per_query=72 (filtered=53)
250000 points -> per_query=72 (filtered=72)
500000 points -> per_query=72 (filtered=72)
--- Score Normalization ---
Before (compressed): [0.5, 0.505, 0.51, 0.495]
After (spread): [0.4443, 0.5557, 0.6617, 0.3383]
Range: 0.4950-0.5100 -> 0.3383-0.6617
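The normalization numbers above can be reproduced with the sketch below, which is an inference from the printed output rather than the actual `_normalize_scores()` code. It assumes a population standard deviation and a sigmoid applied to half the z-score; the `steepness=0.5` parameter and the early return for near-constant inputs are assumptions.

```python
import math
from typing import List


def normalize_scores(scores: List[float], steepness: float = 0.5) -> List[float]:
    """Z-score each value, then squash it through a sigmoid into (0, 1)."""
    if len(scores) < 2:
        return list(scores)
    mean = sum(scores) / len(scores)
    # Population variance (divide by n), which matches the printed example.
    variance = sum((s - mean) ** 2 for s in scores) / len(scores)
    std = math.sqrt(variance)
    if std < 1e-12:
        # All scores effectively equal: there is nothing to spread.
        return list(scores)
    return [1.0 / (1.0 + math.exp(-steepness * (s - mean) / std)) for s in scores]


spread = normalize_scores([0.5, 0.505, 0.51, 0.495])
print([round(s, 4) for s in spread])
```

Because the transform is monotonic, ranking order is preserved; only the spacing between scores changes.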
-------------------
Collection: codebase
Points: 16622
Threshold: 10000
Scaling Active: True
RRF K: 60 -> 73 (scale factor: 1.22x)
Per-query: 24 -> 30 (no filters)
Per-query: 24 -> 24 (with filters)
…dency
Download ONNX reranker and tokenizer during Docker build so the image is
self-contained and works without local model files.
Changes:
- Dockerfile.mcp-indexer: Add curl, download models during build
  - cross-encoder/ms-marco-MiniLM-L-6-v2 ONNX (~90MB)
  - BAAI/bge-base-en-v1.5 tokenizer.json
  - Models baked to /app/models/ with ENV defaults
- .env: Update RERANKER_ONNX_PATH and RERANKER_TOKENIZER_PATH to /app/models/ (no /work mount needed)
Result:
- Reranker now works out of the box (inproc_hybrid=1, timeout=0)
- No local models/ directory or volume mounts required
- Image is portable and works on any host