
Update hybrid_search.py #42

Merged

m1rl0k merged 2 commits into test from Search-Scaling on Dec 1, 2025
Conversation


m1rl0k (Collaborator) commented on Dec 1, 2025

feat(hybrid-search): auto-scale search parameters for large codebases

Automatically adjust RRF and retrieval parameters based on collection size to maintain search quality at scale (100k-500k+ LOC codebases).

Changes:

  • Add _scale_rrf_k(): logarithmic RRF k scaling for better score discrimination
  • Add _adaptive_per_query(): sqrt-based candidate retrieval scaling
  • Add _normalize_scores(): z-score + sigmoid normalization for compressed distributions
  • Add _get_collection_stats(): cached collection size lookup (5-min TTL)
  • Apply scaling to both MCP (run_hybrid_search) and CLI paths
  • All scaling enabled by default, no configuration required
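The cached collection-size lookup can be sketched roughly as below. `fetch_point_count` is a hypothetical stand-in for whatever client call actually counts points (it is not named in this PR), and the 300-second TTL matches the 5-minute figure above.

```python
import time

STATS_TTL_SECONDS = 300  # 5-minute TTL, per the description above
_stats_cache = {}        # collection name -> (timestamp, point_count)

def get_collection_stats(collection, fetch_point_count):
    """Return the collection's point count, cached for STATS_TTL_SECONDS."""
    now = time.monotonic()
    entry = _stats_cache.get(collection)
    if entry is not None and now - entry[0] < STATS_TTL_SECONDS:
        return entry[1]  # cache hit: entry is still fresh
    count = fetch_point_count(collection)
    _stats_cache[collection] = (now, count)
    return count
```

Caching here matters because the scaled parameters are recomputed on every search, and a per-search size query to the vector store would add a round trip to each request.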

Scaling behavior:

  • Threshold: 10,000 points (configurable via HYBRID_LARGE_THRESHOLD)
  • RRF k: 60 → up to 180 (3x max, logarithmic)
  • Per-query: 24 → up to 72 (3x max, sqrt scaling)
  • Score normalization spreads compressed score ranges

Small codebases (<10k points) are unaffected - parameters unchanged.
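A minimal sketch of the logarithmic k scaling described above, assuming base-10 log of the size ratio with the 3x cap; the constants are the ones quoted in the description, but the exact formula is an inference from the test numbers printed below, not lifted from the diff.

```python
import math

BASE_RRF_K = 60
LARGE_COLLECTION_THRESHOLD = 10_000
MAX_RRF_K_SCALE = 3.0

def scale_rrf_k(num_points: int) -> int:
    """Scale RRF k logarithmically with collection size, capped at 3x base."""
    if num_points <= LARGE_COLLECTION_THRESHOLD:
        return BASE_RRF_K  # small codebases: parameters unchanged
    ratio = num_points / LARGE_COLLECTION_THRESHOLD
    factor = min(MAX_RRF_K_SCALE, 1.0 + math.log10(ratio))
    return int(BASE_RRF_K * factor)
```

With these constants the 180 cap is only reached at 100x the threshold, i.e. a one-million-point collection.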

=== Large Codebase Scaling Tests ===
LARGE_COLLECTION_THRESHOLD: 10000
MAX_RRF_K_SCALE: 3.0
SCORE_NORMALIZE_ENABLED: True
Base RRF_K: 60

--- RRF K Scaling ---
5000 points -> k=60
10000 points -> k=60
50000 points -> k=101
100000 points -> k=120
250000 points -> k=143
500000 points -> k=161

--- Per-Query Scaling ---
5000 points -> per_query=24 (filtered=24)
10000 points -> per_query=24 (filtered=24)
50000 points -> per_query=53 (filtered=37)
100000 points -> per_query=72 (filtered=53)
250000 points -> per_query=72 (filtered=72)
500000 points -> per_query=72 (filtered=72)

--- Score Normalization ---
Before (compressed): [0.5, 0.505, 0.51, 0.495]
After (spread): [0.4443, 0.5557, 0.6617, 0.3383]
Range: 0.4950-0.5100 -> 0.3383-0.6617


Collection: codebase
Points: 16622
Threshold: 10000
Scaling Active: True

RRF K: 60 -> 73 (scale factor: 1.22x)
Per-query: 24 -> 30 (no filters)
Per-query: 24 -> 24 (with filters)
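The sqrt-based candidate scaling and the z-score + sigmoid normalization can be reproduced with the sketch below. The sqrt formula and 3x cap follow the description above; the filtered path halving the effective collection size, and the sigmoid temperature of 2, are assumptions chosen to match the printed test numbers.

```python
import math

BASE_PER_QUERY = 24
LARGE_COLLECTION_THRESHOLD = 10_000
MAX_SCALE = 3.0

def adaptive_per_query(num_points: int, filtered: bool = False) -> int:
    """Scale per-query candidate count with sqrt of collection size."""
    # Assumption: filtered searches see half the effective collection,
    # which reproduces the (filtered=...) column in the output above.
    effective = num_points // 2 if filtered else num_points
    if effective <= LARGE_COLLECTION_THRESHOLD:
        return BASE_PER_QUERY
    factor = min(MAX_SCALE, math.sqrt(effective / LARGE_COLLECTION_THRESHOLD))
    return int(BASE_PER_QUERY * factor)

def normalize_scores(scores, temperature=2.0):
    """Z-score then sigmoid: spreads a compressed distribution into (0, 1)."""
    mean = sum(scores) / len(scores)
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))
    if std == 0:
        return [0.5] * len(scores)  # identical scores: nothing to spread
    return [1 / (1 + math.exp(-(s - mean) / (std * temperature)))
            for s in scores]
```

Both functions are pure, so they can be unit-tested directly against the numbers in the test output without standing up a collection.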

…dency

Download ONNX reranker and tokenizer during Docker build so the image
is self-contained and works without local model files.

Changes:
- Dockerfile.mcp-indexer: Add curl, download models during build
  - cross-encoder/ms-marco-MiniLM-L-6-v2 ONNX (~90MB)
  - BAAI/bge-base-en-v1.5 tokenizer.json
  - Models baked to /app/models/ with ENV defaults
- .env: Update RERANKER_ONNX_PATH and RERANKER_TOKENIZER_PATH
  to /app/models/ (no /work mount needed)

Result:
- Reranker now works out of the box (inproc_hybrid=1, timeout=0)
- No local models/ directory or volume mounts required
- Image is portable and works on any host
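A rough Dockerfile fragment in the spirit of this change; the Hugging Face file paths are assumptions for illustration, not taken from the actual Dockerfile.mcp-indexer diff.

```dockerfile
# Install curl, then bake the reranker model and tokenizer into the image.
RUN apt-get update && apt-get install -y --no-install-recommends curl \
    && rm -rf /var/lib/apt/lists/*

# NOTE: the exact file paths within these model repos are assumed,
# not verified against the real build.
RUN mkdir -p /app/models \
    && curl -L -o /app/models/reranker.onnx \
       https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2/resolve/main/onnx/model.onnx \
    && curl -L -o /app/models/tokenizer.json \
       https://huggingface.co/BAAI/bge-base-en-v1.5/resolve/main/tokenizer.json

# Defaults so the reranker works without any volume mounts.
ENV RERANKER_ONNX_PATH=/app/models/reranker.onnx \
    RERANKER_TOKENIZER_PATH=/app/models/tokenizer.json
```

Baking the models at build time trades a larger image (~90MB extra) for a container that needs no host-side model directory.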
m1rl0k merged commit a730f2f into test on Dec 1, 2025
1 check passed
m1rl0k added a commit that referenced this pull request Mar 1, 2026
