
Update hybrid_search.py #42

Merged

m1rl0k merged 2 commits into test from Search-Scaling on Dec 1, 2025
Conversation


m1rl0k (Collaborator) commented on Dec 1, 2025

feat(hybrid-search): auto-scale search parameters for large codebases

Automatically adjust RRF and retrieval parameters based on collection size to maintain search quality at scale (100k-500k+ LOC codebases).

Changes:

  • Add _scale_rrf_k(): logarithmic RRF k scaling for better score discrimination
  • Add _adaptive_per_query(): sqrt-based candidate retrieval scaling
  • Add _normalize_scores(): z-score + sigmoid normalization for compressed distributions
  • Add _get_collection_stats(): cached collection size lookup (5-min TTL)
  • Apply scaling to both MCP (run_hybrid_search) and CLI paths
  • All scaling enabled by default, no configuration required
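The cached collection-size lookup can be sketched roughly as below. `fetch_point_count` is a hypothetical stand-in for whatever client call actually counts points (it is not named in this PR), and the 300-second TTL matches the 5-minute figure above.

```python
import time

STATS_TTL_SECONDS = 300  # 5-minute TTL, per the description above
_stats_cache = {}        # collection name -> (timestamp, point_count)

def get_collection_stats(collection, fetch_point_count):
    """Return the collection's point count, cached for STATS_TTL_SECONDS."""
    now = time.monotonic()
    entry = _stats_cache.get(collection)
    if entry is not None and now - entry[0] < STATS_TTL_SECONDS:
        return entry[1]  # cache hit: entry is still fresh
    count = fetch_point_count(collection)
    _stats_cache[collection] = (now, count)
    return count
```

Caching here matters because the scaled parameters are recomputed on every search, and a per-search size query to the vector store would add a round trip to each request.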

Scaling behavior:

  • Threshold: 10,000 points (configurable via HYBRID_LARGE_THRESHOLD)
  • RRF k: 60 → up to 180 (3x max, logarithmic)
  • Per-query: 24 → up to 72 (3x max, sqrt scaling)
  • Score normalization spreads compressed score ranges

Small codebases (<10k points) are unaffected - parameters unchanged.
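A minimal sketch of the logarithmic k scaling described above, assuming base-10 log of the size ratio with the 3x cap; the constants are the ones quoted in the description, but the exact formula is an inference from the test numbers printed below, not lifted from the diff.

```python
import math

BASE_RRF_K = 60
LARGE_COLLECTION_THRESHOLD = 10_000
MAX_RRF_K_SCALE = 3.0

def scale_rrf_k(num_points: int) -> int:
    """Scale RRF k logarithmically with collection size, capped at 3x base."""
    if num_points <= LARGE_COLLECTION_THRESHOLD:
        return BASE_RRF_K  # small codebases: parameters unchanged
    ratio = num_points / LARGE_COLLECTION_THRESHOLD
    factor = min(MAX_RRF_K_SCALE, 1.0 + math.log10(ratio))
    return int(BASE_RRF_K * factor)
```

With these constants the 180 cap is only reached at 100x the threshold, i.e. a one-million-point collection.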

=== Large Codebase Scaling Tests ===
LARGE_COLLECTION_THRESHOLD: 10000
MAX_RRF_K_SCALE: 3.0
SCORE_NORMALIZE_ENABLED: True
Base RRF_K: 60

--- RRF K Scaling ---
5000 points -> k=60
10000 points -> k=60
50000 points -> k=101
100000 points -> k=120
250000 points -> k=143
500000 points -> k=161

--- Per-Query Scaling ---
5000 points -> per_query=24 (filtered=24)
10000 points -> per_query=24 (filtered=24)
50000 points -> per_query=53 (filtered=37)
100000 points -> per_query=72 (filtered=53)
250000 points -> per_query=72 (filtered=72)
500000 points -> per_query=72 (filtered=72)

--- Score Normalization ---
Before (compressed): [0.5, 0.505, 0.51, 0.495]
After (spread): [0.4443, 0.5557, 0.6617, 0.3383]
Range: 0.4950-0.5100 -> 0.3383-0.6617


Collection: codebase
Points: 16622
Threshold: 10000
Scaling Active: True

RRF K: 60 -> 73 (scale factor: 1.22x)
Per-query: 24 -> 30 (no filters)
Per-query: 24 -> 24 (with filters)
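The sqrt-based candidate scaling and the z-score + sigmoid normalization can be reproduced with the sketch below. The sqrt formula and 3x cap follow the description above; the filtered path halving the effective collection size, and the sigmoid temperature of 2, are assumptions chosen to match the printed test numbers.

```python
import math

BASE_PER_QUERY = 24
LARGE_COLLECTION_THRESHOLD = 10_000
MAX_SCALE = 3.0

def adaptive_per_query(num_points: int, filtered: bool = False) -> int:
    """Scale per-query candidate count with sqrt of collection size."""
    # Assumption: filtered searches see half the effective collection,
    # which reproduces the (filtered=...) column in the output above.
    effective = num_points // 2 if filtered else num_points
    if effective <= LARGE_COLLECTION_THRESHOLD:
        return BASE_PER_QUERY
    factor = min(MAX_SCALE, math.sqrt(effective / LARGE_COLLECTION_THRESHOLD))
    return int(BASE_PER_QUERY * factor)

def normalize_scores(scores, temperature=2.0):
    """Z-score then sigmoid: spreads a compressed distribution into (0, 1)."""
    mean = sum(scores) / len(scores)
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))
    if std == 0:
        return [0.5] * len(scores)  # identical scores: nothing to spread
    return [1 / (1 + math.exp(-(s - mean) / (std * temperature)))
            for s in scores]
```

Both functions are pure, so they can be unit-tested directly against the numbers in the test output without standing up a collection.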

…dency

Download ONNX reranker and tokenizer during Docker build so the image
is self-contained and works without local model files.

Changes:
- Dockerfile.mcp-indexer: Add curl, download models during build
  - cross-encoder/ms-marco-MiniLM-L-6-v2 ONNX (~90MB)
  - BAAI/bge-base-en-v1.5 tokenizer.json
  - Models baked to /app/models/ with ENV defaults
- .env: Update RERANKER_ONNX_PATH and RERANKER_TOKENIZER_PATH
  to /app/models/ (no /work mount needed)

Result:
- Reranker now works out of the box (inproc_hybrid=1, timeout=0)
- No local models/ directory or volume mounts required
- Image is portable and works on any host
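A rough Dockerfile fragment in the spirit of this change; the Hugging Face file paths are assumptions for illustration, not taken from the actual Dockerfile.mcp-indexer diff.

```dockerfile
# Install curl, then bake the reranker model and tokenizer into the image.
RUN apt-get update && apt-get install -y --no-install-recommends curl \
    && rm -rf /var/lib/apt/lists/*

# NOTE: the exact file paths within these model repos are assumed,
# not verified against the real build.
RUN mkdir -p /app/models \
    && curl -L -o /app/models/reranker.onnx \
       https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2/resolve/main/onnx/model.onnx \
    && curl -L -o /app/models/tokenizer.json \
       https://huggingface.co/BAAI/bge-base-en-v1.5/resolve/main/tokenizer.json

# Defaults so the reranker works without any volume mounts.
ENV RERANKER_ONNX_PATH=/app/models/reranker.onnx \
    RERANKER_TOKENIZER_PATH=/app/models/tokenizer.json
```

Baking the models at build time trades a larger image (~90MB extra) for a container that needs no host-side model directory.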
m1rl0k merged commit a730f2f into test on Dec 1, 2025
1 check passed
m1rl0k added a commit that referenced this pull request Mar 1, 2026
