Pseudo/tag backfill - indexer base index fast pass, faster index path for unchanged files #46
Merged
m1rl0k merged 19 commits into Context-Engine-AI:test on Dec 8, 2025
- Add an explicit mode knob to the repo_search MCP tool (code_first, balanced, docs_first)
- Plumb mode through repo_search → run_hybrid_search for in-process hybrid search calls
- Make hybrid_search implementation/doc weighting mode-aware:
  - Default/code_first: full IMPLEMENTATION_BOOST and DOCUMENTATION_PENALTY
  - balanced: keep impl boost, halve structural doc penalty
  - docs_first: reduce impl boost and disable structural doc penalty
- Keep documentation penalties purely structural (README/docs/.md/etc) instead of query-phrase based
- Add MCP-side mode-aware reordering in repo_search:
  - Group core implementation code, other code, and docs differently for code_first vs docs_first
  - Implement a code_first post-processing shim to ensure at least N core code hits in the top-K
  - Tunable via REPO_SEARCH_CODE_FIRST_MIN_CORE and REPO_SEARCH_CODE_FIRST_TOP_K
- Thread the mode argument through repo_search_compat so clients can select modes via the compat wrapper
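The code_first post-processing shim could look roughly like the sketch below. This is not the PR's code: `ensure_min_core`, the `is_core` predicate, and the fallback values 2 and 8 are assumptions made for illustration. Only the two environment variable names come from the commit message.

```python
import os
from typing import Callable, List, TypeVar

T = TypeVar("T")

# Env var names are from the commit message; the fallback values are guesses.
MIN_CORE = int(os.environ.get("REPO_SEARCH_CODE_FIRST_MIN_CORE", "2"))
TOP_K = int(os.environ.get("REPO_SEARCH_CODE_FIRST_TOP_K", "8"))


def ensure_min_core(
    results: List[T],
    is_core: Callable[[T], bool],
    min_core: int = MIN_CORE,
    top_k: int = TOP_K,
) -> List[T]:
    """Promote core-code hits so that at least min_core sit in the top_k."""
    top, tail = list(results[:top_k]), list(results[top_k:])
    need = min_core - sum(1 for r in top if is_core(r))
    while need > 0:
        # Best-ranked core hit that is currently outside the top-K.
        src = next((i for i, r in enumerate(tail) if is_core(r)), None)
        # Lowest-ranked non-core hit that is currently inside the top-K.
        dst = next((i for i in range(len(top) - 1, -1, -1) if not is_core(top[i])), None)
        if src is None or dst is None:
            break  # nothing left to promote or nothing left to evict
        promoted, demoted = tail.pop(src), top.pop(dst)
        top.append(promoted)
        tail.insert(0, demoted)
        need -= 1
    return top + tail
```

The shim only swaps items across the top-K boundary, so every hit survives and the relative order of untouched items is preserved.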
Ensures the hybrid search uses the greater value between the originally requested limit and the rerank_top_n value when reranking is enabled. Also enforces the user-requested limit on the final result set.
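A minimal sketch of that limit handling, assuming hypothetical helper names (`fetch_limit`, `finalize`) that are not taken from the PR:

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def fetch_limit(limit: int, rerank_enabled: bool, rerank_top_n: int) -> int:
    """How many candidates to request from hybrid search."""
    if rerank_enabled:
        # The reranker needs at least rerank_top_n candidates to work with.
        return max(limit, rerank_top_n)
    return limit


def finalize(results: Sequence[T], limit: int) -> List[T]:
    """Trim the (possibly reranked) results back to what the user asked for."""
    return list(results[: max(0, limit)])
```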
Refactors the core code classification logic for more accurate identification, re-using hybrid_search's helpers when available. This change avoids duplicating extension and path-based heuristics and allows for better mode-aware reordering of search results.
Ensures that pseudo and tag metadata from index time are carried through in hybrid search results. This allows downstream consumers, such as repo search rerankers, to incorporate index-time GLM/LLM labels into their scoring or display logic. It enriches candidate documents with pseudo/tags information when available, improving reranking and search result context.
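As an illustration of the passthrough, the sketch below copies `pseudo` and `tags` onto a result item when they are present. The payload layout (a nested `metadata` dict with a top-level fallback) and the function name are assumptions, not the PR's code.

```python
from typing import Any, Dict


def attach_pseudo_tags(item: Dict[str, Any], payload: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of item enriched with index-time pseudo/tags, if any."""
    out = dict(item)
    metadata = payload.get("metadata") or {}
    for key in ("pseudo", "tags"):
        # Assumed lookup order: nested metadata first, then the payload root.
        value = metadata.get(key, payload.get(key))
        if value:
            out[key] = value
    return out
```

Items without index-time labels are returned unchanged, so rerankers can treat the fields as optional.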
feat(hybrid-search): auto-scale search parameters for large codebases
Automatically adjust RRF and retrieval parameters based on collection
size to maintain search quality at scale (100k-500k+ LOC codebases).
Changes:
- Add _scale_rrf_k(): logarithmic RRF k scaling for better score discrimination
- Add _adaptive_per_query(): sqrt-based candidate retrieval scaling
- Add _normalize_scores(): z-score + sigmoid normalization for compressed distributions
- Add _get_collection_stats(): cached collection size lookup (5-min TTL)
- Apply scaling to both MCP (run_hybrid_search) and CLI paths
- All scaling enabled by default, no configuration required
Scaling behavior:
- Threshold: 10,000 points (configurable via HYBRID_LARGE_THRESHOLD)
- RRF k: 60 → up to 180 (3x max, logarithmic)
- Per-query: 24 → up to 72 (3x max, sqrt scaling)
- Score normalization spreads compressed score ranges
Small codebases (<10k points) are unaffected - parameters unchanged.
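The two scaling rules can be sketched as follows. The formulas are inferred from the test output posted in this PR, not copied from the implementation: truncating `base * (1 + log10(points / threshold))` and `base * sqrt(points / threshold)` reproduces the logged values, and a 0.7 damping factor reproduces the "filtered" column. The function names and `FILTER_FACTOR` are assumptions.

```python
import math

LARGE_THRESHOLD = 10_000  # HYBRID_LARGE_THRESHOLD default per the commit message
MAX_SCALE = 3.0           # MAX_RRF_K_SCALE in the logged output
FILTER_FACTOR = 0.7       # assumption: inferred from the "filtered" column


def scale_rrf_k(points: int, base_k: int = 60) -> int:
    """Logarithmic RRF k scaling, capped at MAX_SCALE times the base."""
    if points <= LARGE_THRESHOLD:
        return base_k
    scale = min(MAX_SCALE, 1.0 + math.log10(points / LARGE_THRESHOLD))
    return int(base_k * scale)


def adaptive_per_query(points: int, base: int = 24, filtered: bool = False) -> int:
    """Sqrt-based candidate count scaling, clamped to [1x, MAX_SCALE x]."""
    if points <= LARGE_THRESHOLD:
        return base
    scale = math.sqrt(points / LARGE_THRESHOLD)
    if filtered:
        scale *= FILTER_FACTOR  # filtered queries need fewer extra candidates
    scale = max(1.0, min(MAX_SCALE, scale))
    return int(base * scale)
```

For the 16,622-point collection reported in the log, this gives k=73 and per_query=30 (24 with filters), matching the posted numbers.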
=== Large Codebase Scaling Tests ===
LARGE_COLLECTION_THRESHOLD: 10000
MAX_RRF_K_SCALE: 3.0
SCORE_NORMALIZE_ENABLED: True
Base RRF_K: 60
--- RRF K Scaling ---
5000 points -> k=60
10000 points -> k=60
50000 points -> k=101
100000 points -> k=120
250000 points -> k=143
500000 points -> k=161
--- Per-Query Scaling ---
5000 points -> per_query=24 (filtered=24)
10000 points -> per_query=24 (filtered=24)
50000 points -> per_query=53 (filtered=37)
100000 points -> per_query=72 (filtered=53)
250000 points -> per_query=72 (filtered=72)
500000 points -> per_query=72 (filtered=72)
--- Score Normalization ---
Before (compressed): [0.5, 0.505, 0.51, 0.495]
After (spread): [0.4443, 0.5557, 0.6617, 0.3383]
Range: 0.4950-0.5100 -> 0.3383-0.6617
-------------------
Collection: codebase
Points: 16622
Threshold: 10000
Scaling Active: True
RRF K: 60 -> 73 (scale factor: 1.22x)
Per-query: 24 -> 30 (no filters)
Per-query: 24 -> 24 (with filters)
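The score normalization numbers in the log above are consistent with a z-score (population standard deviation) passed through a sigmoid with a 0.5 steepness. The sketch below is a reconstruction under that assumption; the function name and the `steepness` parameter are not taken from the PR.

```python
import math
from typing import List


def normalize_scores(scores: List[float], steepness: float = 0.5) -> List[float]:
    """Spread a compressed score range: z-score, then sigmoid."""
    if len(scores) < 2:
        return list(scores)
    mean = sum(scores) / len(scores)
    variance = sum((s - mean) ** 2 for s in scores) / len(scores)
    std = math.sqrt(variance)
    if std < 1e-12:
        return list(scores)  # all scores (nearly) equal: nothing to spread
    return [1.0 / (1.0 + math.exp(-steepness * (s - mean) / std)) for s in scores]
```

Because the sigmoid is monotonic, the ranking order is unchanged; only the gaps between scores widen.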
- Avoid deriving a root-level "/work-<hash>" collection in multi-repo mode
- Resolve per-file Qdrant collections via get_collection_for_file for all data ops
- Fix on_deleted and move/rename delete paths to use repo-specific collections instead of the watcher’s default_collection
Adds a background worker to backfill missing pseudo/tags and lexical vectors in Qdrant. This allows for a two-phase indexing process where base vectors are written first, followed by a background process to enrich them. This is enabled via the `PSEUDO_BACKFILL_ENABLED` environment variable and configured with interval and batch size.
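A rough sketch of the two-phase idea, using an in-memory dict in place of Qdrant. Everything except the `PSEUDO_BACKFILL_ENABLED` name is hypothetical: the accepted truthy values, the `enrich` callback, and the way interval and batch size are passed in are assumptions, since the commit message does not name those settings.

```python
import os
import threading
from typing import Callable, Dict


def backfill_enabled() -> bool:
    # Assumption: common truthy spellings enable the worker.
    return os.environ.get("PSEUDO_BACKFILL_ENABLED", "").strip().lower() in {"1", "true", "yes", "on"}


def backfill_once(points: Dict[str, dict], enrich: Callable[[dict], dict], batch_size: int) -> int:
    """Patch up to batch_size points that lack pseudo/tags; return how many were patched."""
    pending = [pid for pid, p in points.items() if not p.get("pseudo") or not p.get("tags")]
    for pid in pending[:batch_size]:
        points[pid].update(enrich(points[pid]))
    return min(len(pending), batch_size)


def run_worker(points: Dict[str, dict], enrich: Callable[[dict], dict],
               interval_s: float, batch_size: int, stop: threading.Event) -> None:
    """Background loop: enrich one batch, then sleep until the next interval or stop."""
    while not stop.is_set():
        backfill_once(points, enrich, batch_size)
        stop.wait(interval_s)
```

Base vectors can be written immediately by the indexer, while this loop fills in the slower LLM-derived fields in small batches.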
…there's nothing to patch
Adds an init container to each indexer service deployment that waits for Qdrant to be available before starting the indexer. This ensures that the indexer does not start processing data before Qdrant is ready to accept connections, preventing potential data loss or corruption.
Adds a debug mode for the pseudo backfill process, enabled via the `PSEUDO_BACKFILL_DEBUG` environment variable. When enabled, the backfill process tracks and reports detailed statistics, including scanned points, GLM calls, the number of filled and updated vectors, and the reasons for skipping vectors, providing insights into the backfill's performance and potential bottlenecks.
Implements a mechanism to skip re-indexing files based on file size and modification time. This optimization, enabled by the `INDEX_FS_FASTPATH` environment variable, significantly speeds up indexing, particularly when dealing with large repositories or frequent re-indexing operations where file contents may not have changed. The logic retrieves file metadata (size and mtime) from the cache and compares it with the current file's metadata. If they match, the file is skipped, avoiding unnecessary re-reading and processing. The change also updates the cache to store file size and mtime along with the file hash.
When fast-fs is enabled, this commit refreshes the file hash cache with size/mtime information during file skipping. This ensures the cache remains up-to-date even when files are skipped due to unchanged content, and enhances the fast-fs performance.
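The per-file skip logic described in the two commits above might be sketched like this. The cache layout (`{path: {"hash", "size", "mtime"}}`), the SHA-1 choice, and the function names are assumptions for illustration; the real cache format may differ.

```python
import hashlib
import os
from typing import Dict

Cache = Dict[str, dict]  # hypothetical layout: {path: {"hash", "size", "mtime"}}


def file_hash(path: str) -> str:
    with open(path, "rb") as fh:
        return hashlib.sha1(fh.read()).hexdigest()


def remember(path: str, digest: str, cache: Cache) -> None:
    """Store the content hash together with the current size and mtime."""
    st = os.stat(path)
    cache[path] = {"hash": digest, "size": st.st_size, "mtime": st.st_mtime}


def should_skip(path: str, cache: Cache, fastpath: bool) -> bool:
    entry = cache.get(path)
    if entry is None:
        return False  # never indexed
    if fastpath and "size" in entry and "mtime" in entry:
        st = os.stat(path)
        if (st.st_size, st.st_mtime) == (entry["size"], entry["mtime"]):
            return True  # skipped without reading the file
    digest = file_hash(path)
    if digest == entry.get("hash"):
        if fastpath:
            # Refresh size/mtime so the next run can take the fast path.
            remember(path, digest, cache)
        return True
    return False
```

A file that matches on size and mtime is never opened; anything else falls back to the hash comparison.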
- Add an INDEX_FS_FASTPATH-gated precheck in index_repo that walks files using cache.json fs metadata (size/mtime) and, when every file matches, exits early before model construction and Qdrant client setup, making true no-change runs much cheaper.
- Leave behavior unchanged for any new/changed/uncached files or entries missing size/mtime metadata (these still fall back to the original full index path).
- Add a TODO above the collection health check to note that these expensive Qdrant probes should eventually be split into a dedicated "health-check-only" mode, so "nothing changed" runs can remain fast while still offering an explicit way to validate collections.
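The whole-repo precheck can be sketched as below. The cache layout, the `fastpath_enabled` flag (standing in for the INDEX_FS_FASTPATH check), and the `full_index` callback are assumptions; the point is only that the early exit happens before any expensive setup.

```python
import os
from typing import Callable, Dict, List


def all_files_unchanged(paths: List[str], cache: Dict[str, dict]) -> bool:
    """True only if every file has cached size/mtime that still match the filesystem."""
    for path in paths:
        entry = cache.get(path)
        if not entry or "size" not in entry or "mtime" not in entry:
            return False  # uncached or old-style entry: take the full path
        try:
            st = os.stat(path)
        except OSError:
            return False
        if st.st_size != entry["size"] or st.st_mtime != entry["mtime"]:
            return False
    return True


def index_repo_sketch(paths: List[str], cache: Dict[str, dict], fastpath_enabled: bool,
                      full_index: Callable[[List[str]], None]) -> str:
    if fastpath_enabled and all_files_unchanged(paths, cache):
        return "skipped"  # exit before model construction and Qdrant client setup
    full_index(paths)     # the original full index path
    return "indexed"
```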
I am debating the usefulness of the "mode" MCP arg for code_first/docs_first... might rip it out after experimenting some more.
Refactors commit search to incorporate lexical scoring, allowing for ranking of results by relevance when a query is provided. This change replaces the previous strict "all tokens must appear" filter with a field-aware scoring mechanism, enabling the system to identify and prioritize commits that better match the specified behavior phrase. The results are then sorted by score and trimmed to the requested limit.
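To illustrate the difference from the old strict filter, here is a minimal field-aware scorer. The field names and weights are hypothetical, as is the tokenizer; the PR does not list them.

```python
import re
from typing import Dict, List

# Hypothetical weights: a hit in the subject counts more than one in the body.
FIELD_WEIGHTS = {"subject": 3.0, "body": 1.5, "files": 1.0}


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9_]+", text.lower())


def score_commit(commit: Dict[str, object], query_tokens: List[str]) -> float:
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        value = commit.get(field, "")
        if isinstance(value, (list, tuple)):
            value = " ".join(value)
        field_tokens = set(tokenize(str(value)))
        score += weight * sum(1 for t in set(query_tokens) if t in field_tokens)
    return score


def search_commits(commits: List[dict], query: str, limit: int) -> List[dict]:
    tokens = tokenize(query)
    if not tokens:
        return commits[:limit]
    scored = [(score_commit(c, tokens), c) for c in commits]
    scored = [sc for sc in scored if sc[0] > 0]          # drop non-matches only
    scored.sort(key=lambda sc: sc[0], reverse=True)      # stable: ties keep input order
    return [c for _, c in scored[:limit]]
```

A commit that matches only some query tokens is now ranked lower instead of being dropped outright.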
Ensures related paths are emitted in the appropriate path space (host or container) based on the PATH_EMIT_MODE environment variable. This change introduces a mapping of container paths to host paths to ensure consistent path representation for human-facing interfaces, while preserving container paths for backend usage.
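A sketch of the container-to-host mapping, assuming a single root pair. The roots `/work` and `/home/user/repos` and the function name are placeholders; only PATH_EMIT_MODE and its host/container distinction come from the commit message.

```python
import os
import posixpath


def emit_path(container_path: str, mode: str,
              container_root: str = "/work",
              host_root: str = "/home/user/repos") -> str:
    """Translate a container path to host space when mode is "host"."""
    if mode != "host":
        return container_path  # backend usage keeps container paths
    root = container_root.rstrip("/")
    if container_path == root:
        return host_root
    prefix = root + "/"
    if container_path.startswith(prefix):
        return posixpath.join(host_root, container_path[len(prefix):])
    return container_path  # outside the mapped root: leave untouched


def current_mode() -> str:
    # Assumption: default to container paths when the variable is unset.
    return os.environ.get("PATH_EMIT_MODE", "container").strip().lower()
```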
Implements an optional feature to enhance commit search using vector embeddings. When enabled by a configuration setting, a semantic score is computed for the query and blended with the lexical/lineage score. This commit also lazy-loads the fastembed library and adds a sanitize helper, so environments without these dependencies still work with pure lexical search.
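One way the optional blend could work is sketched below. The blend weight, the linear formula, and the helper names are assumptions; the PR does not state them. The lazy import falls back to `None` when fastembed is missing, in which case scoring stays purely lexical.

```python
import math
from typing import Optional, Sequence


def load_embedder():
    """Import fastembed only when needed; return None if it is unavailable."""
    try:
        from fastembed import TextEmbedding
    except Exception:
        return None
    return TextEmbedding()


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def blended_score(lexical: float,
                  query_vec: Optional[Sequence[float]],
                  commit_vec: Optional[Sequence[float]],
                  weight: float = 0.3,      # hypothetical blend weight
                  enabled: bool = True) -> float:
    """Mix the lexical/lineage score with a semantic similarity when vectors exist."""
    if not enabled or query_vec is None or commit_vec is None:
        return lexical  # pure lexical fallback
    return (1.0 - weight) * lexical + weight * cosine(query_vec, commit_vec)
```

In practice the lexical score would likely need to be normalized to a comparable range before blending; that step is omitted here.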
Emphasizes the mandatory nature of Qdrant-Indexer tool usage. Reinforces the importance of semantic search with short natural-language queries, discouraging grep/regex syntax.
m1rl0k added a commit that referenced this pull request on Mar 1, 2026
…st-4 Pseudo/tag backfill - indexer base index fast pass, faster index path for unchanged files
Bug fixes:
Fixes some bugs in watch_index, namely:
Improvements:
Pseudo / tags passthrough in `hybrid_search`

Ensure index-time `pseudo` and `tags` metadata are carried through in `hybrid_search` results:
- Read `pseudo`/`tags` from the Qdrant payload/metadata and attach them to each returned item.

Why this matters:
- Downstream consumers (such as repo search rerankers) can incorporate `pseudo`/`tags` into their own secondary scoring or UI logic.

Scope:
- … `hybrid_search` itself; we intentionally keep `pseudo`/`tags` as metadata inputs for higher layers, not as mandatory ranking signals.

repo_search `mode` knob (code_first / balanced / docs_first)

Add an explicit `mode` knob to the repo_search MCP tool: `code_first`, `balanced` (default), `docs_first`.

Make run_hybrid_search implementation/doc weighting mode-aware:
- … `IMPLEMENTATION_BOOST`.
- … `DOCUMENTATION_PENALTY`.
- … `TEST_FILE_PENALTY` to prefer implementation over tests.

MCP-side, keep result shaping mode-aware in repo_search:
- `code_first` groups core implementation code, other code, then docs when ranking.
- `docs_first` inverts that: docs first, then implementation/test code.
- A post-processing shim ensures at least N core code hits in the top-K for `code_first`, tunable via:
  - `REPO_SEARCH_CODE_FIRST_MIN_CORE`
  - `REPO_SEARCH_CODE_FIRST_TOP_K`

Empirical behavior (manual stress tests):
- `code_first` reliably pulls `.py` implementation files to the top and pushes `.md` docs down, while still keeping high-value docs in the tail of the top-K.
- `docs_first` flips the ordering: doc pages (README, MCP_API, CLAUDE.example, GETTING_STARTED, etc.) dominate the top-K, with implementation files and tests following.
- … `mode` behaves as intended:

Usage guidance:
- Use `mode="code_first"` for agents or workflows that need “where is this implemented?” answers, but still want docs as a fallback in the same call.
- Use `mode="docs_first"` when you primarily want conceptual/usage explanations from documentation and only occasionally need to dive into code.
- … (`path_glob`/`not_glob`) and use `mode` as a softer preference layer on top.

Perf:
Implements a mechanism to skip re-indexing files based on file size and modification time.

This optimization, enabled by the `INDEX_FS_FASTPATH` environment variable, significantly speeds up indexing, particularly when dealing with large repositories or frequent re-indexing operations where file contents may not have changed. The logic retrieves file metadata (size and mtime) from the cache and compares it with the current file's metadata. If they match, the file is skipped, avoiding unnecessary re-reading and processing.

The change also updates the cache to store file size and mtime along with the file hash.

When fast-fs is enabled, this commit refreshes the file hash cache with size/mtime information during file skipping. This ensures the cache remains up-to-date even when files are skipped due to unchanged content, and enhances the fast-fs performance.

- Add an INDEX_FS_FASTPATH-gated precheck in index_repo that walks files using cache.json fs metadata (size/mtime) and, when every file matches, exits early before model construction and Qdrant client setup, making true no-change runs much cheaper.
- Leave behavior unchanged for any new/changed/uncached files or entries missing size/mtime metadata (these still fall back to the original full index path).
- Add a TODO above the collection health check to note that these expensive Qdrant probes should eventually be split into a dedicated "health-check-only" mode, so "nothing changed" runs can remain fast while still offering an explicit way to validate collections.