Smarter symbol-level reindexing, ReFRAG memory server alignment, git commit lineage, dev-remote perms + index speed improvements + worktree dedup #31
Merged
voarsh2 merged 28 commits into Context-Engine-AI:test on Dec 1, 2025
Conversation
…er for ReFRAG mode

Problem:
- Memory server created collections with only [dense, lex] vectors, ignoring REFRAG_MODE
- Indexer expected [dense, lex, mini] vectors, causing "Not existing vector name: mini" errors
- Qdrant doesn't support adding new vector names to existing collections via update_collection()
- update_collection() failed silently, leading to indexing failures and loops
- Collection recreation attempts were failing, causing indexing to get stuck

Solution:
- Add REFRAG_MODE support to the mcp_memory_server.py _ensure_collection() function
- Implement a memory backup/restore system in ingest_code.py ensure_collection() for future recreation needs
- Export memories (points without file_path) before any recreation attempt
- Restore memories with partial vector support for new configurations
- Add proper error handling and logging for collection recreation scenarios

Impact:
- Memory server now creates collections with the correct [dense, lex, mini] vectors from the start
- Eliminates indexing failures and loops caused by the missing mini vector
- Fixes HTTP 400 errors about the missing "mini" vector
- Enables proper ReFRAG mode functionality without requiring recreation
- Preserves user memories during any future collection configuration changes
- Backward compatible and future-proof for additional vector changes
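A minimal sketch of the vector-layout decision described above. The vector names (`dense`, `lex`, `mini`) come from this PR; the sizes and distance values are placeholder assumptions, not the project's real dimensions:

```python
import os

def vector_config(refrag_mode: bool) -> dict:
    """Choose the named-vector layout the indexer expects (sketch)."""
    cfg = {
        "dense": {"size": 768, "distance": "Cosine"},
        "lex": {"size": 512, "distance": "Cosine"},
    }
    if refrag_mode:
        # ReFRAG adds a "mini" vector; creating it up front avoids
        # "Not existing vector name: mini" errors, since Qdrant cannot
        # add vector names to an existing collection later.
        cfg["mini"] = {"size": 64, "distance": "Cosine"}
    return cfg

# Mirror the server's env switch.
REFRAG_MODE = os.getenv("REFRAG_MODE", "0") == "1"
```

The key point is that the memory server and indexer now derive the layout from the same flag, so they can never disagree on vector names.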
Uses dedicated backup and restore scripts for handling memory persistence during collection recreation. This change replaces the in-line memory backup and restore logic with calls to separate, more robust and testable scripts (`memory_backup.py` and `memory_restore.py`). These scripts provide better error handling, logging, and are designed to be more resilient to changes in the Qdrant client. The scripts are now invoked as subprocesses, ensuring better isolation and management of the backup/restore operations. The ingest code now only handles the overall orchestration and error reporting. Adds `--skip-collection-creation` option to memory restore script to allow restoration of memories into a collection that's already initialized. This is specifically useful when `ingest_code.py` handles collection creation. This change improves maintainability and reduces the complexity of the `ingest_code.py` script.
Adds a try-except block when loading the workspace cache to handle cases where the cache file is corrupt or empty. If an exception occurs during loading, the cache is recreated.
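A minimal sketch of that guard, assuming a JSON cache file (the real `.codebase/cache.json` layout may differ):

```python
import json
from pathlib import Path

def load_workspace_cache(path: Path) -> dict:
    """Load the workspace cache, recreating it when missing, empty, or corrupt."""
    try:
        with path.open("r", encoding="utf-8") as f:
            data = json.load(f)
        if not isinstance(data, dict):
            raise ValueError("unexpected cache shape")
        return data
    except (OSError, ValueError, json.JSONDecodeError):
        # Corrupt or empty cache: start fresh rather than failing indexing.
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("{}", encoding="utf-8")
        return {}
```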
Implements smarter re-indexing strategy that reuses embeddings and reduces unnecessary re-indexing by leveraging a symbol cache. This change introduces symbol extraction using tree-sitter to identify functions, classes, and methods in code files. It compares the symbols against a cache to determine which parts of the code have changed, allowing for targeted re-indexing of only the modified sections. This significantly reduces the processing time and resource consumption associated with indexing large codebases. Adds the ability to reuse existing embeddings/lexical vectors for unchanged code chunks (identified by code content), and re-embed only changed chunks improving efficiency and overall performance. Also, includes logic for improved pseudo-tag generation.
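The content-hash comparison could look roughly like this. Here `compare_symbol_changes` takes plain `symbol_id → source text` dicts for illustration; the real cache stores richer per-symbol metadata:

```python
import hashlib

def _hash(code: str) -> str:
    """Short, stable content hash for a symbol's source text."""
    return hashlib.sha256(code.encode("utf-8")).hexdigest()[:16]

def compare_symbol_changes(old: dict, new: dict):
    """Split symbols into (unchanged, changed) lists by content hash (sketch)."""
    unchanged, changed = [], []
    for sym_id, code in new.items():
        if sym_id in old and _hash(old[sym_id]) == _hash(code):
            unchanged.append(sym_id)  # embeddings for these can be reused
        else:
            changed.append(sym_id)    # new or modified: re-embed
    return unchanged, changed
```

Only symbols in the `changed` list need re-chunking, pseudo-tag generation, and embedding.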
Enables the collection and indexing of git commit history for enhanced context lineage capabilities. - Introduces configuration options to control the depth and scope of git history ingestion. - Implements mechanisms to extract commit metadata, diffs, and lineage information. - Integrates git history into the existing indexing pipeline, allowing agents to reason about code evolution. - Provides a new command to trigger a force sync to upload git history.
Improves efficiency of git history collection by fetching only the commits since the last successful upload, instead of always starting from the beginning. It constructs a `git rev-list` command that fetches only the commits between the previous HEAD and the current HEAD, if both are available and different. This reduces the amount of data that needs to be processed, improving performance. If either HEAD is missing or they are the same, the command defaults to fetching the entire history from HEAD.
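A sketch of the `git rev-list` construction described above; `rev_list_args` and `collect_commits` are illustrative names, not the client's actual helpers:

```python
import subprocess
from typing import List, Optional

def rev_list_args(prev_head: Optional[str], head: str,
                  since: Optional[str] = None) -> List[str]:
    """Build the rev-list invocation for incremental history collection."""
    args = ["git", "rev-list", "--no-merges"]
    if since:
        args.append(f"--since={since}")
    if prev_head and prev_head != head:
        # Both heads known and different: fetch only the new range.
        args.append(f"{prev_head}..{head}")
    else:
        # Missing or unchanged previous head: full history from HEAD.
        args.append(head)
    return args

def collect_commits(prev_head, head, max_commits=500):
    out = subprocess.run(rev_list_args(prev_head, head),
                         capture_output=True, text=True, check=True).stdout
    return out.split()[:max_commits]
```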
Adds a `STRICT_MEMORY_RESTORE` option to control whether memory restoration failures should halt the reindexing process. Introduces a comprehensive test suite for memory backup and restore operations and fixes an issue where point IDs were not correctly converted to integers during restoration.
- Bump extension version - When contextEngineUploader.targetPath is unset, derive the CTX workspace from the VS Code folder: - If the workspace folder has .context-engine, ctx_config.json, .codebase/state.json, or .git, treat it as the root. - Otherwise, scan one level of child directories; if exactly one child matches those markers, use it. If zero or many, fall back to the workspace folder. - Keep explicit contextEngineUploader.targetPath as the single source of truth when configured. - When running Prompt+ (ctx.py), set CTX_WORKSPACE_DIR using the same target-path/auto-detection logic so ctx.py reads .env/ctx_config.json from the same CTX workspace as the uploader/indexer. - No server-side behavior changes; only VS Code extension workspace detection and Prompt+ wiring are updated.
- docs: document SSE vs HTTP MCP transports and recommend HTTP /mcp endpoints for IDEs (Claude, Windsurf, etc.), including notes on the FastMCP SSE init race - vscode-extension: add mcpTransportMode setting to choose between mcp-remote SSE and direct HTTP MCP for Claude/Windsurf configs - vscode-extension: add autoWriteMcpConfigOnStartup to refresh .mcp.json, Windsurf mcp_config.json, and the Claude hook on extension activation - vscode-extension: update extension README to describe MCP transport modes and startup MCP config behavior - vscode-extension: improve targetPath auto-detection to prefer a single git/ .codebase child repo under the workspace root
- Always probe the configured pythonPath with bundled python_libs before doing anything else - If that fails, auto-detect a working system Python (python3/python/py/Homebrew) via detectSystemPython and reuse the bundled libs - Only as a last resort, prompt to create a private venv and pip-install deps, then switch to that interpreter - Reduces spurious venv prompts when switching between systems where only `python` or `python3` is available
- remote_upload_client/standalone_upload_client: decode git subprocess output as UTF-8 with errors='replace' instead of relying on Windows cp1252 locale - Prevents noisy UnicodeDecodeError stack traces during uploads on Windows while keeping git_history manifests usable
Configures shared volume and environment variables to enable model caching for context-engine components. This reduces redundant downloads and speeds up processing.
…tection flow - Update writeCtxConfig() to call ensurePythonDependencies before probing --show-mapping - Re-resolve upload client options after Python detection so pythonOverridePath is honored - Prevent Windows Store python/python3 aliases from breaking collection inference - Keep inferCollectionFromUpload() behavior the same once a valid interpreter is selected
- Use as_posix() for all relative paths in standalone and remote upload clients - Ensure operations.json uses forward-slash paths (e.g. scripts/foo.py) - Match tar layout so upload_service can extract files from Windows bundles - Fixes issue where only top-level files appeared under /work/<repo>-<hash> on the cluster
Configures the `HF_HOME`, `XDG_CACHE_HOME`, and `HF_HUB_CACHE` environment variables for both the HTTP server and the indexer. This change ensures that both components use a consistent Hugging Face cache directory, preventing redundant downloads and improving efficiency.
Use watchdog's PollingObserver when WATCH_USE_POLLING is set so the watcher can see file changes made by other pods on the shared /work PVC (NFS/CephFS), where inotify events are not reliably propagated across nodes. Default behavior remains unchanged when the env var is unset.
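A sketch of the switch, assuming truthy values of `WATCH_USE_POLLING` enable polling (the real watcher may treat any non-empty value as set):

```python
import os

def use_polling(env=None) -> bool:
    """Decide whether the watcher should poll instead of relying on inotify.

    Polling is needed on shared /work PVCs (NFS/CephFS) where inotify events
    from other pods/nodes never arrive; default keeps native watching.
    """
    env = os.environ if env is None else env
    return env.get("WATCH_USE_POLLING", "").strip().lower() in ("1", "true", "yes")

def make_observer(env=None):
    """Instantiate the watchdog observer class chosen by use_polling()."""
    if use_polling(env):
        from watchdog.observers.polling import PollingObserver
        return PollingObserver()  # periodic stat() scans; works across nodes
    from watchdog.observers import Observer
    return Observer()  # native inotify/FSEvents backend
```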
- Make ctx.py trust the server-chosen `path` field when formatting search results, falling back to `host_path`/`container_path` only when `path` is missing. This centralizes display-path policy in the indexer (`hybrid_search` + PATH_EMIT_MODE) instead of duplicating it
in the CTX CLI / hook.
- Fix ingest_code.py host_path mapping for remote upload workspaces.
When indexing under /work/<repo-slug> and origin.source_path is set, drop the leading slug segment and map:
/work/<slug>/… → <origin.source_path>/…
so metadata.host_path is a clean, user-facing path rooted at the original client workspace, without embedding the slug directory.
- Update docker-compose.dev-remote PATH_EMIT_MODE from `container` to `auto` for indexer / MCP services so hybrid_search prefers host_path when available and only falls back to container_path. This lets CTX and the VS Code extension show real host/workspace paths derived from
origin.source_path, while still allowing deployments to force pure container paths by setting PATH_EMIT_MODE=container if desired.
Overall, CTX hook output now surfaces consistent, user-facing paths (e.g. /home/.../Context-Engine/…), while container-style /work paths remain available as an explicit server-side configuration choice.
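The slug-dropping mapping described above can be sketched as follows (`map_host_path` is a hypothetical helper name):

```python
from pathlib import PurePosixPath

def map_host_path(container_path: str, source_path: str) -> str:
    """Map /work/<slug>/rest... to <origin.source_path>/rest... (sketch).

    Assumes container paths live under /work/<slug>; falls back to the
    container path when the layout doesn't match.
    """
    parts = PurePosixPath(container_path).parts  # ('/', 'work', '<slug>', ...)
    if len(parts) >= 3 and parts[1] == "work":
        rest = parts[3:]  # drop the leading slug segment
        return str(PurePosixPath(source_path, *rest))
    return container_path
```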
Previously, multi-repo indexing called ensure_collection() and
ensure_payload_indexes() inside the per-file loop. In large
workspaces this meant O(#files × #indexed_fields) Qdrant control-plane traffic (many PUT /collections/<coll>/index?wait=true calls), even on the "Skipping unchanged file (cache)" path.
This change introduces ENSURED_COLLECTIONS and
ensure_collection_and_indexes_once() in ingest_code:
- Single-collection index_repo:
- Still recreates the collection when --recreate is set.
- Now uses ensure_collection_and_indexes_once() so collection + payload indexes are ensured once per process, not repeatedly.
- Multi-repo index_repo:
- For each per-repo collection, calls
ensure_collection_and_indexes_once() the first time it is seen.
- Subsequent files in the same collection (including cached
"Skipping unchanged file (cache)" files) no longer trigger
extra ensure_collection / create_payload_index calls.
- Net effect: Qdrant index setup overhead becomes
O(#collections × #indexed_fields) per process instead of
O(#files × #indexed_fields).
watch_index is updated to use the same helper:
- On startup, ensures the default collection once.
- In _process_paths(), ensures each repo's collection once per
watcher process, then relies on cached state for subsequent
file events, avoiding repeated index-setup chatter.
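A sketch of the once-per-process guard; the names mirror the PR's `ENSURED_COLLECTIONS` / `ensure_collection_and_indexes_once`, but the ensure functions are passed in as placeholders:

```python
# Collections already ensured in this process (collection + payload indexes).
ENSURED_COLLECTIONS: set = set()

def ensure_collection_and_indexes_once(client, coll: str, ensure_fns) -> bool:
    """Run collection + payload-index setup at most once per process.

    Returns True when setup actually ran, False when it was skipped. This
    turns O(#files x #indexed_fields) control-plane calls into
    O(#collections x #indexed_fields).
    """
    if coll in ENSURED_COLLECTIONS:
        return False  # already ensured; skip control-plane calls
    for fn in ensure_fns:
        fn(client, coll)  # e.g. ensure_collection, ensure_payload_indexes
    ENSURED_COLLECTIONS.add(coll)
    return True
```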
…nfig-backup-restore
Refines the guidance for AI agents when deciding between using the MCP Qdrant-Indexer and literal search/file-open. The changes emphasize the MCP Qdrant-Indexer as the primary tool for exploration, debugging, and understanding code and history, reserving literal search/file-open for narrow, exact-literal lookups. It also simplifies the heuristics for tool selection and removes redundant descriptions of tool documentation.
…mote compose stack to run after git clone without creating the dirs yourself
Introduces a new "Getting Started" guide for quickly trying Context Engine with VS Code and the dev-remote stack. Updates documentation links across all documents to include the new Getting Started guide.
Investigating a Windows-related host_path bug appending container "/work" + Windows path… and seemingly empty repo_search results returning no code snippets…
…er snippets

Bug: hybrid_search.run_hybrid_search was updated to emit host paths in the `path` field (with PATH_EMIT_MODE=auto preferring host_path), while still attaching `container_path` for the /work/... mirror. mcp_indexer_server.repo_search and context_answer snippet helpers still assumed `path` was a /work path and refused to read files when `path` pointed outside /work. Result: repo_search returned empty snippet fields, and context_answer's identifier-based span filtering lost access to filesystem snippets, even though Qdrant payloads and /work mounts were correct.

Fix:
- In repo_search's _read_snip, prefer item["container_path"] for filesystem reads, falling back to item["path"] only when container_path is missing. Still enforce the /work sandbox: resolve the candidate path and bail if realpath is not under /work.
- In context_answer's _read_span_snippet, do the same: prefer span["container_path"] over span["path"] when resolving the file used to build _ident_snippet.
- Leave `path` unchanged in both APIs so callers continue to see host-centric paths, while internal snippet I/O always targets the server's /work/... tree.

Correctness / compatibility:
- Works for both local Linux and remote Windows uploads: remote clients send host paths only in host_path; the indexer populates container_path under /work/..., which is what the server now uses for reads.
- Existing points that predate dual-path metadata still work via the fallback to `path` when container_path is absent.
- The /work realpath guard is retained, so snippet reads remain sandboxed to the mounted workspace.
- Existing tests like test_repo_search_snippet_strict_cap_after_highlight continue to pass, and manual repo_search calls in the dev-remote stack now return non-empty snippets without reindexing.
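The read-path policy in `_read_snip` can be sketched as below (`resolve_snippet_path` is a hypothetical helper name; the real code reads the file afterwards):

```python
import os

WORK_ROOT = "/work"

def resolve_snippet_path(item: dict):
    """Pick the filesystem path for snippet reads (sketch of the policy).

    Prefer container_path (always a /work path on the server), fall back to
    path for points that predate dual-path metadata, and refuse anything
    that resolves outside the /work sandbox.
    """
    candidate = item.get("container_path") or item.get("path")
    if not candidate:
        return None
    real = os.path.realpath(candidate)
    if real != WORK_ROOT and not real.startswith(WORK_ROOT + os.sep):
        return None  # outside the sandbox: refuse to read
    return real
```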
…letters

- Root cause: when HOST_INDEX_PATH in the indexer container contained a Windows-style path (e.g. "/workc:\Users\Admin\..."), ingest_code combined it with /work/<slug>/... using os.path.join. This produced malformed host_path values like "/work/c:\Users\.../3-5-<slug>/API/Logs.py" that then surfaced in Qdrant metadata and repo_search results.
- ingest_code.index_single_file: guard HOST_INDEX_PATH by ignoring any value that contains a colon (Windows drive letter heuristic). When origin.source_path is available we still derive host_path from that; when origin is missing and HOST_INDEX_PATH looks Windows-y we fall back to using the container path instead of constructing /workC:\... strings.
- watch_index._rename_in_store: apply the same HOST_INDEX_PATH guard when recomputing host_path/container_path during fast-path renames so we don't reintroduce malformed host paths for Windows-uploaded repos.
- Linux/local behavior is unchanged: normal Unix HOST_INDEX_PATH values do not contain a colon, so the guard is inactive and existing host_path/container_path derivation continues to work as before.
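The drive-letter heuristic is simple enough to sketch directly (`safe_host_index_path` is a hypothetical helper name):

```python
def safe_host_index_path(host_index_path: str):
    """Ignore HOST_INDEX_PATH values that look like Windows paths (sketch).

    Heuristic from the PR: any colon (drive letter, e.g. 'c:\\Users\\...')
    means the value cannot be safely joined onto /work/<slug> paths, so the
    caller should fall back to origin.source_path or the container path.
    """
    if not host_index_path or ":" in host_index_path:
        return None
    return host_index_path
```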
…d cross-worktree dedup

- Introduce a stable logical_repo_id in workspace_state, computed from the git common dir when available (git:<sha16>, fs:<sha16> fallback).
- Add workspace_state helpers to find or create collections by logical repo (find_collection_for_logical_repo, get_or_create_collection_for_logical_repo), and persist logical_repo_id + qdrant_collection in .codebase state.
- Update upload_service to accept logical_repo_id from clients, reuse an existing canonical collection when a mapping exists, and perform a one-time latent migration from legacy fs: IDs when there is a single existing mapping.
- Extend ingest_code to store logical file identity in Qdrant payloads (metadata.repo_id + metadata.repo_rel_path) and to prefer this logical identity in get_indexed_file_hash, enabling skip-unchanged across git worktrees / slugs instead of only by absolute path.
- Update index_single_file and batch index_repo flows to derive (repo_id, repo_rel_path) from workspace_state + /work layout, and pass them into Qdrant lookups and metadata writes.
- Enhance watch_index's _get_collection_for_repo to consult workspace state and reuse the canonical collection for all slugs sharing the same logical_repo_id, aligning local watcher/indexer behavior with the remote upload service.
- Gate all logical-repo / collection reuse behavior behind a new LOGICAL_REPO_REUSE feature flag so default behavior remains the legacy per-repo collection and path-based dedup until explicitly enabled.
- Add lightweight logging around logical_repo_id state reads/writes to avoid completely silent failures while keeping the new paths best-effort and backwards compatible.
- Document LOGICAL_REPO_REUSE in .env.example (commented out by default), with notes on collection reuse across worktrees and logical (repo_id + repo_rel_path) skip-unchanged semantics.
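A sketch of the `logical_repo_id` scheme, assuming `git rev-parse --git-common-dir` for the common dir and SHA-256 truncated to 16 hex chars (the hash function is an assumption; the real code would also need to absolutize a relative `.git` result):

```python
import hashlib
import subprocess

def logical_repo_id(repo_dir: str) -> str:
    """Compute a stable logical repo id: git:<sha16> or fs:<sha16> (sketch).

    The git common dir is shared by all worktrees of a repo, so every
    worktree/slug maps to the same id; non-git dirs fall back to hashing
    the directory path itself.
    """
    def sha16(s: str) -> str:
        return hashlib.sha256(s.encode("utf-8")).hexdigest()[:16]

    try:
        common = subprocess.run(
            ["git", "-C", repo_dir, "rev-parse", "--git-common-dir"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        return "git:" + sha16(common)
    except (subprocess.CalledProcessError, FileNotFoundError):
        return "fs:" + sha16(repo_dir)
```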
m1rl0k pushed a commit that referenced this pull request on Mar 1, 2026
…ig-backup-restore Smarter symbol-level reindexing, ReFRAG memory server alignment, git commit lineage, dev-remote perms + index speed improvements + worktree dedup
Summary
This PR fixes a ReFRAG collection configuration mismatch between the memory server and indexer, introduces an opt‑in, symbol‑aware smart reindexing path that reuses embeddings for unchanged code while preserving line numbers, and adds optional git commit lineage indexing wired through the VS Code extension and remote upload pipeline. It also hardens workspace state and symbol cache handling for shared volumes in dev‑remote.
Problem
ReFRAG collection mismatch
- Memory server created collections with only `[dense, lex]` vectors, ignoring `REFRAG_MODE`.
- Indexer expected `[dense, lex, mini]` vectors when ReFRAG was enabled, causing `Not existing vector name: mini` errors.
- Qdrant does not support adding new vector names to an existing collection via `update_collection()`.
- `update_collection()` calls were effectively no-ops, leading to indexing failures and retry loops.

Inefficient full-file reindexing

- Any edit re-chunked and re-embedded the whole file, even when most symbols were untouched.
No integrated git commit lineage pipeline
- Commit lineage metadata (`lineage_goal`, `lineage_symbols`, `lineage_tags`) was only available in ad-hoc CLI experiments, not through the dev-remote watcher + upload service.
- No client/extension controls existed for history ingestion (`--since`, force).
- Commit history was not reachable via `search_commits_for` / `change_history_for_path` in normal workflows.

Git history pipeline and initial inefficiency
- `upload_service` / `watch_index` weren't extracting or ingesting git history into Qdrant.
- Early versions re-collected the full `REMOTE_UPLOAD_GIT_MAX_COMMITS` window from `HEAD` on each run.
- Commits were re-uploaded whenever `HEAD` changed, with no notion of a "last ingested head".
- Manifests landed in `.remote-git`, but this store was only used as a drop location for manifests, not to guide incremental ingestion or provide guarantees about what had been processed.

Dev-remote permissions and cache robustness
- Shared-volume writes hit `.codebase` permission errors, especially for symbol caches.
- Corrupt or empty `cache.json` and symbol cache files could break indexing, with poor recovery behavior.

Solution
1. ReFRAG collection configuration + memory backup/restore
Respect `REFRAG_MODE` in the memory server

- Add `REFRAG_MODE` handling to the memory server's `_ensure_collection()` so new collections are created with the same vector layout the indexer expects: `[dense, lex]` when ReFRAG is off, `[dense, lex, mini]` when it is on.

Safe collection recreation with memory preservation
- Before any recreation, export memory points (identified by the absence of `file_path` in their payload).
- Recreate the collection with the new vector layout (including `mini` when ReFRAG is enabled).
- Restore memories with partial vector support for new configurations (older points may lack `mini`).

Result: Memory server and indexer now agree on vector names from the start; if we ever do need to recreate a collection, user memories are preserved and upgraded safely.
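The memory-export step can be sketched as a filter over scrolled points (the tuple shape here is an assumption; the real backup script handles vectors and batching):

```python
def export_memories(points):
    """Select memory points to back up before collection recreation (sketch).

    Memories are identified by the absence of file_path in the payload;
    'points' is assumed to be an iterable of (id, payload, vectors) tuples
    already fetched via a scroll over the collection.
    """
    return [
        (pid, payload, vectors)
        for pid, payload, vectors in points
        if "file_path" not in (payload or {})
    ]
```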
2. Symbol-aware smart reindexing (opt-in via `SMART_SYMBOL_REINDEXING`)

Symbol cache and change detection
New helpers in `workspace_state.py`:

- `_get_symbol_cache_path(file_path)` → per-repo `.codebase/repos/<repo>/symbols_<hash>.json`, using the same repo detection as other state helpers.
- `get_cached_symbols(file_path)` / `set_cached_symbols(file_path, symbols, file_hash)` → store symbol metadata + content hashes.
- `get_cached_pseudo(...)`, `set_cached_pseudo(...)`, `update_symbols_with_pseudo(...)` → cache pseudo/tags per symbol without creating empty symbol files.
- `compare_symbol_changes(old_symbols, new_symbols)` → compute `(unchanged_symbols, changed_symbols)` using content hashes.
- `remove_cached_symbols(file_path)` → cleanly remove the symbol cache when a file is deleted (used by the watcher).

Smart reindex pipeline (new code path)
- Extract symbols with tree-sitter and key existing chunks by `(symbol_id, code_text)` for reuse.
- Use `set_cached_pseudo(...)` to only run GLM for changed symbols/chunks.
- Reuse stored embeddings/lexical vectors when `(symbol_id, code_text)` matches; otherwise re-embed.
- Log `unchanged_symbols`, `changed_symbols`, and a per-file summary: `chunks=X, reused_points=Y, embedded_points=Z`.

Integration and gating
Indexer (index_repo)
- When `SMART_SYMBOL_REINDEXING` is enabled and `should_use_smart_reindexing(...)` agrees, call `process_file_with_smart_reindexing(...)`.
- `"success"` → count as indexed and skip the legacy path.
- `"failed"` → fall back to `index_single_file(...)`.
- Respect `skip_unchanged` and symbol change heuristics so smart reindexing runs only when it's beneficial.

Watcher (`watch_index.py`)
- Handle deletions (`remove_cached_symbols`) to keep caches consistent.
- Record why smart reindexing was or wasn't used via `smart_reason`.
- When `smart_reason == "no_cached_symbols"`, run smart reindex once to seed the cache.
- Log `[SMART_REINDEX][watcher] Using {smart/full} ... ({reason})`.

Feature flag
- Add `SMART_SYMBOL_REINDEXING` to `.env.example` (default `0`).

Result: When enabled, both the batch indexer and the watcher reuse embeddings for unchanged code at a symbol level, drastically reducing re-embedding and pseudo costs, while keeping a clean escape hatch back to the legacy path.
3. Git commit history ingestion and lineage search (opt‑in)
Upload clients: manifest + cache
- Read `REMOTE_UPLOAD_GIT_MAX_COMMITS`, `REMOTE_UPLOAD_GIT_SINCE`, `REMOTE_UPLOAD_GIT_FORCE` from the environment.
- When `max_commits > 0`, collect recent commits and write `metadata/git_history.json` into each bundle.
- Maintain `<workspace>/.context-engine/git_history_cache.json` with `last_head`, `max_commits`, `since`, `updated_at`.
- On the first run (or with `REMOTE_UPLOAD_GIT_FORCE=1`): `git rev-list --no-merges [--since] HEAD`, truncated to `max_commits`.
- On subsequent runs: `git rev-list --no-merges [--since] <last_head>..<HEAD>` so only new commits since the last ingested head are collected.
- If the cached head is missing or unchanged, fall back to the full window from `HEAD`.

Upload service: `.remote-git` manifest store

- Extract `metadata/git_history.json` from incoming bundles.
- Persist each manifest as `<workspace>/.remote-git/git_history_<bundle_id>.json`.

Watcher: manifest routing
- Route `.remote-git/*.json` files to history ingestion, bypassing normal `CODE_EXTS` filters.
- Run `ingest_history.py --manifest-json /work/.../.remote-git/git_history_<bundle_id>.json` as a subprocess.
- Pass `COLLECTION_NAME`, `QDRANT_URL`, and `REPO_NAME` into the child process environment.
- Emit a `[git_history_manifest]` log line for observability.

Ingest script: manifest mode + lineage summaries
- Add a `--manifest-json` mode that reads commits from the bundle manifest (`git_history.json`) and embeds them with `fastembed`.
- Attach lineage summaries to each commit point: `lineage_goal`, `lineage_symbols`, `lineage_tags`.
- Gate summarization behind `REFRAG_COMMIT_DESCRIBE=1` and decoder env (`REFRAG_DECODER`, `REFRAG_RUNTIME`, GLM/llamacpp configs).
- Ensure `scripts` is importable when run as a subprocess by adding the project root to `sys.path`.

MCP tools and queries
- Commit points carry `kind="git_message"` (or equivalent metadata), plus `commit_id`, `author_name`, `authored_date`, `files`, `message`, and `lineage_*` fields.
- `search_commits_for` and `change_history_for_path` can now surface commit history and lineage in normal workflows.

VS Code extension wiring
In the `context-engine-uploader` extension:

- Add `contextEngineUploader.gitMaxCommits` (default 500).
- Add `contextEngineUploader.gitSince` (empty by default).
- Add an "Upload Git History" command (sets `REMOTE_UPLOAD_GIT_FORCE=1` for a run).
- Wire the settings into `REMOTE_UPLOAD_GIT_MAX_COMMITS`, `REMOTE_UPLOAD_GIT_SINCE`, with `REFRAG_COMMIT_DESCRIBE=1` as the toggle for commit lineage summarization.

Result: We now have an end-to-end, opt-in pipeline for git commit history and lineage indexing, driven by the VS Code extension and upload clients, with incremental behavior and MCP-friendly commit points in Qdrant.
4. Workspace state + dev‑remote permissions hardening
State and symbol cache permissions
- `_atomic_write_state(...)` now chmods `state.json` / `cache.json` to `0664` so multiple processes (upload, watcher, indexer) can update them on shared volumes.
- `set_cached_symbols(...)` ensures symbol cache files are created with `0664` and parent dirs with `0775`, improving multi-process access on `.codebase` volumes.

Dev-remote compose alignment
- Run `indexer-dev-remote` as `user: "1000:1000"`, matching the watcher and MCP services, so they can all write to `.codebase` without root-owned artifacts.
- Keep `upload-service-dev-remote` as `user: "0:0"` only where needed for Windows `/work` bind mounts; it no longer has to own `.codebase`.

Impact
ReFRAG / memory server
- Collections are created with the expected named vectors (`dense`, `lex`, `mini` where applicable).
- Eliminates `Not existing vector name: mini` and HTTP 400 errors.
- No more silent `update_collection()` failures that previously left collections in a half-configured state.
- Preserves memories (points without `file_path`) across any future collection config changes.

Smart symbol reindexing
- Opt-in via `SMART_SYMBOL_REINDEXING=1`; default is off for safety.
- Reuses embeddings for unchanged symbols, reducing re-embedding and pseudo costs on small edits.
- `[SMART_REINDEX]` logging for observability and debugging.

Git commit lineage
- Controlled from the VS Code extension (`gitMaxCommits`, `gitSince`, "Upload Git History" / force command).
- Driven on clients by env vars (`REMOTE_UPLOAD_GIT_MAX_COMMITS`, `REMOTE_UPLOAD_GIT_SINCE`, `REMOTE_UPLOAD_GIT_FORCE`).
- Lineage summaries gated by `REFRAG_COMMIT_DESCRIBE` + decoder env.
- Incremental `last_head..HEAD` collection drastically reduces redundant commit reprocessing while remaining idempotent in Qdrant (stable ids per `commit_id`).

Operational robustness
- Fewer permission errors on `.codebase` and symbol caches.
- `workspace_state.py` behaves better in multi-process and multi-repo setups.

Backwards Compatibility / Rollout
Feature flags
- `SMART_SYMBOL_REINDEXING` defaults to `0`; the legacy indexing path remains unchanged when the flag is off.
- The `mini` vector is only added under `REFRAG_MODE`; when disabled, we still use only `[dense, lex]`.
- Git history is only collected when `REMOTE_UPLOAD_GIT_MAX_COMMITS` (or the extension setting) is non-zero.
- Lineage summarization only runs when `REFRAG_COMMIT_DESCRIBE=1` and decoder env is configured.

Existing collections
- Existing collections keep working as-is; backup/restore recreation is only exercised when `REFRAG_MODE` is enabled and/or when a collection must be recreated.

Rollout plan
- Enable `SMART_SYMBOL_REINDEXING=1` and the git history settings in `.env` as needed.
- Monitor logs (`reused_points` vs `embedded_points`, symbol cache behavior, git ingest logs) and Qdrant payloads (line numbers, symbol metadata, `lineage_*`).

Testing
Dev-remote (code indexing)
- Verified indexing no longer hits `mini` vector missing errors.
- `SMART_SYMBOL_REINDEXING=0` uses legacy behavior; `=1` triggers smart reindex logs and reduces embedding counts on small edits.

Dev-remote (git lineage)
- Configured `gitMaxCommits` and `gitSince`, and used the "Upload Git History" / force command.
- Verified upload clients write `metadata/git_history.json` and maintain `.context-engine/git_history_cache.json`.
- Verified `upload_service` writes `.remote-git/git_history_<bundle_id>.json`.
- Verified the watcher logs `[git_history_manifest]` and launches ingest_history.py with the correct collection and repo name.
- Verified `ingest_history.py --manifest-json` imports correctly in the container environment and upserts commit points using `stable_id(commit_id)`.
- Verified `lineage_goal` / `lineage_tags` appear when `REFRAG_COMMIT_DESCRIBE=1` and decoder env is configured.

Qdrant / MCP inspection
- Used MCP tools (`search_commits_for`, `change_history_for_path`) to query indexed commit points.
- Confirmed commit payloads include `lineage_goal`, `lineage_symbols`, `lineage_tags`.