
Smarter symbol-level reindexing, ReFRAG memory server alignment, git commit lineage, dev-remote perms + index speed improvements + worktree dedup #31

Merged
voarsh2 merged 28 commits into Context-Engine-AI:test from voarsh2:fix-ingest-refrag-collection-config-backup-restore
Dec 1, 2025

@voarsh2 voarsh2 commented Nov 26, 2025

Summary

This PR fixes a ReFRAG collection configuration mismatch between the memory server and indexer, introduces an opt‑in, symbol‑aware smart reindexing path that reuses embeddings for unchanged code while preserving line numbers, and adds optional git commit lineage indexing wired through the VS Code extension and remote upload pipeline. It also hardens workspace state and symbol cache handling for shared volumes in dev‑remote.


Problem

  • ReFRAG collection mismatch

    • Memory server created collections with only [dense, lex] vectors, ignoring REFRAG_MODE.
    • Indexer expected [dense, lex, mini] vectors when ReFRAG was enabled, causing Not existing vector name: mini errors.
    • Qdrant does not support adding new vector names to existing collections via update_collection().
    • update_collection() calls were effectively no‑ops, leading to indexing failures and retry loops.
    • Collection recreation attempts were failing / not preserving data, causing the indexer to get stuck.
  • Inefficient full‑file reindexing

    • Any change to a file triggered a full delete+reinsert of all chunks and re‑embedding, even when most symbols were unchanged.
    • Pseudo/tag generation ran for entire files repeatedly, increasing GLM usage.
    • Line number accuracy is critical, but we had no symbol‑level strategy to reuse embeddings safely.
  • No integrated git commit lineage pipeline

    • There was no end‑to‑end path from a workspace’s git history into Qdrant as first‑class commit points.
    • Commit lineage metadata (lineage_goal, lineage_symbols, lineage_tags) was only available in ad‑hoc CLI experiments, not through the dev‑remote watcher + upload service.
    • The VS Code extension had no way to trigger or scope git history uploads (max commits, --since, force).
    • We had no durable, queryable commit index that could power search_commits_for / change_history_for_path in normal workflows.
  • Git history pipeline and initial inefficiency

    • There was no end‑to‑end git history pipeline wired into the dev‑remote flow:
      • Upload clients didn’t emit git history manifests in bundles.
      • upload_service / watch_index weren’t extracting or ingesting git history into Qdrant.
    • When we first wired git history into the upload → upload_service → watcher → ingest_history.py pipeline, the initial client implementation:
      • Recomputed a full REMOTE_UPLOAD_GIT_MAX_COMMITS window from HEAD on each run.
      • Re‑embedded and re‑summarized every commit in that window whenever HEAD changed, with no notion of a “last ingested head”.
      • Had no extension‑level git cache; small new commits could still trigger hundreds of redundant decoder + embedding calls.
    • The remote side already stored git bundles under .remote-git, but this store was only used as a drop location for manifests, not to guide incremental ingestion or provide guarantees about what had been processed.
  • Dev‑remote permissions and cache robustness

    • Mixed UIDs (root vs non‑root) and bind‑mounted volumes caused .codebase permission errors, especially for symbol caches.
    • Corrupt or empty cache.json and symbol cache files could break indexing, with poor recovery behavior.

Solution

1. ReFRAG collection configuration + memory backup/restore

  • Respect REFRAG_MODE in memory server

    • Add REFRAG_MODE handling to the memory server’s _ensure_collection() so new collections are created with the same vector layout the indexer expects:
      • ReFRAG off: [dense, lex].
      • ReFRAG on: [dense, lex, mini].
  • Safe collection recreation with memory preservation

    • Implement a backup/restore path in ingest_code.py’s ensure_collection():
      • Before any destructive recreation, export “memories” defined as points without file_path in their payload.
      • Recreate the collection with the desired vector configuration (including mini when ReFRAG is enabled).
      • Restore memories into the new collection, allowing partial vector support (e.g., older memories without mini).
    • Add proper error handling and logging around collection recreation so failures are visible and don’t create silent loops.

Result: Memory server and indexer now agree on vector names from the start; if we ever do need to recreate a collection, user memories are preserved and upgraded safely.
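
A minimal sketch of the two rules in this section, for reviewers. It assumes `REFRAG_MODE` is read as a truthy string and that `file_path` is a top-level payload key; the helper names `vector_names_for` and `is_memory_point` are hypothetical, and the real `_ensure_collection()` / `ensure_collection()` build Qdrant vector configs rather than plain name lists.

```python
from typing import Any, List, Mapping, Optional

# Assumption: which spellings count as "enabled" is not stated in the PR.
TRUTHY = {"1", "true", "yes", "on"}


def vector_names_for(refrag_mode: Optional[str]) -> List[str]:
    """Return the named vectors a new collection should be created with."""
    names = ["dense", "lex"]
    if (refrag_mode or "").strip().lower() in TRUTHY:
        names.append("mini")  # ReFRAG on: indexer also writes the mini vector
    return names


def is_memory_point(payload: Optional[Mapping[str, Any]]) -> bool:
    """Treat a point as a 'memory' when its payload carries no file_path."""
    return not (payload or {}).get("file_path")
```

Both the memory server and the indexer would need to derive the layout from the same rule, which is the point of the fix: the mismatch came from only one side consulting `REFRAG_MODE`.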


2. Symbol‑aware smart reindexing (opt‑in via SMART_SYMBOL_REINDEXING)

  • Symbol cache and change detection

    • Add symbol‑level cache APIs in workspace_state.py:
      • _get_symbol_cache_path(file_path) → per‑repo .codebase/repos/<repo>/symbols_<hash>.json using the same repo detection as other state helpers.
      • get_cached_symbols(file_path) / set_cached_symbols(file_path, symbols, file_hash) → store symbol metadata + content hashes.
      • get_cached_pseudo(...), set_cached_pseudo(...), update_symbols_with_pseudo(...) → cache pseudo/tags per symbol without creating empty symbol files.
      • compare_symbol_changes(old_symbols, new_symbols) → compute (unchanged_symbols, changed_symbols) using content hashes.
      • remove_cached_symbols(file_path) → cleanly remove symbol cache when a file is deleted (used by watcher).
  • Smart reindex pipeline (new code path)

    • Implement process_file_with_smart_reindexing(...) in ingest_code.py:
      • Extract current symbols and load cached symbols.
      • Compute unchanged vs changed symbol IDs.
      • Load existing Qdrant points for the file and index them by (symbol_id, code_text) for reuse.
      • Re‑chunk the file using the existing chunking strategy (micro/semantic/line, as configured).
      • For each chunk:
        • Use should_process_pseudo_for_chunk(...) + generate_pseudo_tags(...) + set_cached_pseudo(...) to only run GLM for changed symbols/chunks.
        • Attempt to reuse an existing embedding when (symbol_id, code_text) matches; otherwise re‑embed.
      • Replace all points for the file via delete_points_by_path(...) + upsert_points(...), preserving line numbers.
      • Update symbol + file‑hash caches only on success.
      • Emit detailed logs:
        • unchanged_symbols, changed_symbols, and per‑file summary:
          chunks=X, reused_points=Y, embedded_points=Z.
  • Integration and gating

    • Indexer (index_repo)

      • Under SMART_SYMBOL_REINDEXING and should_use_smart_reindexing(...), call process_file_with_smart_reindexing(...).
      • On "success" → count as indexed and skip legacy path.
      • On "failed" → fall back to index_single_file(...).
      • Continue to use skip_unchanged and symbol change heuristics to limit smart reindexing only when it’s beneficial.
    • Watcher (watch_index.py)

      • On file deletion:
        • Remove file‑hash cache and symbol cache (remove_cached_symbols) to keep caches consistent.
      • On file changes in _process_paths(...):
        • If _smart_symbol_reindexing_enabled():
          • Read file text, detect language, compute hash.
          • Call should_use_smart_reindexing(path, file_hash) and interpret smart_reason.
          • Support bootstrap: if smart_reason == "no_cached_symbols", run smart reindex once to seed the cache.
          • Log decisions: [SMART_REINDEX][watcher] Using {smart/full} ... ({reason}).
          • On success → done; on failure → fall back to index_single_file(...).
    • Feature flag

      • Add SMART_SYMBOL_REINDEXING to .env.example (default 0):
        # Smarter re-indexing for symbol cache, reuse embeddings and reduce decoder/pseudo tags to re-index
        SMART_SYMBOL_REINDEXING=0

Result: When enabled, both batch indexer and watcher reuse embeddings for unchanged code at a symbol level, drastically reducing re‑embedding and pseudo costs, while keeping a clean escape hatch back to the legacy path.
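
The change detection and reuse steps above can be sketched roughly as follows. This is an illustration under assumptions, not the code in `workspace_state.py` or `ingest_code.py`: symbols are modelled as a plain `{symbol_id: content_hash}` mapping, the hash is SHA-1, and `split_symbols_by_hash` and `embed_with_reuse` are hypothetical names.

```python
import hashlib
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]
ChunkKey = Tuple[str, str]  # (symbol_id, code_text)


def content_hash(text: str) -> str:
    """Hash symbol source text; the real cache may use a different digest."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def split_symbols_by_hash(
    old: Dict[str, str], new: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Split the symbol IDs in `new` into (unchanged, changed) by content hash."""
    unchanged: List[str] = []
    changed: List[str] = []
    for symbol_id, digest in new.items():
        (unchanged if old.get(symbol_id) == digest else changed).append(symbol_id)
    return unchanged, changed


def embed_with_reuse(
    chunks: Sequence[ChunkKey],
    existing: Dict[ChunkKey, Vector],
    embed: Callable[[str], Vector],
) -> Tuple[List[Vector], int, int]:
    """Reuse a stored vector when (symbol_id, code_text) matches, else re-embed."""
    vectors: List[Vector] = []
    reused = embedded = 0
    for key in chunks:
        if key in existing:
            vectors.append(existing[key])
            reused += 1
        else:
            vectors.append(embed(key[1]))
            embedded += 1
    return vectors, reused, embedded
```

Keying reuse on the exact chunk text (not only the symbol ID) is what keeps reuse safe: a chunk whose text changed never inherits a stale vector, while line numbers still come from the fresh re-chunking.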


3. Git commit history ingestion and lineage search (opt‑in)

  • Upload clients: manifest + cache

    • Extend remote_upload_client.py and standalone_upload_client.py with _collect_git_history_for_workspace(...):
      • Read REMOTE_UPLOAD_GIT_MAX_COMMITS, REMOTE_UPLOAD_GIT_SINCE, REMOTE_UPLOAD_GIT_FORCE from the environment.
      • When max_commits > 0, collect recent commits and write metadata/git_history.json into each bundle.
    • Maintain a git history cache at:
      • <workspace>/.context-engine/git_history_cache.json
      • Fields: last_head, max_commits, since, updated_at.
    • Incremental history:
      • First run (or REMOTE_UPLOAD_GIT_FORCE=1): git rev-list --no-merges [--since] HEAD truncated to max_commits.
      • Subsequent runs (no force, cache present):
        • Use git rev-list --no-merges [--since] <last_head>..<HEAD> so only new commits since the last ingested head are collected.
        • Avoids re‑embedding/re‑summarizing a large window (e.g. 500 commits) for every small change to HEAD.
  • Upload service: .remote-git manifest store

    • upload_service.py now:
      • Extracts metadata/git_history.json from incoming bundles.
      • Writes them under:
        • <workspace>/.remote-git/git_history_<bundle_id>.json
      • This forms a durable, append‑only store of git bundles for that workspace.
  • Watcher: manifest routing

    • watch_index.py:
      • Explicitly enqueues .remote-git/*.json files, bypassing normal CODE_EXTS filters.
      • In _process_paths(...), routes these to _process_git_history_manifest(...).
    • _process_git_history_manifest(...):
      • Spawns ingest_history.py --manifest-json /work/.../.remote-git/git_history_<bundle_id>.json as a subprocess.
      • Propagates COLLECTION_NAME, QDRANT_URL, and REPO_NAME into the child process environment.
      • Logs a [git_history_manifest] line for observability.
  • Ingest script: manifest mode + lineage summaries

    • ingest_history.py:
      • Supports --manifest-json mode that:
        • Reads a git history manifest (git_history.json).
        • For each commit, builds a text payload from message, files, and diff.
        • Embeds it via fastembed.
        • Uses stable_id(commit_id) as the Qdrant point id for idempotent upserts.
      • Uses generate_commit_summary(...) to produce:
        • lineage_goal, lineage_symbols, lineage_tags.
        • Gated by REFRAG_COMMIT_DESCRIBE=1 and decoder env (REFRAG_DECODER, REFRAG_RUNTIME, GLM/llamacpp configs).
      • Ensures scripts is importable when run as a subprocess by adding the project root to sys.path.
  • MCP tools and queries

    • Commit points are stored in the same collection as code, with:
      • kind="git_message" (or equivalent metadata), commit_id, author_name, authored_date, files, message, lineage_*.
    • Existing MCP tools:
      • search_commits_for and change_history_for_path can now:
        • Filter and rank commits by message and lineage tags/goals.
        • Power “when/why did behavior X change?” workflows described in the commit indexing docs.
  • VS Code extension wiring

    • The context-engine-uploader extension:
      • Adds settings:
        • contextEngineUploader.gitMaxCommits (default 500).
        • contextEngineUploader.gitSince (empty by default).
        • A command to force git history upload (sets REMOTE_UPLOAD_GIT_FORCE=1 for a run).
      • Wires these settings into the env for remote_upload_client.py / standalone_upload_client.py.
      • Documents the git history behavior and settings in the extension README.
    • .env.example:
      • Documents REMOTE_UPLOAD_GIT_MAX_COMMITS, REMOTE_UPLOAD_GIT_SINCE, and REFRAG_COMMIT_DESCRIBE=1 as the toggle for commit lineage summarization.

Result: We now have an end‑to‑end, opt‑in pipeline for git commit history and lineage indexing, driven by the VS Code extension and upload clients, with incremental behavior and MCP‑friendly commit points in Qdrant.
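
To illustrate why re-ingesting the same manifest is idempotent, here is a hedged sketch of a deterministic point ID and the per-commit text payload. The project's `stable_id(...)` may be implemented differently; the UUIDv5 namespace, `commit_point_id`, `commit_text`, and the `max_diff_chars` cap are assumptions made for this example.

```python
import uuid
from typing import Any, Mapping

# Assumption: any fixed namespace gives stable IDs; the real stable_id() may differ.
_NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")


def commit_point_id(commit_id: str) -> str:
    """Same commit_id always maps to the same point ID, so upserts overwrite."""
    return str(uuid.uuid5(_NAMESPACE, commit_id))


def commit_text(commit: Mapping[str, Any], max_diff_chars: int = 2000) -> str:
    """Build the text to embed from message, files, and (truncated) diff."""
    files = "\n".join(commit.get("files", []))
    diff = (commit.get("diff") or "")[:max_diff_chars]
    parts = (commit.get("message", ""), files, diff)
    return "\n\n".join(part for part in parts if part)
```

Because the ID depends only on `commit_id`, processing an overlapping manifest twice rewrites the same points instead of creating duplicates.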


4. Workspace state + dev‑remote permissions hardening

  • State and symbol cache permissions

    • _atomic_write_state(...) now chmods state.json / cache.json to 0664 so multiple processes (upload, watcher, indexer) can update them on shared volumes.
    • set_cached_symbols(...) ensures symbol cache files are created with 0664 and parent dirs with 0775, improving multi‑process access on .codebase volumes.
  • Dev‑remote compose alignment

    • Run indexer-dev-remote as user: "1000:1000", matching watcher and MCP services, so they can all write to .codebase without root‑owned artifacts.
    • Keep upload-service-dev-remote as user: "0:0" only where needed for Windows /work bind mounts; it no longer has to own .codebase.
    • Combined with the chmod changes above, this resolves the recurring symbol cache permission errors in dev‑remote.
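
A minimal sketch of the write-then-chmod pattern described above, assuming a POSIX filesystem. `atomic_write_json` is a hypothetical name; the actual `_atomic_write_state(...)` may differ in details such as locking or fsync.

```python
import json
import os
import tempfile


def atomic_write_json(path: str, data: dict, mode: int = 0o664) -> None:
    """Write JSON to a temp file, widen its permissions, then rename into place."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle)
        # mkstemp creates the file as 0600; widen it so other services
        # sharing the volume can read and update it.
        os.chmod(tmp_path, mode)
        os.replace(tmp_path, path)  # atomic on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

An explicit `chmod` is used rather than relying on the process umask, since the services writing to `.codebase` may run with different umasks.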

Impact

  • ReFRAG / memory server

    • Collections created by the memory server now match the indexer’s expectations:
      • Correct vector set (dense, lex, mini where applicable).
    • Eliminates:
      • Not existing vector name: mini and HTTP 400 errors.
      • Silent update_collection() failures that previously left collections in a half‑configured state.
    • Enables ReFRAG mode to work correctly without manual collection recreation.
    • Memory backup/restore preserves user “memories” (points without file_path) across any future collection config changes.
  • Smart symbol reindexing

    • Opt‑in via SMART_SYMBOL_REINDEXING=1; default is off for safety.
    • When enabled:
      • Reuses existing embeddings for unchanged chunks mapped to unchanged symbols (no re-embedding for those chunks).
      • Greatly reduces GLM decoder/pseudo calls for small edits by only regenerating pseudo for changed symbols/chunks.
      • Still rebuilds the full set of points for each changed file (delete + insert), so line numbers and chunk boundaries are always based on the latest text.
      • Provides rich [SMART_REINDEX] logging for observability and debugging.
  • Git commit lineage

    • Optional git history indexing, controlled by:
      • Extension settings (gitMaxCommits, gitSince, “Upload Git History” / force command).
      • Env flags (REMOTE_UPLOAD_GIT_MAX_COMMITS, REMOTE_UPLOAD_GIT_SINCE, REMOTE_UPLOAD_GIT_FORCE).
      • Lineage summarization toggle via REFRAG_COMMIT_DESCRIBE + decoder env.
    • Incremental uploads via last_head..HEAD drastically reduce redundant commit reprocessing while remaining idempotent in Qdrant (stable ids per commit_id).
    • Commit points are now first‑class citizens in the collection, enabling better lineage workflows via MCP tools.
  • Operational robustness

    • Fewer dev‑remote permission flakes on .codebase and symbol caches.
    • workspace_state.py behaves better in multi‑process and multi‑repo setups.
    • Clearer logs and error paths around collection creation/recreation, smart reindex decisions, and git history ingest.

Backwards Compatibility / Rollout

  • Feature flags

    • SMART_SYMBOL_REINDEXING is default 0; legacy indexing path remains unchanged when the flag is off.
    • ReFRAG behavior is driven by REFRAG_MODE; when disabled, we still use only [dense, lex].
    • Git commit lineage is opt‑in:
      • Disabled unless REMOTE_UPLOAD_GIT_MAX_COMMITS (or the extension setting) is non‑zero.
      • Lineage summaries only generated if REFRAG_COMMIT_DESCRIBE=1 and decoder env is configured.
  • Existing collections

    • Existing collections continue to work; new ReFRAG‑aware logic only kicks in when REFRAG_MODE is enabled and/or when a collection must be recreated.
    • Memory backup/restore ensures we do not lose user memories during controlled recreations.
  • Rollout plan

    • Start with dev‑remote / non‑critical environments:
      • Enable SMART_SYMBOL_REINDEXING=1.
      • Configure git history settings in the extension and .env as needed.
      • Verify logs (reused_points vs embedded_points, symbol cache behavior, git ingest logs) and Qdrant payloads (line numbers, symbol metadata, lineage_*).
    • Once validated, consider enabling smart reindexing and git history indexing in a canary cluster before broader rollout.

Testing

  • Dev-remote (code indexing)

    • Ran full repo indexing with ReFRAG enabled; verified:
      • No mini vector missing errors.
      • Memory server and indexer agree on vector configuration.
    • Confirmed SMART_SYMBOL_REINDEXING=0 uses legacy behavior; =1 triggers smart reindex logs and reduces embedding counts on small edits.
    • Verified watcher path:
      • Smart reindex used when symbol cache exists or bootstrap is needed.
      • Fallback to full index_single_file(...) on any smart reindex failure.
      • Symbol caches and file-hash caches removed on file deletes.
  • Dev-remote (git lineage)

    • From the VS Code extension:
      • Configured gitMaxCommits and gitSince, and used the “Upload Git History” / force command.
      • Observed upload clients writing metadata/git_history.json and maintaining .context-engine/git_history_cache.json.
      • Confirmed upload_service writes .remote-git/git_history_<bundle_id>.json.
      • Confirmed watch_index.py logs [git_history_manifest] and launches ingest_history.py with the correct collection and repo name.
    • Ingest:
      • Verified ingest_history.py --manifest-json imports correctly in the container environment and upserts commit points using stable_id(commit_id).
      • Observed non‑empty lineage_goal / lineage_tags when REFRAG_COMMIT_DESCRIBE=1 and decoder env is configured.
  • Qdrant / MCP inspection

    • Scrolled points for selected files before/after small edits:
      • Same or reduced point counts.
      • Updated chunks have correct line ranges and metadata.
      • Unchanged chunks reuse prior IDs/vectors where expected.
    • Used MCP tools (search_commits_for, change_history_for_path) to:
      • Retrieve commits with populated lineage_goal, lineage_symbols, lineage_tags.
      • Confirm that extension‑driven git uploads produce queryable commit history wired into the same collection as code.

…er for ReFRAG mode

  Problem:
  - Memory server created collections with only [dense, lex] vectors, ignoring
  REFRAG_MODE
  - Indexer expected [dense, lex, mini] vectors, causing "Not existing vector name:
  mini" errors
  - Qdrant doesn't support adding new vector names to existing collections via
  update_collection()
  - update_collection() failed silently, leading to indexing failures and loops
  - Collection recreation attempts were failing, causing indexing to get stuck

  Solution:
  - Add REFRAG_MODE support to mcp_memory_server.py _ensure_collection() function
  - Implement memory backup/restore system in ingest_code.py ensure_collection() for
  future recreation needs
  - Export memories (points without file_path) before any recreation attempt
  - Restore memories with partial vector support for new configurations
  - Add proper error handling and logging for collection recreation scenarios

  Impact:
  - Memory server now creates collections with correct [dense, lex, mini] vectors from
  start
  - Eliminates indexing failures and loops caused by missing mini vector
  - Fixes HTTP 400 errors about missing "mini" vector
  - Enables proper ReFRAG mode functionality without requiring recreation
  - Preserves user memories during any future collection configuration changes
  - Backward compatible and future-proof for additional vector changes
@voarsh2 voarsh2 changed the title Fix collection configuration mismatch between memory server and index… Fix(coll config mismatch) between memory server and indexer for refrag mode Nov 26, 2025
Uses dedicated backup and restore scripts for handling memory persistence during collection recreation.

This change replaces the in-line memory backup and restore logic with calls to separate, more robust and testable scripts (`memory_backup.py` and `memory_restore.py`). These scripts provide better error handling, logging, and are designed to be more resilient to changes in the Qdrant client.

The scripts are now invoked as subprocesses, ensuring better isolation and management of the backup/restore operations. The ingest code now only handles the overall orchestration and error reporting.

Adds `--skip-collection-creation` option to memory restore script to allow restoration of memories into a collection that's already initialized. This is specifically useful when `ingest_code.py` handles collection creation.

This change improves maintainability and reduces the complexity of the `ingest_code.py` script.
Adds a try-except block when loading the workspace cache to handle cases where the cache file is corrupt or empty.
If an exception occurs during loading, the cache is recreated.
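
A hedged sketch of that recovery behavior, assuming the cache is a JSON object on disk; `load_cache` is a hypothetical helper and the real loader in the workspace state code may log or back up the bad file first.

```python
import json
import os


def load_cache(path: str) -> dict:
    """Load a JSON cache, recreating it as empty if missing, empty, or corrupt."""
    try:
        with open(path, "r", encoding="utf-8") as handle:
            data = json.load(handle)
        if not isinstance(data, dict):
            raise ValueError("cache root must be a JSON object")
        return data
    except (OSError, ValueError):
        # json.JSONDecodeError is a ValueError subclass, so an empty or
        # truncated file lands here as well as a missing one.
        fresh: dict = {}
        os.makedirs(os.path.dirname(os.path.abspath(path)), exist_ok=True)
        with open(path, "w", encoding="utf-8") as handle:
            json.dump(fresh, handle)
        return fresh
```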
Implements smarter re-indexing strategy that reuses embeddings and reduces unnecessary re-indexing by leveraging a symbol cache.

This change introduces symbol extraction using tree-sitter to identify functions, classes, and methods in code files. It compares the symbols against a cache to determine which parts of the code have changed, allowing for targeted re-indexing of only the modified sections. This significantly reduces the processing time and resource consumption associated with indexing large codebases.

Adds the ability to reuse existing embeddings/lexical vectors for unchanged code chunks (identified by code content), and re-embed only changed chunks improving efficiency and overall performance.

Also, includes logic for improved pseudo-tag generation.
@voarsh2 voarsh2 changed the title Fix(coll config mismatch) between memory server and indexer for refrag mode Smarter symbol-level reindexing, ReFRAG memory server alignment, and dev-remote perms Nov 26, 2025
Enables the collection and indexing of git commit history for enhanced context lineage capabilities.

- Introduces configuration options to control the depth and scope of git history ingestion.
- Implements mechanisms to extract commit metadata, diffs, and lineage information.
- Integrates git history into the existing indexing pipeline, allowing agents to reason about code evolution.
- Provides a new command to trigger a force sync to upload git history.
Improves efficiency of git history collection by fetching only the commits since the last successful upload, instead of always starting from the beginning.

It constructs a `git rev-list` command that fetches only the commits between the previous HEAD and the current HEAD, if both are available and different. This reduces the amount of data that needs to be processed, improving performance. If either HEAD is missing or they are the same, the command defaults to fetching the entire history from HEAD.
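
The command selection described above can be sketched as below. This is an assumption-laden illustration: `rev_list_args` is a hypothetical name, and applying the commit cap through `--max-count` on every run is a choice made for the example (the upload clients may truncate the list in Python instead).

```python
from typing import List, Optional


def rev_list_args(
    head: str,
    last_head: Optional[str] = None,
    since: Optional[str] = None,
    max_commits: int = 0,
    force: bool = False,
) -> List[str]:
    """Build a `git rev-list` argv: incremental range when possible, else from HEAD."""
    args = ["git", "rev-list", "--no-merges"]
    if since:
        args.append(f"--since={since}")
    if max_commits > 0:
        args.append(f"--max-count={max_commits}")
    # Incremental only when both heads are known, differ, and no force was requested.
    incremental = bool(last_head) and bool(head) and last_head != head and not force
    args.append(f"{last_head}..{head}" if incremental else (head or "HEAD"))
    return args
```

For example, `rev_list_args("bbb", last_head="aaa")` yields a `aaa..bbb` range, while a forced run or a missing cache falls back to walking from the current head.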
@voarsh2 voarsh2 changed the title Smarter symbol-level reindexing, ReFRAG memory server alignment, and dev-remote perms Smarter symbol-level reindexing, ReFRAG memory server alignment, git commit lineage, and dev-remote perms Nov 27, 2025
Adds a `STRICT_MEMORY_RESTORE` option to control whether memory restoration failures should halt the reindexing process.
Introduces a comprehensive test suite for memory backup and restore operations and fixes an issue where point IDs were not correctly converted to integers during restoration.
- Bump extension version
- When contextEngineUploader.targetPath is unset, derive the CTX workspace
  from the VS Code folder:
  - If the workspace folder has .context-engine, ctx_config.json, .codebase/state.json, or .git, treat it as the root.
  - Otherwise, scan one level of child directories; if exactly one child matches those markers, use it. If zero or many, fall back to the workspace folder.
- Keep explicit contextEngineUploader.targetPath as the single source of
  truth when configured.
- When running Prompt+ (ctx.py), set CTX_WORKSPACE_DIR using the same
  target-path/auto-detection logic so ctx.py reads .env/ctx_config.json
  from the same CTX workspace as the uploader/indexer.
- No server-side behavior changes; only VS Code extension workspace
  detection and Prompt+ wiring are updated.
@voarsh2 voarsh2 marked this pull request as ready for review November 27, 2025 20:55
- docs: document SSE vs HTTP MCP transports and recommend HTTP /mcp endpoints
  for IDEs (Claude, Windsurf, etc.), including notes on the FastMCP SSE init race
- vscode-extension: add mcpTransportMode setting to choose between mcp-remote
  SSE and direct HTTP MCP for Claude/Windsurf configs
- vscode-extension: add autoWriteMcpConfigOnStartup to refresh .mcp.json,
  Windsurf mcp_config.json, and the Claude hook on extension activation
- vscode-extension: update extension README to describe MCP transport modes
  and startup MCP config behavior
- vscode-extension: improve targetPath auto-detection to prefer a single git/
  .codebase child repo under the workspace root
- Always probe the configured pythonPath with bundled python_libs before doing
  anything else
- If that fails, auto-detect a working system Python (python3/python/py/Homebrew)
  via detectSystemPython and reuse the bundled libs
- Only as a last resort, prompt to create a private venv and pip-install deps,
  then switch to that interpreter
- Reduces spurious venv prompts when switching between systems where only
  `python` or `python3` is available
- remote_upload_client/standalone_upload_client: decode git subprocess output
  as UTF-8 with errors='replace' instead of relying on Windows cp1252 locale
- Prevents noisy UnicodeDecodeError stack traces during uploads on Windows
  while keeping git_history manifests usable
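
A small sketch of the decoding change, using only the standard `subprocess` API. `run_text` is a hypothetical wrapper, not the function used in the upload clients.

```python
import subprocess
from typing import List


def run_text(cmd: List[str]) -> str:
    """Run a command and decode stdout as UTF-8, replacing undecodable bytes."""
    result = subprocess.run(
        cmd,
        capture_output=True,
        encoding="utf-8",   # do not depend on the platform locale (e.g. cp1252)
        errors="replace",   # bad bytes become U+FFFD instead of raising
        check=False,
    )
    return result.stdout


# Usage (requires git on PATH and a repository in the current directory):
# subject = run_text(["git", "log", "-1", "--format=%s"])
```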
Configures shared volume and environment variables
to enable model caching for context-engine components.
This reduces redundant downloads and speeds up processing.
…tection flow

- Update writeCtxConfig() to call ensurePythonDependencies before probing --show-mapping
- Re-resolve upload client options after Python detection so pythonOverridePath is honored
- Prevent Windows Store python/python3 aliases from breaking collection inference
- Keep inferCollectionFromUpload() behavior the same once a valid interpreter is selected
- Use as_posix() for all relative paths in standalone and remote upload clients
- Ensure operations.json uses forward-slash paths (e.g. scripts/foo.py)
- Match tar layout so upload_service can extract files from Windows bundles
- Fixes issue where only top-level files appeared under /work/<repo>-<hash> on the cluster
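
For illustration, the path normalization amounts to something like the following; `bundle_relpath` is a hypothetical helper name. `PureWindowsPath` is used so the example behaves the same on any OS.

```python
from pathlib import PurePath, PureWindowsPath


def bundle_relpath(path: PurePath, root: PurePath) -> str:
    """Return `path` relative to `root` with forward slashes, as stored in the bundle."""
    return path.relative_to(root).as_posix()


# A Windows client path becomes a tar-friendly relative path.
example = bundle_relpath(
    PureWindowsPath(r"C:\repo\scripts\foo.py"), PureWindowsPath(r"C:\repo")
)
```

Here `example` is `scripts/foo.py`, whereas `str()` on the same relative Windows path would keep the backslash.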
Configures the `HF_HOME`, `XDG_CACHE_HOME`, and `HF_HUB_CACHE` environment variables for both the HTTP server and the indexer.

This change ensures that both components use a consistent Hugging Face cache directory, preventing redundant downloads and improving efficiency.
Use watchdog's PollingObserver when WATCH_USE_POLLING is set so the watcher can see file changes made by other pods on the shared /work PVC (NFS/CephFS), where inotify events are not reliably propagated across nodes. Default behavior remains unchanged when the env var is unset.
- Make ctx.py trust the server-chosen `path` field when formatting search results, falling back to `host_path`/`container_path` only when `path` is missing. This centralizes display-path policy in the indexer (`hybrid_search` + PATH_EMIT_MODE) instead of duplicating it in the CTX CLI / hook.

- Fix ingest_code.py host_path mapping for remote upload workspaces.
  When indexing under /work/<repo-slug> and origin.source_path is set, drop the leading slug segment and map:
    /work/<slug>/…  →  <origin.source_path>/…
  so metadata.host_path is a clean, user-facing path rooted at the original client workspace, without embedding the slug directory.

- Update docker-compose.dev-remote PATH_EMIT_MODE from `container` to `auto` for indexer / MCP services so hybrid_search prefers host_path when available and only falls back to container_path. This lets CTX and the VS Code extension show real host/workspace paths derived from
  origin.source_path, while still allowing deployments to force pure container paths by setting PATH_EMIT_MODE=container if desired.

Overall, CTX hook output now surfaces consistent, user-facing paths (e.g. /home/.../Context-Engine/…), while container-style /work paths remain available as an explicit server-side configuration choice.
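
The slug-dropping rule for `host_path` can be sketched as follows, assuming a Unix-style `origin.source_path` and a `/work` root. `map_host_path` is a hypothetical name and the real logic in ingest_code.py may handle more cases.

```python
import posixpath
from typing import Optional


def map_host_path(
    container_path: str, source_path: Optional[str], work_root: str = "/work"
) -> str:
    """Map /work/<slug>/rest to <source_path>/rest; otherwise keep the container path."""
    if not source_path:
        return container_path
    rel = posixpath.relpath(container_path, work_root)
    if rel == ".." or rel.startswith("../"):
        return container_path  # not under the work root, nothing to map
    remainder = rel.split("/")[1:]  # drop the leading slug segment
    return posixpath.join(source_path, *remainder) if remainder else source_path
```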
Previously, multi-repo indexing called ensure_collection() and
ensure_payload_indexes() inside the per-file loop. In large
workspaces this meant O(#files × #indexed_fields) Qdrant control-plane traffic (many PUT /collections/<coll>/index?wait=true calls), even on the "Skipping unchanged file (cache)" path.

This change introduces ENSURED_COLLECTIONS and
ensure_collection_and_indexes_once() in ingest_code:

- Single-collection index_repo:
  - Still recreates the collection when --recreate is set.
  - Now uses ensure_collection_and_indexes_once() so collection + payload indexes are ensured once per process, not repeatedly.

- Multi-repo index_repo:
  - For each per-repo collection, calls
    ensure_collection_and_indexes_once() the first time it is seen.
  - Subsequent files in the same collection (including cached
    "Skipping unchanged file (cache)" files) no longer trigger
    extra ensure_collection / create_payload_index calls.
  - Net effect: Qdrant index setup overhead becomes
    O(#collections × #indexed_fields) per process instead of
    O(#files × #indexed_fields).

watch_index is updated to use the same helper:

- On startup, ensures the default collection once.
- In _process_paths(), ensures each repo's collection once per
  watcher process, then relies on cached state for subsequent
  file events, avoiding repeated index-setup chatter.
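
A minimal sketch of the once-per-process guard. `ENSURED_COLLECTIONS` is named in the commit message; the `ensure_once` helper and its injected callable are assumptions for this example, and the real `ensure_collection_and_indexes_once()` takes the Qdrant client and calls the ensure functions directly.

```python
from typing import Callable, Set

ENSURED_COLLECTIONS: Set[str] = set()


def ensure_once(name: str, ensure: Callable[[str], None]) -> bool:
    """Run `ensure(name)` the first time a collection is seen in this process."""
    if name in ENSURED_COLLECTIONS:
        return False  # already ensured: no Qdrant control-plane calls
    ensure(name)
    # Record only after success, so a failed ensure is retried on the next file.
    ENSURED_COLLECTIONS.add(name)
    return True
```

With this guard inside the per-file loop, the ensure work runs once per collection rather than once per file, which is where the O(#collections × #indexed_fields) bound comes from.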
@voarsh2 voarsh2 changed the title Smarter symbol-level reindexing, ReFRAG memory server alignment, git commit lineage, and dev-remote perms Smarter symbol-level reindexing, ReFRAG memory server alignment, git commit lineage, dev-remote perms + index speed improvements Nov 28, 2025
Refines the guidance for AI agents when deciding between using the MCP Qdrant-Indexer and literal search/file-open.

The changes emphasize the MCP Qdrant-Indexer as the primary tool for exploration, debugging, and understanding code and history, reserving literal search/file-open for narrow, exact-literal lookups.

It also simplifies the heuristics for tool selection and removes redundant descriptions of tool documentation.
…mote compose stack to run after git clone without creating the dirs yourself
Introduces a new "Getting Started" guide for quickly trying Context Engine with VS Code and the dev-remote stack.

Updates documentation links across all documents to include the new Getting Started guide.
@voarsh2 voarsh2 marked this pull request as draft November 29, 2025 13:58
voarsh2 commented Nov 29, 2025

Investigating a Windows-related host_path bug where the container "/work" prefix is prepended to the Windows path, and repo_search results that seem to come back empty, with no code snippets.
Moved back to draft until I can identify and fix these.

…er snippets

Bug:
hybrid_search.run_hybrid_search was updated to emit host paths in the path field (with PATH_EMIT_MODE=auto preferring host_path), while still attaching container_path for the /work/... mirror.
mcp_indexer_server.repo_search and context_answer
 snippet helpers still assumed path was a /work path and refused to read files when path pointed outside /work.
Result:
repo_search returned empty snippet fields, and context_answer’s identifier-based span filtering lost access to filesystem snippets, even though Qdrant payloads and /work mounts were correct.
Fix:
In repo_search’s _read_snip, prefer item["container_path"] for filesystem reads, falling back to item["path"] only when container_path is missing.
Still enforce the /work sandbox: resolve the candidate path and bail if realpath is not under /work.
In context_answer’s _read_span_snippet, do the same: prefer span["container_path"] over span["path"] when resolving the file used to build _ident_snippet.
Leave path unchanged in both APIs so callers continue to see host‑centric paths, while internal snippet I/O always targets the server’s /work/... tree.
Correctness / compatibility:
Works for both local Linux and remote Windows uploads:
Remote clients send host paths only in host_path; the indexer populates container_path under /work/..., which is what the server now uses for reads.
Existing points that predate dual-path metadata still work via the fallback to path when container_path is absent.
The /work realpath guard is retained, so snippet reads remain sandboxed to the mounted workspace.
Existing tests like test_repo_search_snippet_strict_cap_after_highlight continue to pass, and manual repo_search calls in the dev‑remote stack now return non-empty snippets without reindexing.
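
The path selection and sandbox check described under "Fix" can be sketched roughly as follows. This is a minimal illustration, not the actual `_read_snip` or `_read_span_snippet` code: the function names `resolve_snippet_path` and `read_snippet`, the `WORK_ROOT` constant, and the line-range arguments are assumptions made for the example.

```python
import os

WORK_ROOT = "/work"  # assumed mount point of the server-side workspace


def resolve_snippet_path(item, work_root=WORK_ROOT):
    """Pick the file to read for a search hit, or None if it is not readable.

    Prefers item["container_path"]; falls back to item["path"] for older
    points that predate dual-path metadata.
    """
    candidate = item.get("container_path") or item.get("path")
    if not candidate:
        return None
    real = os.path.realpath(candidate)
    root = os.path.realpath(work_root)
    # Sandbox guard: the resolved path must be the root or live beneath it.
    if real != root and not real.startswith(root + os.sep):
        return None
    return real


def read_snippet(item, start_line, end_line, work_root=WORK_ROOT):
    """Return lines start_line..end_line (1-based, inclusive), or "" on failure."""
    path = resolve_snippet_path(item, work_root)
    if path is None:
        return ""
    try:
        with open(path, "r", encoding="utf-8", errors="replace") as fh:
            lines = fh.readlines()
    except OSError:
        return ""
    return "".join(lines[max(start_line - 1, 0):end_line])
```

Because the `path` field is never rewritten, callers keep seeing host-centric paths while the read itself only ever touches the `/work` tree.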
…letters

- Root cause: when HOST_INDEX_PATH in the indexer container contained a Windows-style path
  (e.g. "/workc:\Users\Admin\..."), ingest_code combined it with /work/<slug>/... using os.path.join. This produced malformed host_path values like "/work/c:\Users\.../3-5-<slug>/API/Logs.py" that then surfaced in Qdrant metadata and repo_search results.
- ingest_code.index_single_file: guard HOST_INDEX_PATH by ignoring any value that contains a colon (Windows drive letter heuristic). When origin.source_path is available we still
  derive host_path from that; when origin is missing and HOST_INDEX_PATH looks Windows-y we fall back to using the container path instead of constructing /workC:\... strings.
- watch_index._rename_in_store: apply the same HOST_INDEX_PATH guard when recomputing host_path/container_path during fast-path renames so we don’t reintroduce malformed host paths for Windows-uploaded repos.
- Linux/local behavior is unchanged: normal Unix HOST_INDEX_PATH values do not contain a colon, so the guard is inactive and existing host_path/container_path derivation continues to work as before.
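
A rough sketch of the colon guard described above, assuming a simplified derivation in which the host path is the host root joined with the file's path relative to `/work`. The helper names `usable_host_root` and `derive_host_path` are hypothetical, and the `origin.source_path` branch is left out entirely.

```python
import posixpath


def usable_host_root(host_index_path):
    """Return HOST_INDEX_PATH, or None when it is empty or looks Windows-style.

    The heuristic is the one named in the commit message: any colon is
    treated as a drive letter, so the value is ignored.
    """
    value = (host_index_path or "").strip()
    if not value or ":" in value:
        return None
    return value


def derive_host_path(container_path, host_index_path, work_root="/work"):
    """Map a /work/... container path onto the host root when that is safe."""
    root = usable_host_root(host_index_path)
    if root is None:
        # Fall back to the container path rather than building /work/c:\... strings.
        return container_path
    rel = posixpath.relpath(container_path, work_root)
    return posixpath.join(root, rel)
```

With these assumptions, `derive_host_path("/work/slug/API/Logs.py", "/home/me/code")` maps onto the host tree, while a value such as `c:\Users\Admin` leaves the container path untouched.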
…d cross-worktree dedup

- Introduce a stable logical_repo_id in workspace_state, computed from
  the git common dir when available (git:<sha16>, fs:<sha16> fallback).
- Add workspace_state helpers to find or create collections by logical
  repo (find_collection_for_logical_repo, get_or_create_collection_for_logical_repo),
  and persist logical_repo_id + qdrant_collection in .codebase state.
- Update upload_service to accept logical_repo_id from clients, reuse an
  existing canonical collection when a mapping exists, and perform a
  one-time latent migration from legacy fs: IDs when there is a single
  existing mapping.
- Extend ingest_code to store logical file identity in Qdrant payloads
  (metadata.repo_id + metadata.repo_rel_path) and to prefer this logical
  identity in get_indexed_file_hash, enabling skip-unchanged across
  git worktrees / slugs instead of only by absolute path.
- Update index_single_file and batch index_repo flows to derive
  (repo_id, repo_rel_path) from workspace_state + /work layout, and pass
  them into Qdrant lookups and metadata writes.
- Enhance watch_index’s _get_collection_for_repo to consult workspace
  state and reuse the canonical collection for all slugs sharing the
  same logical_repo_id, aligning local watcher/indexer behavior with the
  remote upload service.
- Gate all logical-repo / collection reuse behavior behind a new
  LOGICAL_REPO_REUSE feature flag so default behavior remains the legacy
  per-repo collection and path-based dedup until explicitly enabled.
- Add lightweight logging around logical_repo_id state reads/writes to
  avoid completely silent failures while keeping the new paths
  best-effort and backwards compatible.
- Document LOGICAL_REPO_REUSE in .env.example (commented out by default),
  with notes on collection reuse across worktrees and logical
  (repo_id + repo_rel_path) skip-unchanged semantics.
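
One way the `git:<sha16>` / `fs:<sha16>` scheme and the feature flag could look is sketched below. Several details are assumptions rather than facts from this PR: the hash is taken to be the first 16 hex characters of SHA-1, the accepted truthy values for `LOGICAL_REPO_REUSE` are guesses, and the function names are illustrative. `git rev-parse --git-common-dir` is a real git option that resolves to the shared `.git` directory for every worktree of a repository.

```python
import hashlib
import os
import subprocess


def _sha16(text):
    # Assumption: "sha16" means the first 16 hex characters of a SHA-1 digest.
    return hashlib.sha1(text.encode("utf-8")).hexdigest()[:16]


def git_common_dir(workspace):
    """Return the absolute git common dir for workspace, or None if unavailable."""
    try:
        out = subprocess.run(
            ["git", "-C", workspace, "rev-parse", "--git-common-dir"],
            capture_output=True, text=True, check=True, timeout=5,
        ).stdout.strip()
    except (OSError, subprocess.SubprocessError):
        return None
    if not out:
        return None
    # The output may be relative to the workspace (for example ".git").
    return os.path.realpath(os.path.join(workspace, out))


def compute_logical_repo_id(workspace, resolve=git_common_dir):
    """All worktrees sharing a common dir get the same git:<sha16> id."""
    common = resolve(workspace)
    if common:
        return "git:" + _sha16(common)
    # Fallback for non-git directories: hash the resolved workspace path.
    return "fs:" + _sha16(os.path.realpath(workspace))


def reuse_enabled(env=os.environ):
    """Hypothetical flag check; the set of truthy spellings is an assumption."""
    return env.get("LOGICAL_REPO_REUSE", "").strip().lower() in {"1", "true", "yes", "on"}
```

Passing the resolver as a parameter keeps the hashing testable without a git checkout, and leaving the flag unset keeps the legacy per-repo behavior.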
@voarsh2 voarsh2 changed the title from "Smarter symbol-level reindexing, ReFRAG memory server alignment, git commit lineage, dev-remote perms + index speed improvements" to "Smarter symbol-level reindexing, ReFRAG memory server alignment, git commit lineage, dev-remote perms + index speed improvements + worktree deup" Nov 29, 2025
@voarsh2 voarsh2 marked this pull request as ready for review December 1, 2025 00:28
@voarsh2 voarsh2 merged commit cff177a into Context-Engine-AI:test Dec 1, 2025
1 check passed
@voarsh2 voarsh2 deleted the fix-ingest-refrag-collection-config-backup-restore branch December 10, 2025 04:09
m1rl0k pushed a commit that referenced this pull request Mar 1, 2026
…ig-backup-restore

Smarter symbol-level reindexing, ReFRAG memory server alignment, git commit lineage, dev-remote perms + index speed improvements + worktree deup