diff --git a/README.md b/README.md
index 973eb39d..28c6357c 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,9 @@
[CI](https://github.com/m1rl0k/Context-Engine/actions/workflows/ci.yml)
+**Documentation:** README · [Configuration](docs/CONFIGURATION.md) · [IDE Clients](docs/IDE_CLIENTS.md) · [MCP API](docs/MCP_API.md) · [ctx CLI](docs/CTX_CLI.md) · [Memory Guide](docs/MEMORY_GUIDE.md) · [Architecture](docs/ARCHITECTURE.md) · [Multi-Repo](docs/MULTI_REPO_COLLECTIONS.md) · [Kubernetes](deploy/kubernetes/README.md) · [VS Code Extension](docs/vscode-extension.md) · [Troubleshooting](docs/TROUBLESHOOTING.md) · [Development](docs/DEVELOPMENT.md)
+
+---
+
## Context-Engine at a Glance
Context-Engine is a plug-and-play MCP retrieval stack that unifies code indexing, hybrid search, and optional llama.cpp decoding so product teams can ship context-aware agents in minutes, not weeks.
@@ -9,26 +13,36 @@ Context-Engine is a plug-and-play MCP retrieval stack that unifies code indexing
**Key differentiators**
-- One-command bring-up delivers dual SSE/RMCP endpoints, seeded Qdrant, and live watch/reindex loops for fast local validation.
-- ReFRAG-inspired micro-chunking, token budgeting, and gate-first filtering surface precise spans while keeping prompts lean.
-- Shared memory/indexer schema and reranker tooling make it easy to mix dense, lexical, and semantic signals without bespoke glue code.
-- **NEW: Performance optimizations** including connection pooling, intelligent caching, request deduplication, and async subprocess management that cut redundant calls and smooth spikes under load.
-- Operational playbooks (prune, warm, health, cache) plus rich tests give teams confidence to take the stack from laptop to production.
+- One-command bring-up delivers dual SSE/RMCP endpoints, seeded Qdrant, and live watch/reindex loops
+- ReFRAG-inspired micro-chunking, token budgeting, and gate-first filtering surface precise spans
+- Shared memory/indexer schema and reranker tooling for dense, lexical, and semantic signals
+- **ctx CLI prompt enhancer** with multi-pass unicorn mode for code-grounded prompt rewriting
+- VS Code extension with Prompt+ button and automatic workspace sync
+- Kubernetes deployment with Kustomize for remote/scalable setups
+- Performance optimizations: connection pooling, caching, deduplication, async subprocess management
**Built for**
-- AI platform and IDE tooling teams that need an MCP-compliant context layer without rebuilding indexing, embeddings, or retrieval heuristics.
-- DevEx and documentation groups standing up internal assistants that must ingest large or fast-changing codebases with minimal babysitting.
+- AI platform and IDE tooling teams needing an MCP-compliant context layer
+- DevEx groups standing up internal assistants for large or fast-changing codebases
-**Solves**
-- Slow agent onboarding caused by fractured infra—ship a consistent stack for memory, search, and decoding under one config.
-- Context drift in monorepos—automatic micro-chunking and watcher-driven reindexing keep embeddings aligned with reality.
-- Fragmented client compatibility—serve both legacy SSE and modern HTTP RMCP clients from the same deployment.
-- **NEW: Performance relief** via intelligent caching, connection pooling, and async I/O patterns that eliminate redundant processing.
+## Supported Clients
-## Context-Engine
+| Client | Transport | Notes |
+|--------|-----------|-------|
+| Roo | SSE/RMCP | Both SSE and RMCP connections |
+| Cline | SSE/RMCP | Both SSE and RMCP connections |
+| Windsurf | SSE/RMCP | Both SSE and RMCP connections |
+| Zed | SSE | Uses mcp-remote bridge |
+| Kiro | SSE | Uses mcp-remote bridge |
+| Qodo | RMCP | Direct HTTP endpoints |
+| OpenAI Codex | RMCP | TOML config |
+| Augment | SSE | Simple JSON configs |
+| AmpCode | SSE | Simple URL for SSE endpoints |
+| Claude Code CLI | SSE | Simple JSON configs |
+
+> **See [docs/IDE_CLIENTS.md](docs/IDE_CLIENTS.md) for detailed configuration examples.**
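+
+For example, a minimal SSE client config using the `mcp-remote` bridge (shown here in the Kiro workspace format, `.kiro/settings/mcp.json`; other clients differ):
+
+```json
+{
+  "mcpServers": {
+    "qdrant-indexer": { "command": "npx", "args": ["mcp-remote", "http://localhost:8001/sse", "--transport", "sse-only"] },
+    "memory": { "command": "npx", "args": ["mcp-remote", "http://localhost:8000/sse", "--transport", "sse-only"] }
+  }
+}
+```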
-## Context-Engine Quickstart (5 minutes)
+## Quickstart (5 minutes)
This gets you from zero to “search works” in under five minutes.
@@ -67,85 +81,25 @@ HOST_INDEX_PATH=. COLLECTION_NAME=codebase docker compose run --rm indexer --roo
- Ports: 8000/8001 (/sse) and 8002/8003 (/mcp)
- Command: `INDEX_MICRO_CHUNKS=1 MAX_MICRO_CHUNKS_PER_FILE=200 make reset-dev-dual`
-### Environment Configuration
-
-**Default Setup:**
-- The repository includes `.env.example` with sensible defaults for local development
-- On first run, copy it to `.env`: `cp .env.example .env`
-- The `make reset-dev*` targets will use your `.env` settings automatically
-
-**Key Configuration Files:**
-- `.env` — Your local environment variables (gitignored, safe to customize)
-- `.env.example` — Template with documented defaults (committed to repo)
-- `docker-compose.yml` — Service definitions that read from `.env`
-
-**Recommended Customizations:**
-
-1. **Enable micro-chunking** (better retrieval quality):
- ```bash
- INDEX_MICRO_CHUNKS=1
- MAX_MICRO_CHUNKS_PER_FILE=200
- ```
-
-2. **Enable decoder for Q&A** (context_answer tool):
- ```bash
- REFRAG_DECODER=1 # Enable decoder (default: 1)
- REFRAG_RUNTIME=llamacpp # Use llama.cpp (default) or glm
- ```
+### Environment Setup
-3. **GPU acceleration** (Apple Silicon Metal):
- ```bash
- # Option A: Use the toggle script (recommended)
- scripts/gpu_toggle.sh gpu
- scripts/gpu_toggle.sh start
-
- # Option B: Manual .env settings
- USE_GPU_DECODER=1
- LLAMACPP_URL=http://host.docker.internal:8081
- LLAMACPP_GPU_LAYERS=32 # or -1 for all layers
- ```
-
-4. **Alternative: GLM API** (instead of local llama.cpp):
- ```bash
- REFRAG_RUNTIME=glm
- GLM_API_KEY=your-api-key-here
- GLM_MODEL=glm-4.6 # Optional, defaults to glm-4.6
- ```
-
-5. **Collection name** (unified by default):
- ```bash
- COLLECTION_NAME=codebase # Default: single unified collection for all code
- # Only change this if you need isolated collections per project
- ```
-
-**After changing `.env`:**
-- Restart services: `docker compose restart mcp_indexer mcp_indexer_http`
-- For indexing changes: `make reindex` or `make reindex-hard`
-- For decoder changes: `docker compose up -d --force-recreate llamacpp` (or restart native server)
-
-### Switch decoder model (llama.cpp)
-- Default tiny model: Granite 4.0 Micro (Q4_K_M GGUF)
-- Change the model by overriding Make vars (downloads to ./models/model.gguf):
```bash
-LLAMACPP_MODEL_URL="https://huggingface.co/ORG/MODEL/resolve/main/model.gguf" \
- INDEX_MICRO_CHUNKS=1 MAX_MICRO_CHUNKS_PER_FILE=200 make reset-dev-dual
+cp .env.example .env # Copy template on first run
```
-- Want GPU acceleration? Set `LLAMACPP_USE_GPU=1` (optionally `LLAMACPP_GPU_LAYERS=-1`) in your `.env` before `docker compose up`, or simply run `scripts/gpu_toggle.sh gpu` (described below) to flip the switch for you.
-- Embeddings: set EMBEDDING_MODEL in .env and reindex (make reindex)
-
-Decoder env toggles (set in `.env` and managed automatically by `scripts/gpu_toggle.sh`):
+Key settings (see [docs/CONFIGURATION.md](docs/CONFIGURATION.md) for full reference):
-| Variable | Description | Typical values |
-|-----------------------|-------------------------------------------------------|----------------|
-| `USE_GPU_DECODER` | Feature-flag for native Metal decoder | `0` (docker), `1` (native) |
-| `LLAMACPP_URL` | Decoder endpoint containers should use | `http://llamacpp:8080` or `http://host.docker.internal:8081` |
-| `LLAMACPP_GPU_LAYERS` | Number of layers to offload to GPU (`-1` = all) | `0`, `32`, `-1` |
+| Setting | Purpose | Default |
+|---------|---------|---------|
+| `INDEX_MICRO_CHUNKS` | Enable micro-chunking | 0 |
+| `REFRAG_DECODER` | Enable LLM decoder | 1 |
+| `REFRAG_RUNTIME` | Decoder backend (`llamacpp` or `glm`) | llamacpp |
+| `COLLECTION_NAME` | Qdrant collection | codebase |
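+
+A typical `.env` tweak enabling micro-chunking with the default llama.cpp decoder:
+
+```bash
+INDEX_MICRO_CHUNKS=1
+MAX_MICRO_CHUNKS_PER_FILE=200
+REFRAG_DECODER=1
+REFRAG_RUNTIME=llamacpp
+COLLECTION_NAME=codebase
+```
+
+After changing `.env`, restart the MCP services with `docker compose restart mcp_indexer mcp_indexer_http`, then run `make reindex` for indexing changes to take effect.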
-
-Alternative (compose only)
+**GPU acceleration (Apple Silicon):**
```bash
-HOST_INDEX_PATH="$(pwd)" FASTMCP_INDEXER_PORT=8001 docker compose up -d qdrant mcp mcp_indexer indexer watcher
+scripts/gpu_toggle.sh gpu # Switch to native Metal
+scripts/gpu_toggle.sh start # Start GPU decoder
```
### Recommended development flow
@@ -194,578 +148,117 @@ This re-enables the `llamacpp` container and resets `.env` to `http://llamacpp:8
### CLI: ctx prompt enhancer
-A thin CLI that retrieves code context and rewrites your input into a better, context-aware prompt using the local LLM decoder. Works with both questions and commands/instructions. By default it prints ONLY the improved prompt.
-
-Examples:
-````bash
-# Questions: Enhanced with specific details and multiple aspects
-scripts/ctx.py "What is ReFRAG?"
-# Output: Two detailed question paragraphs with file/line references
-
-# Commands: Enhanced with concrete targets and implementation details
-scripts/ctx.py "Refactor ctx.py"
-# Output: Two detailed instruction paragraphs with specific steps
-
-# Unicorn mode: staged 2–3 pass enhancement for best results
-scripts/ctx.py "Refactor ctx.py" --unicorn
-
-# Via Make target (default improved prompt only)
-make ctx Q="Explain the caching logic to me in detail"
-
-# Filter by language/path or adjust tokens
-make ctx Q="Hybrid search details" ARGS="--language python --under scripts/ --limit 2 --rewrite-max-tokens 200"
-````
-
-
-### Detail mode (short snippets)
-
-Include compact code snippets in the retrieved context for richer rewrites (trades a bit of speed for quality):
-
-````bash
-# Enable detail mode (adds short snippets) - works with questions
-scripts/ctx.py "Explain the caching logic" --detail
-
-# Detail mode with commands - gets more specific implementation details
-scripts/ctx.py "Add error handling to ctx.py" --detail
-
-# Adjust snippet size if needed (default is 1 line when --detail is used)
-make ctx Q="Explain hybrid search" ARGS="--detail --context-lines 2"
-````
-
-Notes:
-- Default behavior is header-only (fastest). `--detail` adds short snippets.
-- If `--detail` is set and `--context-lines` remains at its default (0), ctx.py automatically uses 1 line to keep snippets concise. Override with `--context-lines N`.
-- Detail mode is optimized for speed: automatically clamps to max 4 results and 1 result per file.
-
-### Unicorn mode (staged multi-pass for best quality)
-
-Use `--unicorn` for the highest quality prompt enhancement with a staged 2-3 pass approach:
-
-````bash
-# Unicorn mode with commands - produces exceptional, detailed instructions
-scripts/ctx.py "refactor ctx.py" --unicorn
-
-# Unicorn mode with questions - produces highly intelligent, multi-faceted questions
-scripts/ctx.py "what is ReFRAG and how does it work?" --unicorn
+A CLI that retrieves code context and rewrites your input into a better, code-grounded prompt using the local LLM decoder.
-# Works with all filters
-scripts/ctx.py "add error handling" --unicorn --language python
-````
+**Features:**
+- **Unicorn mode** (`--unicorn`): Multi-pass enhancement with 2-3 refinement stages
+- **Detail mode** (`--detail`): Include compact code snippets for richer context
+- **Memory blending**: Falls back to stored memories when code search returns no hits
+- **Streaming**: Real-time token output for instant feedback
+- **Filters**: `--language`, `--under`, `--limit` to scope retrieval
-**How it works:**
+```bash
+scripts/ctx.py "What is ReFRAG?" # Basic question
+scripts/ctx.py "Refactor ctx.py" --unicorn # Multi-pass enhancement
+scripts/ctx.py "Add error handling" --detail # With code snippets
+make ctx Q="Explain caching" # Via Make target
+```
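+
+Behavior can be personalized via `~/.ctx_config.json`, e.g. to disable streaming or add standing instructions to every rewrite:
+
+```json
+{
+  "always_include_tests": true,
+  "extra_instructions": "Always consider error handling and edge cases",
+  "streaming": false
+}
+```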
-Unicorn mode uses multiple LLM passes with progressively richer code context:
-
-1. **Pass 1 (Draft)**: Retrieves rich code snippets (8 lines of context per match) to understand the codebase and sharpen the intent
-2. **Pass 2 (Refine)**: Retrieves even richer snippets (12 lines of context) based on the draft to ground the prompt with concrete code behaviors
-3. **Pass 3 (Polish)**: Optional cleanup pass that runs only if the output appears generic or incomplete
-
-**Key features:**
-
-- **Code-grounded**: References actual code behaviors and patterns from your codebase, not file paths or line numbers
-- **No hallucinations**: Only uses real code from your indexed repository - never invents references
-- **Multi-paragraph output**: Produces detailed, comprehensive prompts that explore multiple aspects
-- **Works with both questions and commands**: Enhances any type of prompt
-
-**When to use:**
-
-- **Normal mode**: Quick, everyday prompts (fastest)
-- **--detail**: Richer context without multi-pass overhead (balanced)
-- **--unicorn**: When you need the absolute best prompt quality (highest quality)
-
-### Advanced Features
-
-#### 1. Streaming Output (Default)
-
-All modes now stream tokens as they arrive for instant feedback:
-
-````bash
-# Streaming is enabled by default - see output appear immediately
-scripts/ctx.py "refactor ctx.py" --unicorn
-````
-
-To disable streaming (wait for full response):
-- Set `"streaming": false` in `~/.ctx_config.json`
-
-#### 2. Memory Blending
-
-Automatically falls back to `context_search` with memories when repo search returns no hits:
-
-````bash
-# If no code matches, ctx.py will search design docs and ADRs
-scripts/ctx.py "What is our authentication strategy?"
-````
-
-This ensures you get relevant context even when the query doesn't match code directly.
-
-#### 3. Adaptive Context Sizing
-
-Automatically adjusts `limit` and `context_lines` based on query characteristics:
-
-- **Short/vague queries** → More context for richer grounding
-- **Queries with file/function names** → Lighter settings for speed
-
-````bash
-# Short query → auto-increases context
-scripts/ctx.py "caching"
-
-# Specific query → optimized for speed
-scripts/ctx.py "refactor fetch_context function in ctx.py"
-````
+> **See [docs/CTX_CLI.md](docs/CTX_CLI.md) for full documentation.**
-#### 4. Automatic Quality Assurance
+## Index Another Codebase
-Enhanced `_needs_polish()` heuristic automatically triggers a third polish pass when:
-
-- Output is too short (< 180 chars)
-- Contains generic/vague language
-- Missing concrete code references
-- Lacks proper paragraph structure
+```bash
+# Index a specific path
+make index-path REPO_PATH=/path/to/repo [RECREATE=1]
-This happens transparently in `--unicorn` mode - no user action needed.
+# Index current directory
+cd /path/to/repo && make -C /path/to/Context-Engine index-here
-#### 5. Personalized Templates
+# Raw docker compose
+docker compose run --rm -v /path/to/repo:/work indexer --root /work --recreate
+```
-Create `~/.ctx_config.json` to customize prompt enhancement behavior:
+> **See [docs/MULTI_REPO_COLLECTIONS.md](docs/MULTI_REPO_COLLECTIONS.md) for multi-repo architecture and remote deployment.**
-````json
-{
- "always_include_tests": true,
- "prefer_bullet_commands": false,
- "extra_instructions": "Always consider error handling and edge cases",
- "streaming": true
-}
-````
+## Verify Endpoints
-**Available preferences:**
+```bash
+curl -sSf http://localhost:6333/readyz && echo "Qdrant OK"
+curl -sI http://localhost:8000/sse | head -n1  # Memory SSE
+curl -sI http://localhost:8001/sse | head -n1  # Indexer SSE
+curl -sI http://localhost:8002/mcp | head -n1  # Memory RMCP
+curl -sI http://localhost:8003/mcp | head -n1  # Indexer RMCP
+# Note: GET /mcp may return 400; the RMCP endpoint is POST-only JSON-RPC
+```
-- `always_include_tests`: Add testing considerations to all prompts
-- `prefer_bullet_commands`: Format commands as bullet points
-- `extra_instructions`: Custom instructions added to every rewrite
-- `streaming`: Enable/disable streaming output (default: true)
+---
-See `ctx_config.example.json` for a template.
+## Documentation
-GPU Acceleration (Apple Silicon):
-For faster prompt rewriting, use the native Metal-accelerated decoder:
-````bash
-# 1. Set USE_GPU_DECODER=1 in your .env file (already set by default)
-# 2. Start the native llama.cpp server with Metal GPU
-scripts/gpu_toggle.sh start
+| Topic | Description |
+|-------|-------------|
+| [Configuration](docs/CONFIGURATION.md) | Complete environment variable reference |
+| [IDE Clients](docs/IDE_CLIENTS.md) | Setup for Roo, Cline, Windsurf, Zed, Kiro, Qodo, Codex, Augment |
+| [MCP API](docs/MCP_API.md) | Full API reference for all MCP tools |
+| [ctx CLI](docs/CTX_CLI.md) | Prompt enhancer CLI with unicorn mode |
+| [Memory Guide](docs/MEMORY_GUIDE.md) | Memory patterns and metadata schema |
+| [Architecture](docs/ARCHITECTURE.md) | System design and component interactions |
+| [Multi-Repo](docs/MULTI_REPO_COLLECTIONS.md) | Multi-repository indexing and remote deployment |
+| [Kubernetes](deploy/kubernetes/README.md) | Kubernetes deployment with Kustomize |
+| [VS Code Extension](docs/vscode-extension.md) | Workspace uploader and Prompt+ integration |
+| [Troubleshooting](docs/TROUBLESHOOTING.md) | Common issues and solutions |
+| [Development](docs/DEVELOPMENT.md) | Contributing and development setup |
-# Now ctx.py will automatically use the GPU decoder on port 8081
-make ctx Q="Explain the caching logic to me in detail"
+---
-# Stop the native GPU server
-scripts/gpu_toggle.sh stop
+## Available MCP Tools
-# To use Docker decoder instead, set USE_GPU_DECODER=0 in .env and restart:
-docker compose up -d llamacpp
-````
+**Memory MCP** (port 8000 SSE, 8002 RMCP):
+- `store(information, metadata?, collection?)` — save memories with metadata (dense + lexical vectors)
+- `find(query, limit=5, collection?, top_k?)` — hybrid memory search
+- `set_session_defaults` — set default collection for session
-Notes:
-- Defaults to the Indexer HTTP RMCP endpoint at http://localhost:8003/mcp (override with MCP_INDEXER_URL)
-- Decoder endpoint: automatically detects GPU mode via USE_GPU_DECODER env var (set by gpu_toggle.sh)
-- Docker decoder (default): http://localhost:8080/completion
-- GPU decoder (after gpu_toggle.sh gpu): http://localhost:8081/completion
-- See also: `make ctx`
+**Indexer MCP** (port 8001 SSE, 8003 RMCP):
+- **Search**: `repo_search`, `code_search`, `context_search`, `context_answer`
+- **Specialized**: `search_tests_for`, `search_config_for`, `search_callers_for`, `search_importers_for`
+- **Indexing**: `qdrant_index_root`, `qdrant_index`, `qdrant_prune`
+- **Status**: `qdrant_status`, `qdrant_list`, `workspace_info`, `list_workspaces`, `collection_map`
+- **Utilities**: `expand_query`, `change_history_for_path`, `set_session_defaults`
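+
+Tools on the RMCP endpoints are invoked with standard MCP `tools/call` JSON-RPC requests. A sketch of a `repo_search` request body, POSTed to `http://localhost:8003/mcp` (argument names illustrative; check docs/MCP_API.md for the exact schema):
+
+```json
+{
+  "jsonrpc": "2.0",
+  "id": 1,
+  "method": "tools/call",
+  "params": {
+    "name": "repo_search",
+    "arguments": { "query": "async file watcher", "limit": 5 }
+  }
+}
+```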
-## Index another codebase (outside this repo)
+> **See [docs/MCP_API.md](docs/MCP_API.md) for complete API documentation.**
-You can index any local folder by mounting it at /work. Three easy ways:
+## Language Support
-1) Make target: index a specific path
-```bash
-make index-path REPO_PATH=/abs/path/to/other/repo [RECREATE=1] [REPO_NAME=name] [COLLECTION=name]
-```
-- RECREATE=1 drops and recreates the collection before indexing
-- Defaults: REPO_NAME and COLLECTION fall back to the folder name
+Python, JavaScript/TypeScript, Go, Java, Rust, Shell, Terraform, PowerShell, YAML, C#, PHP
-2) Make target: index the current working directory
-```bash
-cd /abs/path/to/other/repo
-make -C /Users/user/Desktop/Context-Engine index-here [RECREATE=1] [REPO_NAME=name] [COLLECTION=name]
-```
+## Running Tests
-3) Raw docker compose (one‑shot ingest without Make)
```bash
-docker compose run --rm \
- -v /abs/path/to/other/repo:/work \
- indexer --root /work [--recreate]
-```
-Notes:
-- No need to bind-mount this repository; the images bake /app/scripts and set WORK_ROOTS="/work,/app" so utilities import correctly.
-- MCP clients can connect to the running servers and operate on whichever folder is mounted at /work.
-
-## Supported IDE clients/extensions
-- Roo (SSE/RMCP): supports both SSE and RMCP connections; see config examples below
-- Cline (SSE/RMCP): supports both SSE and RMCP connections; see config examples below
-- Windsurf (SSE/RMCP): supports both SSE and RMCP connections; see config examples below
-- Zed (SSE): uses mcp-remote bridge via command/args; see config below
-- Kiro (SSE): uses mcp-remote bridge via command/args; see config below
-- Qodo (RMCP): connects directly to HTTP endpoints; add each tool individually
-- OpenAI Codex (RMCP): TOML config for memory/indexer URLs
-- Augment (SSE): simple JSON configs for both servers
-- AmpCode (SSE): simple URL for both legacy sse endpoints
-- Claude Code CLI(SSE): simple JSON configs for both servers
-
-3) Verify endpoints
-````bash
-# Qdrant DB
-curl -sSf http://localhost:6333/readyz >/dev/null && echo "Qdrant OK"
-# Decoder (llama.cpp sidecar)
-curl -s http://localhost:8080/health
-# SSE endpoints (Memory, Indexer)
-curl -sI http://localhost:8000/sse | head -n1
-curl -sI http://localhost:8001/sse | head -n1
-# RMCP endpoints (HTTP JSON-RPC)
-curl -sI http://localhost:8002/mcp | head -n1
-curl -sI http://localhost:8003/mcp | head -n1
-````
-
-## Configuration reference (env vars)
-
-Core
-- COLLECTION_NAME: Qdrant collection to use (defaults to repo name if unset in some flows)
-- REPO_NAME: Logical name for the indexed repo; stored in payload for filtering
-- HOST_INDEX_PATH: Absolute host path to index (mounted to /work in containers)
-
-Indexing / micro-chunks
-- INDEX_MICRO_CHUNKS: 1 to enable micro‑chunking; off falls back to line chunks
-- MAX_MICRO_CHUNKS_PER_FILE: Cap micro‑chunks per file (e.g., 200 default)
-- TOKENIZER_URL, TOKENIZER_PATH: Hugging Face tokenizer.json URL and local path
-- USE_TREE_SITTER: 1 to enable tree-sitter parsing (optional; off by default)
-
-Watcher
-- WATCH_DEBOUNCE_SECS: Debounce between change events (e.g., 1.5)
-- INDEX_UPSERT_BATCH / INDEX_UPSERT_RETRIES / INDEX_UPSERT_BACKOFF: Upsert tuning
-- QDRANT_TIMEOUT: Request timeout in seconds for upserts/queries (e.g., 60–90)
-- MCP_TOOL_TIMEOUT_SECS: Max duration for long-running MCP tools (index/prune); default 3600s
-
-
-Reranker
-- RERANKER_ONNX_PATH, RERANKER_TOKENIZER_PATH: Paths for local ONNX cross‑encoder
-- RERANKER_ENABLED: 1/true to enable, 0/false to disable; default is enabled in server
- - Timeouts/failures automatically fall back to hybrid results
-
-Decoder (llama.cpp / GLM)
-- REFRAG_DECODER: 1 to enable decoder for context_answer; 0 to disable (default: 1)
-- REFRAG_RUNTIME: llamacpp or glm (default: llamacpp)
-- LLAMACPP_URL: llama.cpp server endpoint (default: http://llamacpp:8080 or http://host.docker.internal:8081 for GPU)
-- LLAMACPP_TIMEOUT_SEC: Decoder request timeout in seconds (default: 300)
-- DECODER_MAX_TOKENS: Max tokens for decoder responses (default: 4000)
-- REFRAG_DECODER_MODE: prompt or soft (default: prompt; soft requires patched llama.cpp)
-- GLM_API_KEY: API key for GLM provider (required when REFRAG_RUNTIME=glm)
-- GLM_MODEL: GLM model name (default: glm-4.6)
-- USE_GPU_DECODER: 1 for native Metal decoder on host, 0 for Docker (managed by gpu_toggle.sh)
-- LLAMACPP_GPU_LAYERS: Number of layers to offload to GPU, -1 for all (default: 32)
-
-ReFRAG (micro-chunking and retrieval)
-- REFRAG_MODE: 1 to enable micro-chunking and span budgeting (default: 1)
-- REFRAG_GATE_FIRST: 1 to enable mini-vector gating before dense search (default: 1)
-- REFRAG_CANDIDATES: Number of candidates for gate-first filtering (default: 200)
-- MICRO_BUDGET_TOKENS: Global token budget for context_answer spans (default: 512)
-- MICRO_OUT_MAX_SPANS: Max number of spans to return per query (default: 3)
-
-Ports
-- FASTMCP_PORT (SSE/RMCP): Override Memory MCP ports (defaults: 8000/8002)
-- FASTMCP_INDEXER_PORT (SSE/RMCP): Override Indexer MCP ports (defaults: 8001/8003)
-
-
-### Env var quick table
-
-| Name | Description | Default |
-|------|-------------|---------|
-| COLLECTION_NAME | Qdrant collection name (unified across all repos) | codebase |
-| REPO_NAME | Logical repo tag stored in payload for filtering | auto-detect from git/folder |
-| HOST_INDEX_PATH | Host path mounted at /work in containers | current repo (.) |
-| QDRANT_URL | Qdrant base URL | container: http://qdrant:6333; local: http://localhost:6333 |
-| INDEX_MICRO_CHUNKS | Enable token-based micro-chunking | 0 (off) |
-| HYBRID_EXPAND | Enable heuristic multi-query expansion | 0 (off) |
-| MAX_MICRO_CHUNKS_PER_FILE | Cap micro-chunks per file | 200 |
-| TOKENIZER_URL | HF tokenizer.json URL (for Make download) | n/a (use Make target) |
-| TOKENIZER_PATH | Local path where tokenizer is saved (Make) | models/tokenizer.json |
-| TOKENIZER_JSON | Runtime path for tokenizer (indexer) | models/tokenizer.json |
-| USE_TREE_SITTER | Enable tree-sitter parsing (py/js/ts) | 0 (off) |
-| WATCH_DEBOUNCE_SECS | Debounce between FS events (watcher) | 1.5 |
-| INDEX_UPSERT_BATCH | Upsert batch size (watcher) | 128 |
-| INDEX_UPSERT_RETRIES | Retry count (watcher) | 5 |
-| MCP_TOOL_TIMEOUT_SECS | Max duration for long-running MCP tools | 3600 |
-| INDEX_UPSERT_BACKOFF | Seconds between retries (watcher) | 0.5 |
-| QDRANT_TIMEOUT | HTTP timeout seconds | watcher: 60; search: 20 |
-| RERANKER_ONNX_PATH | Local ONNX cross-encoder model path | unset (see make setup-reranker) |
-| RERANKER_TOKENIZER_PATH | Tokenizer path for reranker | unset |
-| RERANKER_ENABLED | Enable reranker by default | 1 (enabled) |
-| FASTMCP_PORT | Memory MCP server port (SSE/RMCP) | 8000 (container-internal) |
-| FASTMCP_INDEXER_PORT | Indexer MCP server port (SSE/RMCP) | 8001 (container-internal) |
-| FASTMCP_HTTP_PORT | Memory RMCP host port mapping | 8002 |
-| FASTMCP_INDEXER_HTTP_PORT | Indexer RMCP host port mapping | 8003 |
-| FASTMCP_HEALTH_PORT | Health port (memory/indexer) | memory: 18000; indexer: 18001 |
-| LLM_EXPAND_MAX | Max alternate queries generated via LLM | 0 |
-| REFRAG_DECODER | Enable decoder for context_answer | 1 (enabled) |
-| REFRAG_RUNTIME | Decoder backend: llamacpp or glm | llamacpp |
-| LLAMACPP_URL | llama.cpp server endpoint | http://llamacpp:8080 or http://host.docker.internal:8081 |
-| LLAMACPP_TIMEOUT_SEC | Decoder request timeout | 300 |
-| DECODER_MAX_TOKENS | Max tokens for decoder responses | 4000 |
-| GLM_API_KEY | API key for GLM provider | unset |
-| GLM_MODEL | GLM model name | glm-4.6 |
-| USE_GPU_DECODER | Native Metal decoder (1) vs Docker (0) | 0 (docker) |
-| REFRAG_MODE | Enable micro-chunking and span budgeting | 1 (enabled) |
-| REFRAG_GATE_FIRST | Enable mini-vector gating | 1 (enabled) |
-| REFRAG_CANDIDATES | Candidates for gate-first filtering | 200 |
-| MICRO_BUDGET_TOKENS | Token budget for context_answer | 512 |
-
-## Running tests
-
-Local (recommended)
-- Python 3.11+
-- Create venv and install deps:
-````bash
-python3 -m venv .venv
-source .venv/bin/activate
+python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
-````
-- Run the full suite:
-````bash
pytest -q
-````
-- Run a single file or test:
-````bash
-pytest tests/test_ingest_micro_chunks.py -q
-pytest tests/test_php_support.py::test_imports -q
-````
-- Tips:
- - RERANKER_ENABLED=0 can speed up some tests locally; functionality still validated via hybrid fallback.
- - Some integration tests may start ephemeral containers via testcontainers; ensure Docker is running.
-
-Inside Docker (optional, ad-hoc)
-- You can run tests in the indexer image by overriding the entrypoint:
-````bash
-docker compose run --rm --entrypoint pytest mcp-indexer -q
-````
-Note: the provided dev images focus on runtime; local venv is faster for iterative testing.
-
-
-## Language support
-- Python, JavaScript/TypeScript, Go, Java, Rust, Shell, Terraform, PowerShell, YAML, C#, PHP
-
-## Watcher behavior and tips
-- Handles delete and move: removes/migrates points to avoid stale entries
-- Live reloads ignore patterns: changes to .qdrantignore are applied without restart
-- path_glob matches against relative paths (e.g., src/**/*.py), not absolute /work paths
-- If upserts time out, lower INDEX_UPSERT_BATCH (e.g., 96) or raise QDRANT_TIMEOUT (e.g., 90)
-- For very large files, reduce MAX_MICRO_CHUNKS_PER_FILE (e.g., 200) during dev
-
-## Expected HTTP behaviors
-- GET /mcp may return 400 (normal): the RMCP endpoint is POST-only for JSON-RPC
-- SSE requires a session handshake; raw POST /messages without it will error (expected)
-
-```bash
-curl -sSf http://localhost:6333/readyz >/dev/null && echo "Qdrant OK"
-curl -sI http://localhost:8000/sse | head -n1
-curl -sI http://localhost:8001/sse | head -n1
-```
-
-4) Single command to index + search
-```bash
-# Fresh index of your repo and a quick hybrid example
-make reindex-hard
-make qdrant-status
-make hybrid ARGS="--query 'async file watcher' --limit 5 --include-snippet"
```
-5) Example MCP client configurations
-
-Kiro (SSE):
-Create `.kiro/settings/mcp.json` in your workspace:
-````json
-{
- "mcpServers": {
- "qdrant-indexer": { "command": "npx", "args": ["mcp-remote", "http://localhost:8001/sse", "--transport", "sse-only"] },
- "memory": { "command": "npx", "args": ["mcp-remote", "http://localhost:8000/sse", "--transport", "sse-only"] }
- }
-}
-````
-
-Zed (SSE):
-Add to your Zed `settings.json` (accessed via Command Palette → "Settings: Open Settings (JSON)"):
-````json
-{
- /// The name of your MCP server
- "qdrant-indexer": {
- /// The command which runs the MCP server
- "command": "npx",
- /// The arguments to pass to the MCP server
- "args": [
- "mcp-remote",
- "http://localhost:8001/sse",
- "--transport",
- "sse-only"
- ],
- /// The environment variables to set
- "env": {}
- }
-}
-````
-Notes:
-- Zed expects MCP servers at the root level of settings.json
-- Uses command/args (stdio). mcp-remote bridges to remote SSE endpoints
-- If npx prompts, add `-y` right after npx: `"command": "npx", "args": ["-y", "mcp-remote", ...]`
-- Alternative: Use direct HTTP connection if mcp-remote has issues:
- ```json
- {
- "qdrant-indexer": {
- "type": "http",
- "url": "http://localhost:8001/sse"
- }
- }
- ```
-- For Qodo (RMCP) clients, see "Qodo Integration (RMCP config)" below for the direct `url`-based snippet.
-
-6) Common troubleshooting
-- Tree-sitter not found or parser errors:
- - Feature is optional. If you set USE_TREE_SITTER=1 and see errors, unset it or install tree-sitter deps, then reindex.
-- Tokenizer missing for micro-chunks:
- - Run make tokenizer or set TOKENIZER_JSON to a valid tokenizer.json; otherwise we fall back to line-based chunking.
-- SSE “Invalid session ID” when POSTing /messages directly:
- - Expected if you didn’t initiate an SSE session first. Use an MCP client (e.g., mcp-remote) to handle the handshake.
-- llama.cpp platform warning on Apple Silicon:
- - Prefer the native path above (`scripts/gpu_toggle.sh gpu`). If you stick with Docker, add `platform: linux/amd64` to the service or ignore the warning during local dev.
-- Indexing feels stuck on very large files:
- - Use MAX_MICRO_CHUNKS_PER_FILE=200 during dev runs.
-
-
-- Watcher timeouts (-9) or Qdrant "ResponseHandlingException: timed out":
- - Set watcher-safe defaults to reduce payload size and add headroom during upserts:
-
- ````ini
- # Watcher-safe defaults (compose already applies these to the watcher service)
- QDRANT_TIMEOUT=60
- MAX_MICRO_CHUNKS_PER_FILE=200
- INDEX_UPSERT_BATCH=128
- INDEX_UPSERT_RETRIES=5
- INDEX_UPSERT_BACKOFF=0.5
- WATCH_DEBOUNCE_SECS=1.5
- ````
-
-
- - If issues persist, try lowering INDEX_UPSERT_BATCH to 96 or raising QDRANT_TIMEOUT to 90.
-
-ReFRAG background: https://arxiv.org/abs/2509.01092
-
-Endpoints
-
-| Component | URL |
-|-------------|------------------------------|
-| Memory MCP | http://localhost:8000/sse |
-| Indexer MCP | http://localhost:8001/sse |
-| Qdrant DB | http://localhost:6333 |
-
-
-### Streamable HTTP (RMCP) endpoints + OpenAI Codex config
-
-- Memory HTTP (RMCP): http://localhost:8002/mcp
-- Indexer HTTP (RMCP): http://localhost:8003/mcp
+> **See [docs/DEVELOPMENT.md](docs/DEVELOPMENT.md) for full development setup.**
-OpenAI Codex config (RMCP client):
+## Endpoints
-````toml
-experimental_use_rmcp_client = true
+| Component | SSE / HTTP | RMCP |
+|-----------|------------|------|
+| Memory MCP | http://localhost:8000/sse | http://localhost:8002/mcp |
+| Indexer MCP | http://localhost:8001/sse | http://localhost:8003/mcp |
+| Qdrant DB | http://localhost:6333 | - |
+| Decoder | http://localhost:8080 | - |
-[mcp_servers.memory_http]
-url = "http://127.0.0.1:8002/mcp"
+> **See [docs/IDE_CLIENTS.md](docs/IDE_CLIENTS.md) for client setup and [docs/TROUBLESHOOTING.md](docs/TROUBLESHOOTING.md) for common issues.**
-[mcp_servers.qdrant_indexer_http]
-url = "http://127.0.0.1:8003/mcp"
-````
-
-
-### Kiro Integration (workspace config)
-
-Add this to your workspace-level Kiro config at `.kiro/settings/mcp.json` (restart Kiro after saving):
-
-````json
-{
- "mcpServers": {
- "qdrant-indexer": { "command": "npx", "args": ["mcp-remote", "http://localhost:8001/sse", "--transport", "sse-only"] },
- "memory": { "command": "npx", "args": ["mcp-remote", "http://localhost:8000/sse", "--transport", "sse-only"] }
- }
-}
-````
-
-Notes:
-- Kiro expects command/args (stdio). `mcp-remote` bridges to remote SSE endpoints.
-- If `npx` prompts in your environment, add `-y` right after `npx`.
-- Workspace config overrides user-level config (`~/.kiro/settings/mcp.json`).
-
-Troubleshooting:
-- Error: “Enabled MCP Server must specify a command, ignoring.”
- - Fix: Use the `command`/`args` form above; do not use `type:url` in Kiro.
-- ImportError: `deps: No module named 'scripts'` when calling `memory_store` on the indexer MCP
- - Fix applied: server now adds `/work` and `/app` to `sys.path`. Restart `mcp_indexer`.
-
-## Available MCP tools
-
-Memory MCP (8000 SSE, 8002 RMCP):
-- store(information, metadata?, collection?) — write a memory entry into the default collection (dual vectors: dense + lexical)
-- find(query, limit=5, collection?, top_k?) — hybrid memory search over memory-like entries
-
-Indexer/Search MCP (8001 SSE, 8003 RMCP):
-- repo_search — hybrid code search (dense + lexical + optional reranker)
-- context_search — search that can also blend memory results (include_memories)
-- context_answer — natural-language Q&A with retrieval + local LLM (llama.cpp or GLM)
-- code_search — alias of repo_search
-- repo_search_compat — permissive wrapper that normalizes q/text/queries/top_k payloads
-- context_answer_compat — permissive wrapper for context_answer with lenient argument handling
-- expand_query(query, max_new?) — LLM-assisted query expansion (generates 1-2 alternates)
-- qdrant_index_root — index /work (mounted repo root) with safe defaults
-- qdrant_index(subdir?, recreate?, collection?) — index a subdir or recreate collection
-- qdrant_prune — remove points for missing files or file_hash mismatch
-- qdrant_list — list Qdrant collections
-- qdrant_status — collection counts and recent ingestion timestamps
-- workspace_info(workspace_path?) — read .codebase/state.json and resolve default collection
-- list_workspaces(search_root?) — scan for multiple workspaces in multi-repo environments
-- memory_store — convenience memory store from the indexer (uses default collection)
-- search_tests_for — intent wrapper for test files
-- search_config_for — intent wrapper for likely config files
-- search_callers_for — intent wrapper for probable callers/usages
-- search_importers_for — intent wrapper for files importing a module/symbol
-- change_history_for_path(path) — summarize recent changes using stored metadata
-- collection_map — return collection↔repo mappings
-- default_collection — set the collection to use for the session
-
-Notes:
-- Most search tools accept filters like language, under, path_glob, kind, symbol, ext.
-- Reranker enabled by default; timeouts fall back to hybrid results.
-- context_answer requires decoder enabled (REFRAG_DECODER=1) with llama.cpp or GLM backend.
-
-### Qodo Integration (RMCP config)
-
-Add this to your Qodo MCP settings to target the RMCP (HTTP) endpoints:
+ReFRAG background: https://arxiv.org/abs/2509.01092
-````json
-{
- "mcpServers": {
- "memory": { "url": "http://localhost:8002/mcp" },
- "qdrant-indexer": { "url": "http://localhost:8003/mcp" }
- }
-}
-````
+---
-Note: Qodo can talk to the RMCP endpoints directly, so no `mcp-remote` wrapper is required.
-
-
-## Architecture overview
-
-- Agents connect via MCP over SSE:
- - Memory MCP: http://localhost:8000/sse
- - Indexer MCP: http://localhost:8001/sse
-- Both MCP servers talk to Qdrant inside Docker at http://qdrant:6333 (DB HTTP API)
-- Supporting jobs (indexer, watcher, init_payload) write to/read from Qdrant directly
+## Architecture
```mermaid
flowchart LR
@@ -791,804 +284,11 @@ flowchart LR
class G opt
```
-## Production-ready local development
-## One-line bring-up (ship-ready)
-
-Start Qdrant, the Memory MCP (8000), the Indexer MCP (8001), and run a fresh index of your current repo:
-
-```bash
-HOST_INDEX_PATH="$(pwd)" FASTMCP_INDEXER_PORT=8001 docker compose up -d qdrant mcp mcp_indexer indexer watcher
-```
-
-Then wire your MCP-aware IDE/tooling to:
-- Memory MCP: http://localhost:8000/sse
-- Indexer MCP: http://localhost:8001/sse
-
-Tip: add `watcher` to the command if you want live reindex-on-save.
-
-### SSE Memory Server (port 8000)
-
-- URL: http://localhost:8000/sse
-- Tools: `store`, `find`
-- Env (used by the indexer to blend memory):
- - `MEMORY_SSE_ENABLED=true`
- - `MEMORY_MCP_URL=http://mcp:8000/sse`
- - `MEMORY_MCP_TIMEOUT=6`
-
-IDE/Agent config (recommended):
-
-```json
-{
- "mcpServers": {
- "memory": { "type": "sse", "url": "http://localhost:8000/sse", "disabled": false },
- "qdrant-indexer": { "type": "sse", "url": "http://localhost:8001/sse", "disabled": false }
- }
-}
-```
-
-
-## Memory usage patterns (how to get the most from memories)
-
-### When to use memories vs code search
-- Use memories when the information isn’t in your repository or is transient/user-authored: conventions, runbooks, decisions, links, known issues, FAQs, “how we do X here”.
-- Use code search for facts that live in the repo: APIs, functions/classes, configuration, and cross-file relationships.
-- Blend both for tasks like “how to run E2E tests” where instructions (memory) reference scripts in the repo (code).
-- Rule of thumb: if you’d write it in a team wiki or ticket comment, store it as a memory; if you’d grep for it, use code search.
-
-### Recommended metadata schema (best practices)
-We store memory entries as points in Qdrant with a small, consistent payload. Recommended keys:
-- kind: "memory" (string) – required. Enables filtering and blending.
-- topic: short category string (e.g., "dev-env", "release-process").
-- tags: list of strings (e.g., ["qdrant", "indexing", "prod"]).
-- source: where this came from (e.g., "chat", "manual", "tool", "issue-123").
-- author: who added it (e.g., username or email).
-- created_at: ISO8601 timestamp (UTC).
-- expires_at: ISO8601 timestamp if this memory should be pruned later.
-- repo: optional repo identifier if sharing a Qdrant instance across repos.
-- link: optional URL to docs, tickets, or dashboards.
-- priority: 0.0–1.0 weight that clients can use to bias ranking when blending.
-
-Notes:
-- Keep values small (short strings, small lists). Don’t store large blobs in payload; put details in the `information` text.
-- Use lowercase snake_case keys for consistency.
-- For secrets/PII: do not store plaintext. Store references or vault paths instead.
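As an illustration, a helper that assembles a payload following this schema might look like the sketch below (`make_memory_metadata` is a hypothetical name, not part of the stack):

```python
from datetime import datetime, timezone

def make_memory_metadata(topic, tags, source, author=None, priority=0.5, **extra):
    """Assemble a memory payload following the recommended schema."""
    meta = {
        "kind": "memory",                                 # required; enables filtering/blending
        "topic": topic,
        "tags": list(tags),
        "source": source,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "priority": max(0.0, min(1.0, float(priority))),  # clamp to the 0.0-1.0 range
    }
    if author:
        meta["author"] = author
    meta.update(extra)  # e.g. repo=..., link=..., expires_at=...
    return meta

meta = make_memory_metadata("dev-env", ["make", "reset"], "chat", priority=2.0)
print(meta["priority"])  # 1.0 (clamped)
```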
-
-### Example memory operations
-Store a memory (via MCP Memory server tool `store` – use your MCP client):
-```
-{
- "information": "Run full reset: INDEX_MICRO_CHUNKS=1 MAX_MICRO_CHUNKS_PER_FILE=200 make reset-dev",
- "metadata": {
- "kind": "memory",
- "topic": "dev-env",
- "tags": ["make", "reset"],
- "source": "chat"
- }
-}
-```
-
-Find memories (via MCP Memory server tool `find`):
-```
-{
- "query": "reset-dev",
- "limit": 5
-}
-```
-
-Blend memories into code search (Indexer MCP `context_search`):
-```
-{
- "query": "async file watcher",
- "include_memories": true,
- "limit": 5,
- "include_snippet": true
-}
-```
-
-Tips:
-- Use precise queries (2–5 tokens). Add a couple synonyms if needed; the server supports multiple phrasings.
-- Combine `topic`/`tags` in your memory text to make them easier to find (they also live in payload for filtering).
-
-### Backup and migration (advanced)
-For production-grade backup/migration strategies, see the official Qdrant documentation for snapshots and export/import. For local development, we recommend relying on Docker volumes and reindexing when needed.
-
-Operational notes:
-- Collection name comes from `COLLECTION_NAME` (see .env). This stack defaults to a single collection for both code and memories; filtering uses `metadata.kind`.
-- If you switch to a dedicated memory collection, update the MCP Memory server and the Indexer's memory blending env to point at it.
-- Consider pruning expired memories by filtering `expires_at < now`.
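A client-side sketch of that pruning pass (in production you would push the `expires_at` comparison into a Qdrant filter rather than scanning points):

```python
from datetime import datetime, timezone

def expired_ids(points, now=None):
    """Return ids of memory points whose metadata.expires_at is in the past.

    Points without expires_at are kept indefinitely.
    """
    now = now or datetime.now(timezone.utc)
    stale = []
    for p in points:
        ts = p.get("metadata", {}).get("expires_at")
        if ts and datetime.fromisoformat(ts) < now:
            stale.append(p["id"])
    return stale

points = [
    {"id": 1, "metadata": {"expires_at": "2020-01-01T00:00:00+00:00"}},
    {"id": 2, "metadata": {"kind": "memory"}},
]
print(expired_ids(points))  # [1]
```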
-
-- Call `context_search` on :8001 (SSE) or :8003 (RMCP) with `{ "include_memories": true }` to return both memory and code results.
-
-### Collection Naming Strategies
-
-Different hash lengths are used for different workspace types:
-
-**Local Workspaces:** `repo-name-8charhash`
-- Example: `Anesidara-e8d0f5fc`
-- Used by local indexer/watcher
-- Assumes unique repo names within workspace
-
-**Remote Uploads:** `folder-name-16charhash-8charhash`
-- Example: `testupload2-04e680d5939dd035-b8b8d4cc`
-- Avoids collisions when different codebases share the same folder name
-- 16-char hash identifies workspace, 8-char hash identifies collection
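A sketch of the local pattern; the exact material being hashed is an assumption here (the real implementation may hash different inputs), but the resulting shape matches:

```python
import hashlib

def local_collection_name(workspace_path):
    """Sketch of the repo-name-8charhash pattern for local workspaces.

    Assumption: the 8-char hash is derived from the workspace path.
    """
    name = workspace_path.rstrip("/").split("/")[-1]
    digest = hashlib.sha256(workspace_path.encode()).hexdigest()[:8]
    return f"{name}-{digest}"

coll = local_collection_name("/home/me/Anesidara")
print(coll.rsplit("-", 1)[0])  # Anesidara
```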
-
-
-### Enable memory blending (for context_search)
-
-1) Ensure the Memory MCP is running on :8000 (default in compose).
-2) Enable SSE memory blending on the Indexer MCP by setting these env vars for the mcp_indexer service (docker-compose.yml):
-
-
-````yaml
-services:
- mcp_indexer:
- environment:
- - MEMORY_SSE_ENABLED=true
- - MEMORY_MCP_URL=http://mcp:8000/sse
- - MEMORY_MCP_TIMEOUT=6
-````
-
-
-3) Restart the indexer service:
-
-````bash
-docker compose up -d mcp_indexer
-````
-
-
-4) Validate by calling context_search with include_memories=true for a query that matches a stored memory:
-
-
-````json
-{
- "query": "your test memory text",
- "include_memories": true,
- "limit": 5
-}
-````
-
-
-Expected: non-zero results with blended items; memory hits will have memory-like payloads (e.g., metadata.kind = "memory").
-
-
-- Idempotent + incremental indexing out of the box:
- - Skips unchanged files automatically using a file content hash stored in payload (metadata.file_hash)
- - De-duplicates per-file points by deleting prior entries for the same path before insert
- - Payload indexes are auto-created on first run (metadata.language, metadata.path_prefix, metadata.repo, metadata.kind, metadata.symbol, metadata.symbol_path, metadata.imports, metadata.calls)
-- Commands:
- - Full rebuild: `make reindex`
- - Fast incremental: `make index` (skips unchanged files)
- - Health check: `make health` (verifies collection vector name/dim, HNSW, and filtered queries with kind/symbol)
- - Hybrid search: `make hybrid` (dense + lexical bump with RRF)
-- Bootstrap all services + index + checks: `make bootstrap`
-- Discover commands: `make help` lists all targets and descriptions
-
-- Ingest Git history: `make history` (messages + file lists)
- - If the repo has no local commits yet, the history ingester will shallow-fetch from the remote (default: origin) and use its HEAD. Configure with `--remote` and `--fetch-depth`.
-- Local reranker (ONNX): `make rerank-local` (set RERANKER_ONNX_PATH and RERANKER_TOKENIZER_PATH)
-- Setup ONNX reranker quickly: `make setup-reranker ONNX_URL=... TOKENIZER_URL=...` (updates .env paths)
-- Enable Tree-sitter parsing (more accurate symbols/scopes): set `USE_TREE_SITTER=1` in `.env` then reindex
-
-- Flags (advanced):
- - Disable de-duplication: `docker compose run --rm indexer --root /work --no-dedupe`
- - Disable unchanged skipping: `docker compose run --rm indexer --root /work --no-skip-unchanged`
-
-Notes:
-- Named vector remains aligned with the MCP server (fast-bge-base-en-v1.5). If you change EMBEDDING_MODEL, run `make reindex` to recreate the collection.
-- For very large repos, consider running `make index` on a schedule (or pre-commit) to keep Qdrant warm without full reingestion.
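The skip-unchanged check reduces to a hash comparison; a minimal sketch (helper names are illustrative; the stored value lives in `metadata.file_hash`):

```python
import hashlib

def file_hash(content):
    """Content hash stored per point as metadata.file_hash."""
    return hashlib.sha256(content).hexdigest()

def should_reindex(content, stored_hash):
    """Reindex only when the stored hash is missing or no longer matches."""
    return stored_hash != file_hash(content)

src = b"def main(): ...\n"
h = file_hash(src)
print(should_reindex(src, h))          # False: unchanged, skipped
print(should_reindex(b"changed!", h))  # True: re-embed and upsert
```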
-
-### Multi-repo indexing (unified search)
-
-The stack uses a **single unified `codebase` collection** by default, making multi-repo search seamless:
-
-**Index another repo into the same collection:**
-```bash
-# From your qdrant directory
-make index-here HOST_INDEX_PATH=/path/to/other/repo REPO_NAME=other-repo
-
-# Or with full control:
-HOST_INDEX_PATH=/path/to/other/repo \
-COLLECTION_NAME=codebase \
-REPO_NAME=other-repo \
-docker compose run --rm indexer --root /work
-```
-
-**What happens:**
-- Files from the other repo get indexed into the unified `codebase` collection
-- Each file is tagged with `metadata.repo = "other-repo"` for filtering
-- Search across all repos by default, or filter by specific repo
-
-**Search examples:**
-```bash
-# Search across all indexed repos
-make hybrid QUERY="authentication logic"
-
-# Filter by specific repo
-python scripts/hybrid_search.py \
- --query "authentication logic" \
- --repo other-repo
-
-# Filter by repo + language
-python scripts/hybrid_search.py \
- --query "authentication logic" \
- --repo other-repo \
- --language python
-```
-
-**Benefits:**
-- One collection = unified search across all your code
-- No fragmentation or collection management overhead
-- Filter by repo when you need isolation
-- All repos share the same vector space for better semantic search
-
-### Multi-query re-ranker (no new deps)
-
-- Run a fused query with several phrasings and metadata-aware boosts:
-
-```bash
-make rerank
-```
-
-- Customize:
- - Add more `--query` flags
- - Prefer language: `--language python`
- - Prefer under path: `--under /work/scripts`
-
-### Watch mode (incremental indexing)
-
-- Reindex changed files on save (runs until Ctrl+C):
-
-```bash
-make watch
-```
-
-### HNSW recall tuning
-
-- Collection creation is tuned for higher recall: `m=16`, `ef_construct=256`.
-- If you change embeddings, run `make reindex` to recreate the collection with the tuned HNSW settings.
-
-### Warm start (optional)
-
-- Preload the embedding model and warm Qdrant's HNSW search path to reduce first-query latency and improve recall:
-
-```bash
-make warm
-```
-
-
-Since this stack already exposes SSE, you can also configure clients to use `http://localhost:8000/sse` directly (recommended for Cursor/Windsurf).
-
-### Search filters (repo_search/context_search)
-
-Most MCP clients let you pass structured tool arguments. The Indexer/search MCP supports applying server-side filters in repo_search/context_search when these keys are present:
-- `language`: value matches `metadata.language`
-- `path_prefix`: value matches `metadata.path_prefix` (e.g., `/work/src`)
-- `kind`: value matches `metadata.kind` (e.g., `function`, `class`, `method`)
-
-Tip: Combine multiple query phrasings and apply these filters for best precision on large codebases.
-
-
-## Notes
-
-## Index your repository (code search quality)
-
-We added a dockerized indexer that chunks code, embeds with `BAAI/bge-base-en-v1.5`, and stores metadata (`path`, `path_prefix`, `language`, `start_line`, `end_line`, `code`) in Qdrant. This boosts recall and relevance for the MCP tools.
-
-```bash
-# Index current workspace (does not drop data)
-make index
-
-# Full reindex (drops existing points in the collection)
-make reindex
-```
-
-### Companion MCP: Index/Prune/List (Option B)
-
-A second MCP server runs alongside the search MCP and exposes tools:
-- qdrant-list: list collections
-- qdrant-index: index the mounted path (/work or subdir)
-- qdrant-prune: prune stale points for the mounted path
-
-Configuration
-- FASTMCP_INDEXER_PORT (default 8001)
-- HOST_INDEX_PATH bind-mounts the target repo into /work (read-only)
-
-Add to your agent as a separate MCP endpoint (SSE):
-- URL: http://localhost:8001/sse
-
-Example calls (semantics vary by client):
-- qdrant-index with args {"subdir":"scripts","recreate":true}
-
-### MCP client configuration examples
-
-Roo (SSE/RMCP):
-
-```json
-{
- "mcpServers": {
- "memory": { "type": "sse", "url": "http://localhost:8000/sse", "disabled": false },
- "qdrant-indexer": { "type": "sse", "url": "http://localhost:8001/sse", "disabled": false }
- }
-}
-```
-
-Cline (SSE/RMCP):
-
-```json
-{
- "mcpServers": {
- "memory": { "type": "sse", "url": "http://localhost:8000/sse", "disabled": false },
- "qdrant-indexer": { "type": "sse", "url": "http://localhost:8001/sse", "disabled": false }
- }
-}
-```
-
-Windsurf (SSE/RMCP):
-
-```json
-{
- "mcpServers": {
- "memory": { "type": "sse", "url": "http://localhost:8000/sse", "disabled": false },
- "qdrant-indexer": { "type": "sse", "url": "http://localhost:8001/sse", "disabled": false }
- }
-}
-```
-
-Windsurf/Cursor (stdio for search + SSE for indexer):
-
-```json
-{
- "mcpServers": {
- "qdrant": {
- "command": "uvx",
- "args": ["mcp-server-qdrant"],
- "env": {
- "QDRANT_URL": "http://localhost:6333",
- "COLLECTION_NAME": "my-collection",
- "EMBEDDING_MODEL": "BAAI/bge-base-en-v1.5"
- },
- "disabled": false
- }
- }
-}
-```
-
-Augment (SSE for both servers – recommended):
-
-```json
-{
- "mcpServers": {
- "memory": { "type": "sse", "url": "http://localhost:8000/sse", "disabled": false },
- "qdrant-indexer": { "type": "sse", "url": "http://localhost:8001/sse", "disabled": false }
- }
-}
-```
-
-Qodo (RMCP; add each tool individually):
-
-**Note**: In Qodo, you must add each MCP tool separately through the UI, not as a single JSON config.
-
-For each tool, use this format:
-
-**Tool 1 - memory:**
-```json
-{
- "memory": { "url": "http://localhost:8002/mcp" }
-}
-```
-
-**Tool 2 - qdrant-indexer:**
-```json
-{
- "qdrant-indexer": { "url": "http://localhost:8003/mcp" }
-}
-```
-
-#### Important for IDE agents (Cursor/Windsurf/Augment)
-- Do not send null values to MCP tools. Omit the field or pass an empty string "" instead.
-- qdrant-index examples:
- - {"subdir":"","recreate":false,"collection":"my-collection","repo_name":"workspace"}
- - {"subdir":"scripts","recreate":true}
-- For indexing the repo root with no params, use the zero-arg tool `qdrant_index_root` (new) or call `qdrant-index` with `subdir:""`.
-
-
-##### Zero-config search tool (new)
-- repo_search: run code search without filters or config.
- - Structured fields supported (parity with DSL): language, under, kind, symbol, ext, not_, case, path_regex, path_glob, not_glob
- - Response shaping: compact (bool) returns only path/start_line/end_line
- - Smart default: compact=true when query is an array with multiple queries (unless explicitly set)
- - If include_snippet is true, compact is forced off so snippet fields are returned
-
- - Glob fields accept a single string or an array; you can also pass a comma-separated string which will be split
- - Query parsing: accepts query or queries; JSON arrays, JSON-stringified arrays, comma-separated strings; also supports q/text aliases
-
- - Parity note: path_glob/not_glob list handling works in both modes — in-process and subprocess — with OR semantics for path_glob and reject-on-any for not_glob.
- - Examples:
- - {"query": "semantic chunking"}
- - {"query": ["function to split code", "overlapping chunks"], "limit": 15, "per_path": 3}
- - {"query": "watcher debounce", "language": "python", "under": "scripts/", "include_snippet": true, "context_lines": 2}
- - {"query": "parser", "ext": "ts", "path_regex": "/services/.+", "compact": true}
- - {"query": "adapter", "path_glob": ["**/src/**", "**/pkg/**"], "not_glob": "**/tests/**"}
- - Returns structured results: score, path, symbol, start_line, end_line, and optional snippet; or compact form.
-- code_search: alias of repo_search (same args) for easier discovery in some clients.
-
-- qdrant_status: return collection size and last index times (safe, read-only).
- - {"collection": "my-collection"}
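The permissive parsing described above can be sketched as follows (an approximation of the compat-wrapper behavior; the server may differ in edge cases):

```python
import json

def normalize_queries(args):
    """Normalize q/text/query/queries payloads into a list of query strings.

    Accepts a plain string, a JSON array, a JSON-stringified array,
    or a comma-separated string.
    """
    raw = args.get("queries") or args.get("query") or args.get("q") or args.get("text")
    if raw is None:
        return []
    if isinstance(raw, list):
        return [str(x) for x in raw]
    s = str(raw).strip()
    if s.startswith("["):  # JSON-stringified array
        try:
            return [str(x) for x in json.loads(s)]
        except json.JSONDecodeError:
            pass
    return [p.strip() for p in s.split(",") if p.strip()]

print(normalize_queries({"q": '["a", "b"]'}))  # ['a', 'b']
print(normalize_queries({"query": "x, y"}))    # ['x', 'y']
```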
-
-
-Verification:
-- You should see tools from both servers (e.g., `store`, `find`, `repo_search`, `code_search`, `context_search`, `qdrant_list`, `qdrant_index`, `qdrant_prune`, `qdrant_status`).
-- Call `qdrant_list` to confirm Qdrant connectivity.
-- Call `qdrant_index` with args like `{ "subdir": "scripts", "recreate": true }` to (re)index the mounted repo.
-- Call `context_search` with `{ "include_memories": true }` to blend memory+code (requires enabling MEMORY_SSE_ENABLED on the indexer service).
-
-- qdrant_list with no args
-- qdrant_prune with no args
-
-
-Notes:
-- The indexer reads env from `.env` (QDRANT_URL, COLLECTION_NAME, EMBEDDING_MODEL).
-- Default chunking: ~120 lines with 20-line overlap.
-- Skips typical build/venv directories.
-- Populates `metadata.kind`, `metadata.symbol`, and `metadata.symbol_path` for Python/JS/TS/Go/Java/Rust/Terraform (best-effort), per chunk.
-- Uses the same collection as the MCP server.
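The default chunking (~120 lines with 20-line overlap) behaves roughly like this sketch:

```python
def chunk_lines(lines, size=120, overlap=20):
    """Yield (start_line, end_line, text) windows over a file.

    Each new chunk starts size-overlap lines after the previous one,
    so adjacent chunks share `overlap` lines of context.
    """
    step = size - overlap
    for start in range(0, max(len(lines), 1), step):
        window = lines[start:start + size]
        if not window:
            break
        yield start + 1, start + len(window), "\n".join(window)
        if start + size >= len(lines):
            break

lines = [f"line {i}" for i in range(1, 301)]  # a 300-line file
spans = [(s, e) for s, e, _ in chunk_lines(lines)]
print(spans)  # [(1, 120), (101, 220), (201, 300)]
```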
-
-### Exclusions (.qdrantignore) and defaults
-
-- The indexer now supports a `.qdrantignore` file at the repo root (similar to `.gitignore`). Use it to exclude directories/files from indexing.
-- Sensible defaults are excluded automatically (overridable): `/models`, `/node_modules`, `/dist`, `/build`, `/.venv`, `/venv`, `/__pycache__`, `/.git`, and files matching `*.onnx`, `*.bin`, `*.safetensors`, `tokenizer.json`, `*.whl`, `*.tar.gz`.
-- Override via env or flags:
- - Env: `QDRANT_DEFAULT_EXCLUDES=0` to disable defaults; `QDRANT_IGNORE_FILE=.myignore`; `QDRANT_EXCLUDES='tokenizer.json,*.onnx,/third_party'`
- - CLI examples:
- - `docker compose run --rm indexer --root /work --ignore-file .qdrantignore`
- - `docker compose run --rm indexer --root /work --no-default-excludes --exclude '/vendor' --exclude '*.bin'`
-
-### Scaling and tuning (small → large codebases)
-
-- Chunking and batching are tunable via env or flags:
- - `INDEX_CHUNK_LINES` (default 120), `INDEX_CHUNK_OVERLAP` (default 20)
- - `INDEX_BATCH_SIZE` (default 64)
- - `INDEX_PROGRESS_EVERY` (default 200 files; 0 disables)
-- CLI equivalents: `--chunk-lines`, `--chunk-overlap`, `--batch-size`, `--progress-every`.
-- Recommendations:
- - Small repos (<100 files): chunk 80–120, overlap 16–24, batch-size 32–64
- - Medium (100s–1k files): chunk 120–160, overlap ~20, batch-size 64–128
- - Large monorepos (1k+): start with defaults; consider `INDEX_PROGRESS_EVERY=200` for visibility and `INDEX_BATCH_SIZE=128` if RAM allows
-
-### Prune stale points (optional)
-
-If files were deleted or significantly changed outside the indexer, remove stale points safely:
-
-```bash
-make prune
-```
-
-
-## ReFRAG micro-chunking (retrieval-side, production-ready)
-
-ReFRAG-lite is enabled in this repo and can be toggled via env. It provides:
-- Token-level micro-chunking at ingest (tiny k-token windows with stride)
-- Compact vector gating and optional gate-first candidate restriction
-- Span compaction and a global token budget at search time
-
-Enable and tune:
-
-````ini
-# Enable compressed retrieval with micro-chunks
-REFRAG_MODE=1
-INDEX_MICRO_CHUNKS=1
-
-# Micro windowing
-MICRO_CHUNK_TOKENS=16
-MICRO_CHUNK_STRIDE=8
-
-# Output shaping and budget
-MICRO_OUT_MAX_SPANS=3
-MICRO_MERGE_LINES=4
-MICRO_BUDGET_TOKENS=512
-MICRO_TOKENS_PER_LINE=32
-
-# Optional: gate-first using mini vectors to prefilter dense search
-REFRAG_GATE_FIRST=0
-REFRAG_CANDIDATES=200
-````
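With the defaults above (16-token windows, stride 8), micro windowing behaves roughly like:

```python
def micro_chunks(tokens, k=16, stride=8):
    """Tiny k-token windows with stride, mirroring
    MICRO_CHUNK_TOKENS / MICRO_CHUNK_STRIDE."""
    out = []
    for start in range(0, len(tokens), stride):
        window = tokens[start:start + k]
        if not window:
            break
        out.append((start, window))
        if start + k >= len(tokens):
            break
    return out

toks = list(range(40))  # a 40-token chunk
wins = micro_chunks(toks)
print([(s, len(w)) for s, w in wins])  # [(0, 16), (8, 16), (16, 16), (24, 16)]
```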
-
-Reindex after changing chunking:
-
-````bash
-# Recreate collection (safe for local dev)
-docker compose exec mcp_indexer python -c "from scripts.mcp_indexer_server import qdrant_index_root; qdrant_index_root(recreate=True)"
-````
-
-What results look like (context_search / code_search return shape):
-
-````json
-{
- "score": 0.9234,
- "path": "scripts/ingest_code.py",
- "start_line": 120,
- "end_line": 148,
- "span_budgeted": true,
- "budget_tokens_used": 224,
- "components": { "dense": 0.78, "lex": 0.35, "mini": 0.81 },
- "why": ["dense", "mini"]
-}
-````
-
-Notes:
-- span_budgeted=true indicates adjacent micro hits were merged and counted toward the global token budget.
-- Tune MICRO_* to control prompt footprint. Increase MICRO_MERGE_LINES to merge looser spans; reduce MICRO_OUT_MAX_SPANS for more file diversity.
-- Gate-first reduces dense search compute on large collections; keep off for tiny repos.
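Span compaction and budgeting can be pictured with this simplified sketch (the server-side implementation also weighs scores; the parameters map to MICRO_MERGE_LINES, MICRO_OUT_MAX_SPANS, MICRO_BUDGET_TOKENS, MICRO_TOKENS_PER_LINE):

```python
def merge_spans(hits, merge_lines=4, max_spans=3, budget_tokens=512, tokens_per_line=32):
    """Merge adjacent micro hits into spans and apply a global token budget."""
    spans, used = [], 0
    for path, start, end in sorted(hits):
        # Merge with the previous span when it is the same file and close enough.
        if spans and spans[-1][0] == path and start - spans[-1][2] <= merge_lines:
            prev = spans[-1]
            spans[-1] = (path, prev[1], max(prev[2], end))
            continue
        cost = (end - start + 1) * tokens_per_line
        if len(spans) >= max_spans or used + cost > budget_tokens:
            break  # global budget or span cap reached
        spans.append((path, start, end))
        used += cost
    return spans

hits = [("a.py", 10, 14), ("a.py", 16, 20), ("b.py", 1, 5)]
print(merge_spans(hits))  # [('a.py', 10, 20), ('b.py', 1, 5)]
```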
-
-
-## Decoder-path ReFRAG (feature-flagged)
-
-This stack ships a feature-flagged decoder integration path via a llama.cpp sidecar.
-It is production-safe by default (off) and can run in a fallback “prompt” mode
-that uses a compressed textual context. A future “soft” mode will inject projected
-chunk embeddings into a patched llama.cpp server.
-
-
-### Decoder-path dataflow (compress → sense → expand)
-
-```mermaid
-flowchart LR
- %% Retrieval side
- Q[Query] --> R[Hybrid search + span budgeting]
- R --> S[Selected micro-spans]
-
- %% Projection (φ) and modes
- S -->|project via φ| P[(Soft embeddings)]
- S -. prompt compress .-> C[Compressed prompt]
-
- %% Decoder service
- subgraph Decoder
- G[[llama.cpp :8080]]
- end
-
- %% Mode routing
- P -->|soft mode| G
- C -->|prompt mode| G
-
- %% Output
- G --> O[Completion]
-
- %% Notes
- classDef opt stroke-dasharray: 5 5
- class C opt
-```
-
-Enable (safe default is off):
-
-````ini
-REFRAG_DECODER=1
-REFRAG_RUNTIME=llamacpp
-LLAMACPP_URL=http://llamacpp:8080
-REFRAG_DECODER_MODE=prompt # prompt|soft (soft requires patched llama.cpp)
-REFRAG_ENCODER_MODEL=BAAI/bge-base-en-v1.5
-REFRAG_PHI_PATH=/work/models/refrag_phi_768_to_dmodel.json
-````
-
-Bring up llama.cpp sidecar (optional):
-
-````bash
-docker compose up -d llamacpp
-````
-
-Make-based provisioning (recommended):
-
-````bash
-# downloads a tiny GGUF to ./models/model.gguf (override URL via LLAMACPP_MODEL_URL)
-make llamacpp-up
-# or just fetch the model without starting the service
-make llama-model
-````
-
-Optional: bake the model into the image (no host volume required):
-
-````bash
-# builds an image that includes the model specified by MODEL_URL
-make llamacpp-build-image LLAMACPP_MODEL_URL=https://huggingface.co/.../tiny.gguf
-# then in docker-compose.yml, either remove the ./models volume for llamacpp
-# or override the service to use image: context-llamacpp:tiny
-````
-
-
-Programmatic use:
-
-````python
-from scripts.refrag_llamacpp import LlamaCppRefragClient
-c = LlamaCppRefragClient() # uses LLAMACPP_URL
-text = c.generate_with_soft_embeddings("Question: ...\n", soft_embeddings=None, max_tokens=128)
-````
-
-
-Notes:
-- φ file format: JSON 2D array with shape (d_in, d_model). See scripts/refrag_phi.py. Set REFRAG_PHI_PATH to your JSON file.
-
-- In prompt mode, the client calls /completion on the llama.cpp server with a compressed prompt.
-- In soft mode, the client will require a patched server to accept soft embeddings. The flag ensures no breakage.
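A pure-Python sketch of loading φ and projecting a retrieval embedding into decoder space (the real path would typically use numpy; shapes follow the (d_in, d_model) JSON format above):

```python
import json

def load_phi(path):
    """Load the projection matrix: a JSON 2D array of shape (d_in, d_model)."""
    with open(path) as f:
        return json.load(f)  # list of d_in rows, each d_model wide

def project(embedding, phi):
    """Project an embedding of length d_in to length d_model:
    out[j] = sum_i embedding[i] * phi[i][j]."""
    d_model = len(phi[0])
    return [sum(e * row[j] for e, row in zip(embedding, phi)) for j in range(d_model)]

phi = [[1, 0, 0], [0, 2, 0]]     # toy (d_in=2, d_model=3) matrix
print(project([3.0, 4.0], phi))  # [3.0, 8.0, 0.0]
```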
-
-
-### Alternative: GLM API Provider
-
-Instead of running llama.cpp locally, you can use the GLM API (ZhipuAI) as your decoder backend:
-
-**Setup:**
-```bash
-# In .env
-REFRAG_DECODER=1
-REFRAG_RUNTIME=glm # Switch from llamacpp to glm
-GLM_API_KEY=your-api-key # Required
-GLM_MODEL=glm-4.6 # Optional, defaults to glm-4.6
-```
-
-**How it works:**
-- Uses OpenAI SDK with `base_url="https://api.z.ai/api/paas/v4/"`
-- Supports prompt mode only (soft embeddings ignored)
-- Handles GLM-4.6's reasoning mode (`reasoning_content` field)
-- Drop-in replacement for llama.cpp—same interface, no code changes needed
-
-**Switch back to llama.cpp:**
-```bash
-REFRAG_RUNTIME=llamacpp
-```
-
-The GLM provider is implemented in `scripts/refrag_glm.py` and automatically selected when `REFRAG_RUNTIME=glm`.
-
-
-## How context_answer works (with decoder)
-
-The `context_answer` MCP tool answers natural-language questions using retrieval + a decoder sidecar.
-
-- Inputs (most relevant): `query`, `limit`, `per_path`, `budget_tokens`, `include_snippet`, `collection`, `language`, `path_glob/not_glob`
-- Outputs:
- - `answer` (string)
- - `citations`: `[ { path, start_line, end_line, container_path? }, ... ]`
- - `query`: list of query strings actually used
- - `used`: `{ "gate_first": true|false, "refrag": true|false }`
-
-Pipeline
-1) Hybrid search (gate-first): Uses MINI-vector gating when `REFRAG_GATE_FIRST=1` to prefilter candidates, then runs dense+lexical fusion
-2) Micro-span budgeting: Merges adjacent micro hits and applies a global token budget (`REFRAG_MODE=1`, `MICRO_BUDGET_TOKENS`, `MICRO_OUT_MAX_SPANS`)
-3) Prompt assembly: Builds compact context blocks and a “Sources” footer
-4) Decoder call: When `REFRAG_DECODER=1`, calls the configured runtime (`REFRAG_RUNTIME=llamacpp` or `glm`) to synthesize the final answer
-5) Return: Answer + citations + usage flags; errors keep citations for debugging
-
-Environment toggles
-- Retrieval: `REFRAG_MODE=1`, `REFRAG_GATE_FIRST=1`, `REFRAG_CANDIDATES=200`
-- Budgeting/output: `MICRO_BUDGET_TOKENS`, `MICRO_OUT_MAX_SPANS`
-- Decoder: `REFRAG_DECODER=1`, `LLAMACPP_URL=http://localhost:8080`
-
-Fallbacks and safety
-- If gate-first yields 0 items and no strict language filter is set, the tool automatically retries without gating
-- If the decoder call fails, the response contains `{ "error": "..." }` plus `citations`, so you can still inspect sources
-
-Quick health + example
-```bash
-# Decoder health (llama.cpp sidecar)
-curl -s http://localhost:8080/health
-
-# Qdrant
-curl -sSf http://localhost:6333/readyz >/dev/null && echo "Qdrant OK"
-```
-
-```python
-# Minimal local call (uses the running MCP indexer server code)
-import os, asyncio
-os.environ.update(
- QDRANT_URL="http://localhost:6333",
- COLLECTION_NAME="my-collection",
- REFRAG_MODE="1", REFRAG_GATE_FIRST="1",
- REFRAG_DECODER="1", LLAMACPP_URL="http://localhost:8080",
-)
-from scripts import mcp_indexer_server as srv
-async def t():
- out = await srv.context_answer(query="How does hybrid search work?", limit=5)
- print(out["used"], len(out.get("citations", [])), len(out.get("answer", "")))
-asyncio.run(t())
-```
-
-Implementation
-- See `scripts/mcp_indexer_server.py` (`context_answer` tool) for the full pipeline, env knobs, and debug flags (`DEBUG_CONTEXT_ANSWER=1`).
-
-### MCP search filtering (language, path, kind)
-
-- The indexer creates payload indexes for efficient filtering.
-- When querying (via MCP client or scripts), you can filter by:
- - `metadata.language` (e.g., python, typescript, javascript, go, rust)
- - `metadata.path_prefix` (e.g., `/work/src`)
- - `metadata.kind` (e.g., function, class, method)
-- Example: in the provided reranker script you can do:
-
-```bash
-make rerank ARGS="--language python --under /work/scripts"
-```
-
-### Operational safeguards and troubleshooting
-
-- Tokenizer for micro-chunking: set TOKENIZER_JSON to a valid tokenizer.json path (default: models/tokenizer.json). If missing, the indexer falls back to line-based chunking.
-- Cap micro-chunks per file: MAX_MICRO_CHUNKS_PER_FILE (default 2000) to prevent runaway chunk counts on very large files.
-- Qdrant client timeout: QDRANT_TIMEOUT (seconds, default 20) applies to all MCP Qdrant calls.
-- Memory auto-detect caching: MEMORY_AUTODETECT=1 by default with MEMORY_COLLECTION_TTL_SECS (default 300s) to avoid repeatedly sampling all collections.
-- Schema repair: ensure_collection now repairs missing named vectors (lex, and mini when REFRAG_MODE=1) on existing collections.
-
-- Most MCP clients allow passing tool args that map to server-side filters. If your client supports adding structured args to `qdrant-find`, prefer these filters over raw Qdrant queries to reduce noise.
-
-
-### Payload indexes (created for you)
-
-We create payload indexes to accelerate filtered searches:
-- `metadata.language` (keyword)
-- `metadata.path_prefix` (keyword)
-- `metadata.repo` (keyword)
-- `metadata.kind` (keyword)
-- `metadata.symbol` (keyword)
-- `metadata.symbol_path` (keyword)
-- `metadata.imports` (keyword)
-- `metadata.calls` (keyword)
-- `metadata.file_hash` (keyword)
-- `metadata.ingested_at` (keyword)
-- Git history fields available in payload: `commit_id`, `author_name`, `authored_date`, `message`, `files`
-
-Payload indexes enable fast server-side filters (e.g., language, path_prefix, kind, symbol). Prefer using the MCP tools repo_search/context_search with filter arguments rather than raw Qdrant REST/Python snippets. See the Qdrant documentation if you need low-level API examples.
-### Best-practice querying
-
-- Use precise intent + language: “python chunking function for Qdrant indexing”
-- Add path hints when you know the area: “under scripts or ingestion code”
-- Try 2–3 alternative phrasings (multi-query) and pick the consensus
-- Prefer results where `metadata.language` matches your target file
-- For navigation, prefer results where `metadata.path_prefix` matches your directory
-
-Client tips:
-- MCP tools: issue multiple finds with variant phrasings and re-rank by score + metadata match
-- Direct Qdrant: use `vector={name: ..., vector: ...}` with the named vector above
-- Data persists in the `qdrant_storage` Docker volume.
-- The MCP server uses SSE transport and will auto-create the collection if it doesn't exist.
-- Only FastEmbed models are supported at this time.
-
-## Troubleshooting
-
-### Collection Health & Cache Sync
-
-The stack includes automatic health checks that detect and fix cache/collection sync issues:
-
-**Check collection health:**
-```bash
-python scripts/collection_health.py --workspace . --collection codebase
-```
-
-**Auto-heal cache issues:**
-```bash
-python scripts/collection_health.py --workspace . --collection codebase --auto-heal
-```
-
-**What it detects:**
-- Empty collection with cached files (cache thinks files are indexed but they're not)
-- Significant mismatch between cached files and actual collection contents
-- Missing metadata in collection points
+> **See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for detailed system design.**
-**When to use:**
-- After manually deleting collections
-- If searches return no results despite indexing
-- After Qdrant crashes or data loss
-- When switching between collection names
+---
-**Automatic healing:**
-- Health checks run automatically on watcher and indexer startup
-- Cache is cleared when sync issues are detected
-- Files are reindexed on next run
+## License
-### General Issues
+MIT
-- If the MCP servers can’t reach Qdrant, confirm both containers are up: `make ps`.
-- If the SSE port collides, change `FASTMCP_PORT` in `.env` and the mapped port in `docker-compose.yml`.
-- If you customize tool descriptions, restart: `make restart`.
-- If searches return no results, check collection health (see above).
diff --git a/deploy/kubernetes/README.md b/deploy/kubernetes/README.md
index 3573917e..5adfeb8c 100644
--- a/deploy/kubernetes/README.md
+++ b/deploy/kubernetes/README.md
@@ -1,5 +1,9 @@
# Kubernetes Deployment Guide
+**Documentation:** [README](../../README.md) · [Configuration](../../docs/CONFIGURATION.md) · [IDE Clients](../../docs/IDE_CLIENTS.md) · [MCP API](../../docs/MCP_API.md) · [ctx CLI](../../docs/CTX_CLI.md) · [Memory Guide](../../docs/MEMORY_GUIDE.md) · [Architecture](../../docs/ARCHITECTURE.md) · [Multi-Repo](../../docs/MULTI_REPO_COLLECTIONS.md) · Kubernetes · [VS Code Extension](../../docs/vscode-extension.md) · [Troubleshooting](../../docs/TROUBLESHOOTING.md) · [Development](../../docs/DEVELOPMENT.md)
+
+---
+
## Overview
This directory contains Kubernetes manifests for deploying Context Engine on a remote cluster using **Kustomize**. This enables:
diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md
index 71172431..fa59d133 100644
--- a/docs/ARCHITECTURE.md
+++ b/docs/ARCHITECTURE.md
@@ -1,5 +1,18 @@
# Context Engine Architecture
+**Documentation:** [README](../README.md) · [Configuration](CONFIGURATION.md) · [IDE Clients](IDE_CLIENTS.md) · [MCP API](MCP_API.md) · [ctx CLI](CTX_CLI.md) · [Memory Guide](MEMORY_GUIDE.md) · [Architecture](ARCHITECTURE.md) · [Multi-Repo](MULTI_REPO_COLLECTIONS.md) · [Kubernetes](../deploy/kubernetes/README.md) · [VS Code Extension](vscode-extension.md) · [Troubleshooting](TROUBLESHOOTING.md) · [Development](DEVELOPMENT.md)
+
+---
+
+**On this page:**
+- [Overview](#overview)
+- [Core Principles](#core-principles)
+- [System Architecture](#system-architecture)
+- [Data Flow](#data-flow)
+- [ReFRAG Pipeline](#refrag-pipeline)
+
+---
+
## Overview
Context Engine is a production-ready MCP (Model Context Protocol) retrieval stack that unifies code indexing, hybrid search, and optional LLM decoding. It enables teams to ship context-aware AI agents by providing sophisticated semantic and lexical search capabilities with dual-transport compatibility.
diff --git a/docs/CONFIGURATION.md b/docs/CONFIGURATION.md
new file mode 100644
index 00000000..dda96fdd
--- /dev/null
+++ b/docs/CONFIGURATION.md
@@ -0,0 +1,161 @@
+# Configuration Reference
+
+Complete environment variable reference for Context Engine.
+
+**Documentation:** [README](../README.md) · [Configuration](CONFIGURATION.md) · [IDE Clients](IDE_CLIENTS.md) · [MCP API](MCP_API.md) · [ctx CLI](CTX_CLI.md) · [Memory Guide](MEMORY_GUIDE.md) · [Architecture](ARCHITECTURE.md) · [Multi-Repo](MULTI_REPO_COLLECTIONS.md) · [Kubernetes](../deploy/kubernetes/README.md) · [VS Code Extension](vscode-extension.md) · [Troubleshooting](TROUBLESHOOTING.md) · [Development](DEVELOPMENT.md)
+
+---
+
+**On this page:**
+- [Core Settings](#core-settings)
+- [Indexing & Micro-Chunks](#indexing--micro-chunks)
+- [Watcher Settings](#watcher-settings)
+- [Reranker](#reranker)
+- [Decoder (llama.cpp / GLM)](#decoder-llamacpp--glm)
+- [ReFRAG](#refrag-micro-chunking--retrieval)
+- [Ports](#ports)
+- [Search & Expansion](#search--expansion)
+- [Memory Blending](#memory-blending)
+
+---
+
+## Core Settings
+
+| Name | Description | Default |
+|------|-------------|---------|
+| COLLECTION_NAME | Qdrant collection name (unified across all repos) | codebase |
+| REPO_NAME | Logical repo tag stored in payload for filtering | auto-detect from git/folder |
+| HOST_INDEX_PATH | Host path mounted at /work in containers | current repo (.) |
+| QDRANT_URL | Qdrant base URL | container: http://qdrant:6333; local: http://localhost:6333 |
+
+## Indexing & Micro-Chunks
+
+| Name | Description | Default |
+|------|-------------|---------|
+| INDEX_MICRO_CHUNKS | Enable token-based micro-chunking | 0 (off) |
+| MAX_MICRO_CHUNKS_PER_FILE | Cap micro-chunks per file | 200 |
+| TOKENIZER_URL | HF tokenizer.json URL (for Make download) | n/a |
+| TOKENIZER_PATH | Local path where tokenizer is saved (Make) | models/tokenizer.json |
+| TOKENIZER_JSON | Runtime path for tokenizer (indexer) | models/tokenizer.json |
+| USE_TREE_SITTER | Enable tree-sitter parsing (py/js/ts) | 0 (off) |
+| INDEX_CHUNK_LINES | Lines per chunk (non-micro mode) | 120 |
+| INDEX_CHUNK_OVERLAP | Overlap lines between chunks | 20 |
+| INDEX_BATCH_SIZE | Upsert batch size | 64 |
+| INDEX_PROGRESS_EVERY | Log progress every N files | 200 |
+
+## Watcher Settings
+
+| Name | Description | Default |
+|------|-------------|---------|
+| WATCH_DEBOUNCE_SECS | Debounce between FS events | 1.5 |
+| INDEX_UPSERT_BATCH | Upsert batch size (watcher) | 128 |
+| INDEX_UPSERT_RETRIES | Retry count | 5 |
+| INDEX_UPSERT_BACKOFF | Seconds between retries | 0.5 |
+| QDRANT_TIMEOUT | HTTP timeout seconds | watcher: 60; search: 20 |
+| MCP_TOOL_TIMEOUT_SECS | Max duration for long-running MCP tools | 3600 |
+
+## Reranker
+
+| Name | Description | Default |
+|------|-------------|---------|
+| RERANKER_ONNX_PATH | Local ONNX cross-encoder model path | unset |
+| RERANKER_TOKENIZER_PATH | Tokenizer path for reranker | unset |
+| RERANKER_ENABLED | Enable reranker by default | 1 (enabled) |
+
+## Decoder (llama.cpp / GLM)
+
+| Name | Description | Default |
+|------|-------------|---------|
+| REFRAG_DECODER | Enable decoder for context_answer | 1 (enabled) |
+| REFRAG_RUNTIME | Decoder backend: llamacpp or glm | llamacpp |
+| LLAMACPP_URL | llama.cpp server endpoint | http://llamacpp:8080 or http://host.docker.internal:8081 |
+| LLAMACPP_TIMEOUT_SEC | Decoder request timeout | 300 |
+| DECODER_MAX_TOKENS | Max tokens for decoder responses | 4000 |
+| REFRAG_DECODER_MODE | prompt or soft (soft requires patched llama.cpp) | prompt |
+| GLM_API_KEY | API key for GLM provider | unset |
+| GLM_MODEL | GLM model name | glm-4.6 |
+| USE_GPU_DECODER | Native Metal decoder (1) vs Docker (0) | 0 (docker) |
+| LLAMACPP_GPU_LAYERS | Number of layers to offload to GPU, -1 for all | 32 |
+
+## ReFRAG (Micro-Chunking & Retrieval)
+
+| Name | Description | Default |
+|------|-------------|---------|
+| REFRAG_MODE | Enable micro-chunking and span budgeting | 1 (enabled) |
+| REFRAG_GATE_FIRST | Enable mini-vector gating | 1 (enabled) |
+| REFRAG_CANDIDATES | Candidates for gate-first filtering | 200 |
+| MICRO_BUDGET_TOKENS | Token budget for context_answer | 512 |
+| MICRO_OUT_MAX_SPANS | Max spans returned per query | 3 |
+| MICRO_CHUNK_TOKENS | Tokens per micro-chunk window | 16 |
+| MICRO_CHUNK_STRIDE | Stride between windows | 8 |
+| MICRO_MERGE_LINES | Lines to merge adjacent spans | 4 |
+| MICRO_TOKENS_PER_LINE | Estimated tokens per line | 32 |
+
+## Ports
+
+| Name | Description | Default |
+|------|-------------|---------|
+| FASTMCP_PORT | Memory MCP server port (SSE) | 8000 |
+| FASTMCP_INDEXER_PORT | Indexer MCP server port (SSE) | 8001 |
+| FASTMCP_HTTP_PORT | Memory RMCP host port mapping | 8002 |
+| FASTMCP_INDEXER_HTTP_PORT | Indexer RMCP host port mapping | 8003 |
+| FASTMCP_HEALTH_PORT | Health port (memory/indexer) | memory: 18000; indexer: 18001 |
+
+## Search & Expansion
+
+| Name | Description | Default |
+|------|-------------|---------|
+| HYBRID_EXPAND | Enable heuristic multi-query expansion | 0 (off) |
+| LLM_EXPAND_MAX | Max alternate queries via LLM | 0 |
+
+## Memory Blending
+
+| Name | Description | Default |
+|------|-------------|---------|
+| MEMORY_SSE_ENABLED | Enable SSE memory blending | false |
+| MEMORY_MCP_URL | Memory MCP endpoint for blending | http://mcp:8000/sse |
+| MEMORY_MCP_TIMEOUT | Timeout for memory queries | 6 |
+| MEMORY_AUTODETECT | Auto-detect memory collection | 1 |
+| MEMORY_COLLECTION_TTL_SECS | Cache TTL for collection detection | 300 |
+
+---
+
+## Exclusions (.qdrantignore)
+
+The indexer supports a `.qdrantignore` file at the repo root (similar to `.gitignore`).
+
+**Default exclusions** (overridable):
+- `/models`, `/node_modules`, `/dist`, `/build`
+- `/.venv`, `/venv`, `/__pycache__`, `/.git`
+- `*.onnx`, `*.bin`, `*.safetensors`, `tokenizer.json`, `*.whl`, `*.tar.gz`
+
+**Override via env or flags:**
+```bash
+# Disable defaults
+QDRANT_DEFAULT_EXCLUDES=0
+
+# Custom ignore file
+QDRANT_IGNORE_FILE=.myignore
+
+# Additional excludes
+QDRANT_EXCLUDES='tokenizer.json,*.onnx,/third_party'
+```
+
+**CLI examples:**
+```bash
+docker compose run --rm indexer --root /work --ignore-file .qdrantignore
+docker compose run --rm indexer --root /work --no-default-excludes --exclude '/vendor' --exclude '*.bin'
+```
+
+---
+
+## Scaling Recommendations
+
+| Repo Size | Chunk Lines | Overlap | Batch Size |
+|-----------|------------|---------|------------|
+| Small (<100 files) | 80-120 | 16-24 | 32-64 |
+| Medium (100s-1k files) | 120-160 | ~20 | 64-128 |
+| Large (1k+ files) | 120 (default) | 20 | 128+ |
+
+For large monorepos, keep `INDEX_PROGRESS_EVERY` at a low value (the default is 200) so indexing progress stays visible in the logs.
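+
+For example, the large-repo row maps to the following `.env` fragment (variable names come from the tables above; tune to your hardware):
+
+```bash
+# Large monorepo (1k+ files)
+INDEX_CHUNK_LINES=120
+INDEX_CHUNK_OVERLAP=20
+INDEX_BATCH_SIZE=128
+INDEX_PROGRESS_EVERY=200
+```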
+
diff --git a/docs/CTX_CLI.md b/docs/CTX_CLI.md
new file mode 100644
index 00000000..2a0f620c
--- /dev/null
+++ b/docs/CTX_CLI.md
@@ -0,0 +1,166 @@
+# ctx.py - Prompt Enhancer CLI
+
+A thin CLI that retrieves code context and rewrites your input into a better, context-aware prompt using the local LLM decoder. Works with both questions and commands/instructions.
+
+**Documentation:** [README](../README.md) · [Configuration](CONFIGURATION.md) · [IDE Clients](IDE_CLIENTS.md) · [MCP API](MCP_API.md) · [ctx CLI](CTX_CLI.md) · [Memory Guide](MEMORY_GUIDE.md) · [Architecture](ARCHITECTURE.md) · [Multi-Repo](MULTI_REPO_COLLECTIONS.md) · [Kubernetes](../deploy/kubernetes/README.md) · [VS Code Extension](vscode-extension.md) · [Troubleshooting](TROUBLESHOOTING.md) · [Development](DEVELOPMENT.md)
+
+---
+
+**On this page:**
+- [Basic Usage](#basic-usage)
+- [Detail Mode](#detail-mode)
+- [Unicorn Mode](#unicorn-mode)
+- [Advanced Features](#advanced-features)
+- [GPU Acceleration](#gpu-acceleration)
+- [Configuration](#configuration)
+
+---
+
+## Basic Usage
+
+```bash
+# Questions: Enhanced with specific details and multiple aspects
+scripts/ctx.py "What is ReFRAG?"
+
+# Commands: Enhanced with concrete targets and implementation details
+scripts/ctx.py "Refactor ctx.py"
+
+# Via Make target
+make ctx Q="Explain the caching logic to me in detail"
+
+# Filter by language/path or adjust tokens
+make ctx Q="Hybrid search details" ARGS="--language python --under scripts/ --limit 2 --rewrite-max-tokens 200"
+```
+
+## Detail Mode
+
+Include compact code snippets in the retrieved context for richer rewrites (trades speed for quality):
+
+```bash
+# Enable detail mode (adds short snippets)
+scripts/ctx.py "Explain the caching logic" --detail
+
+# Detail mode with commands
+scripts/ctx.py "Add error handling to ctx.py" --detail
+
+# Adjust snippet size (default is 1 line when --detail is used)
+make ctx Q="Explain hybrid search" ARGS="--detail --context-lines 2"
+```
+
+**Notes:**
+- Default behavior is header-only (fastest). `--detail` adds short snippets.
+- Detail mode is optimized for speed: automatically clamps to max 4 results and 1 result per file.
+
+## Unicorn Mode
+
+Use `--unicorn` for the highest quality prompt enhancement with a staged 2-3 pass approach:
+
+```bash
+# Unicorn mode with commands
+scripts/ctx.py "refactor ctx.py" --unicorn
+
+# Unicorn mode with questions
+scripts/ctx.py "what is ReFRAG and how does it work?" --unicorn
+
+# Works with all filters
+scripts/ctx.py "add error handling" --unicorn --language python
+```
+
+**How it works:**
+
+1. **Pass 1 (Draft)**: Retrieves rich code snippets (8 lines of context) to understand the codebase
+2. **Pass 2 (Refine)**: Retrieves even richer snippets (12 lines) to ground the prompt with concrete code
+3. **Pass 3 (Polish)**: Optional cleanup pass if output appears generic or incomplete
+
+**Key features:**
+- **Code-grounded**: References actual code behaviors and patterns
+- **No hallucinations**: Only uses real code from your indexed repository
+- **Multi-paragraph output**: Produces detailed, comprehensive prompts
+- **Works with both questions and commands**
+
+**When to use:**
+- **Normal mode**: Quick, everyday prompts (fastest)
+- **--detail**: Richer context without multi-pass overhead (balanced)
+- **--unicorn**: When you need the absolute best prompt quality
+
+## Advanced Features
+
+### Streaming Output (Default)
+
+All modes stream tokens as they arrive for instant feedback:
+
+```bash
+scripts/ctx.py "refactor ctx.py" --unicorn
+```
+
+To disable streaming, set `"streaming": false` in `~/.ctx_config.json`.
+
+### Memory Blending
+
+Automatically falls back to `context_search` with memories when repo search returns no hits:
+
+```bash
+# If no code matches, ctx.py will search design docs and ADRs
+scripts/ctx.py "What is our authentication strategy?"
+```
+
+### Adaptive Context Sizing
+
+Automatically adjusts `limit` and `context_lines` based on query characteristics:
+- **Short/vague queries** → More context for richer grounding
+- **Queries with file/function names** → Lighter settings for speed
+
+### Automatic Quality Assurance
+
+Enhanced `_needs_polish()` heuristic triggers a third polish pass when:
+- Output is too short (< 180 chars)
+- Contains generic/vague language
+- Missing concrete code references
+- Lacks proper paragraph structure
+
+### Personalized Templates
+
+Create `~/.ctx_config.json` to customize behavior:
+
+```json
+{
+ "always_include_tests": true,
+ "prefer_bullet_commands": false,
+ "extra_instructions": "Always consider error handling and edge cases",
+ "streaming": true
+}
+```
+
+**Available preferences:**
+- `always_include_tests`: Add testing considerations to all prompts
+- `prefer_bullet_commands`: Format commands as bullet points
+- `extra_instructions`: Custom instructions added to every rewrite
+- `streaming`: Enable/disable streaming output (default: true)
+
+See `ctx_config.example.json` for a template.
+
+## GPU Acceleration
+
+For faster prompt rewriting, use the native Metal-accelerated decoder:
+
+```bash
+# Start the native llama.cpp server with Metal GPU
+scripts/gpu_toggle.sh start
+
+# Now ctx.py will automatically use the GPU decoder on port 8081
+make ctx Q="Explain the caching logic"
+
+# Stop the native GPU server
+scripts/gpu_toggle.sh stop
+```
+
+## Configuration
+
+| Setting | Description | Default |
+|---------|-------------|---------|
+| MCP_INDEXER_URL | Indexer HTTP RMCP endpoint | http://localhost:8003/mcp |
+| USE_GPU_DECODER | Use native GPU decoder (1) instead of Docker (0) | 0 |
+| LLAMACPP_URL | Docker decoder endpoint | http://localhost:8080 |
+
+GPU decoder (after `gpu_toggle.sh start`): http://localhost:8081/completion
+
diff --git a/docs/DEVELOPMENT.md b/docs/DEVELOPMENT.md
index 75c32172..9f44357f 100644
--- a/docs/DEVELOPMENT.md
+++ b/docs/DEVELOPMENT.md
@@ -2,6 +2,19 @@
This guide covers setting up a development environment, understanding the codebase structure, and contributing to Context Engine.
+**Documentation:** [README](../README.md) · [Configuration](CONFIGURATION.md) · [IDE Clients](IDE_CLIENTS.md) · [MCP API](MCP_API.md) · [ctx CLI](CTX_CLI.md) · [Memory Guide](MEMORY_GUIDE.md) · [Architecture](ARCHITECTURE.md) · [Multi-Repo](MULTI_REPO_COLLECTIONS.md) · [Kubernetes](../deploy/kubernetes/README.md) · [VS Code Extension](vscode-extension.md) · [Troubleshooting](TROUBLESHOOTING.md) · [Development](DEVELOPMENT.md)
+
+---
+
+**On this page:**
+- [Prerequisites](#prerequisites)
+- [Quick Start](#quick-start)
+- [Project Structure](#project-structure)
+- [Testing](#testing)
+- [Docker Development](#docker-development)
+
+---
+
## Prerequisites
### Required Software
diff --git a/docs/IDE_CLIENTS.md b/docs/IDE_CLIENTS.md
new file mode 100644
index 00000000..2988577f
--- /dev/null
+++ b/docs/IDE_CLIENTS.md
@@ -0,0 +1,193 @@
+# IDE & Client Configuration
+
+Configuration examples for connecting various IDEs and MCP clients to Context Engine.
+
+**Documentation:** [README](../README.md) · [Configuration](CONFIGURATION.md) · [IDE Clients](IDE_CLIENTS.md) · [MCP API](MCP_API.md) · [ctx CLI](CTX_CLI.md) · [Memory Guide](MEMORY_GUIDE.md) · [Architecture](ARCHITECTURE.md) · [Multi-Repo](MULTI_REPO_COLLECTIONS.md) · [Kubernetes](../deploy/kubernetes/README.md) · [VS Code Extension](vscode-extension.md) · [Troubleshooting](TROUBLESHOOTING.md) · [Development](DEVELOPMENT.md)
+
+---
+
+**On this page:**
+- [Supported Clients](#supported-clients)
+- [SSE Clients](#sse-clients-port-80008001)
+- [RMCP Clients](#rmcp-clients-port-80028003)
+- [Mixed Transport](#mixed-transport-stdio--sse)
+- [Verification](#verification)
+
+---
+
+## Supported Clients
+
+| Client | Transport | Notes |
+|--------|-----------|-------|
+| Roo | SSE/RMCP | Both SSE and RMCP connections |
+| Cline | SSE/RMCP | Both SSE and RMCP connections |
+| Windsurf | SSE/RMCP | Both SSE and RMCP connections |
+| Zed | SSE | Uses mcp-remote bridge |
+| Kiro | SSE | Uses mcp-remote bridge |
+| Qodo | RMCP | Direct HTTP endpoints |
+| OpenAI Codex | RMCP | TOML config |
+| Augment | SSE | Simple JSON configs |
+| AmpCode | SSE | Simple URL for SSE endpoints |
+| Claude Code CLI | SSE | Simple JSON configs |
+
+---
+
+## SSE Clients (port 8000/8001)
+
+### Roo / Cline / Windsurf
+
+```json
+{
+ "mcpServers": {
+ "memory": { "type": "sse", "url": "http://localhost:8000/sse", "disabled": false },
+ "qdrant-indexer": { "type": "sse", "url": "http://localhost:8001/sse", "disabled": false }
+ }
+}
+```
+
+### Augment
+
+```json
+{
+ "mcpServers": {
+ "memory": { "type": "sse", "url": "http://localhost:8000/sse", "disabled": false },
+ "qdrant-indexer": { "type": "sse", "url": "http://localhost:8001/sse", "disabled": false }
+ }
+}
+```
+
+### Kiro
+
+Create `.kiro/settings/mcp.json` in your workspace:
+
+```json
+{
+ "mcpServers": {
+ "qdrant-indexer": { "command": "npx", "args": ["mcp-remote", "http://localhost:8001/sse", "--transport", "sse-only"] },
+ "memory": { "command": "npx", "args": ["mcp-remote", "http://localhost:8000/sse", "--transport", "sse-only"] }
+ }
+}
+```
+
+**Notes:**
+- Kiro expects command/args (stdio). `mcp-remote` bridges to remote SSE endpoints.
+- If `npx` prompts in your environment, add `-y` right after `npx`.
+- Workspace config overrides user-level config (`~/.kiro/settings/mcp.json`).
+
+**Troubleshooting:**
+- Error: "Enabled MCP Server must specify a command, ignoring." → Use the command/args form; do not use the `type`/`url` form in Kiro.
+
+### Zed
+
+Add to your Zed `settings.json` (Command Palette → "Settings: Open Settings (JSON)"):
+
+```json
+{
+ "qdrant-indexer": {
+ "command": "npx",
+ "args": ["mcp-remote", "http://localhost:8001/sse", "--transport", "sse-only"],
+ "env": {}
+ }
+}
+```
+
+**Notes:**
+- Zed expects MCP servers at the root level of settings.json
+- Uses command/args (stdio). mcp-remote bridges to remote SSE endpoints
+- If npx prompts, add `-y` right after npx: `"args": ["-y", "mcp-remote", ...]`
+
+**Alternative (direct HTTP):**
+```json
+{
+ "qdrant-indexer": {
+ "type": "http",
+ "url": "http://localhost:8001/sse"
+ }
+}
+```
+
+---
+
+## RMCP Clients (port 8002/8003)
+
+### Qodo
+
+Add each MCP tool separately through the UI:
+
+**Tool 1 - memory:**
+```json
+{
+ "memory": { "url": "http://localhost:8002/mcp" }
+}
+```
+
+**Tool 2 - qdrant-indexer:**
+```json
+{
+ "qdrant-indexer": { "url": "http://localhost:8003/mcp" }
+}
+```
+
+**Note:** Qodo can talk to RMCP endpoints directly, no `mcp-remote` wrapper needed.
+
+### OpenAI Codex
+
+TOML configuration:
+
+```toml
+experimental_use_rmcp_client = true
+
+[mcp_servers.memory_http]
+url = "http://127.0.0.1:8002/mcp"
+
+[mcp_servers.qdrant_indexer_http]
+url = "http://127.0.0.1:8003/mcp"
+```
+
+---
+
+## Mixed Transport (stdio + SSE)
+
+### Windsurf/Cursor
+
+```json
+{
+ "mcpServers": {
+ "qdrant": {
+ "command": "uvx",
+ "args": ["mcp-server-qdrant"],
+ "env": {
+ "QDRANT_URL": "http://localhost:6333",
+ "COLLECTION_NAME": "my-collection",
+ "EMBEDDING_MODEL": "BAAI/bge-base-en-v1.5"
+ },
+ "disabled": false
+ }
+ }
+}
+```
+
+---
+
+## Important Notes for IDE Agents
+
+- **Do not send null values** to MCP tools. Omit the field or pass an empty string "" instead.
+- **qdrant-index examples:**
+ - `{"subdir":"","recreate":false,"collection":"my-collection","repo_name":"workspace"}`
+ - `{"subdir":"scripts","recreate":true}`
+- To index the repo root with no params, use `qdrant_index_root` (zero-arg) or call `qdrant-index` with `subdir:""`.
+
+---
+
+## Verification
+
+After configuring, you should see tools from both servers:
+- `store`, `find` (Memory)
+- `repo_search`, `code_search`, `context_search`, `context_answer` (Indexer)
+- `qdrant_list`, `qdrant_index`, `qdrant_prune`, `qdrant_status` (Indexer)
+
+Test connectivity:
+- Call `qdrant_list` to confirm Qdrant connectivity
+- Call `qdrant_index` with `{ "subdir": "scripts", "recreate": true }` to test indexing
+- Call `context_search` with `{ "include_memories": true }` to test memory blending
+
diff --git a/docs/MCP_API.md b/docs/MCP_API.md
index 490c3dfc..73b1b8ef 100644
--- a/docs/MCP_API.md
+++ b/docs/MCP_API.md
@@ -2,6 +2,19 @@
This document provides comprehensive API documentation for all MCP (Model Context Protocol) tools exposed by Context Engine's dual-server architecture.
+**Documentation:** [README](../README.md) · [Configuration](CONFIGURATION.md) · [IDE Clients](IDE_CLIENTS.md) · [MCP API](MCP_API.md) · [ctx CLI](CTX_CLI.md) · [Memory Guide](MEMORY_GUIDE.md) · [Architecture](ARCHITECTURE.md) · [Multi-Repo](MULTI_REPO_COLLECTIONS.md) · [Kubernetes](../deploy/kubernetes/README.md) · [VS Code Extension](vscode-extension.md) · [Troubleshooting](TROUBLESHOOTING.md) · [Development](DEVELOPMENT.md)
+
+---
+
+**On this page:**
+- [Overview](#overview)
+- [Memory Server API](#memory-server-api) - `store()`, `find()`
+- [Indexer Server API](#indexer-server-api) - `repo_search()`, `context_search()`, `context_answer()`, etc.
+- [Response Schemas](#response-schemas)
+- [Error Handling](#error-handling)
+
+---
+
## Overview
Context Engine exposes two MCP servers:
@@ -504,6 +517,110 @@ Generate alternative query variations using local LLM (requires decoder enabled)
}
```
+### code_search()
+
+Exact alias of `repo_search()` for discoverability. Same parameters and return shape.
+
+### qdrant_index_root()
+
+Index the entire workspace root (`/work`).
+
+**Parameters:**
+- `recreate` (bool, default false): Drop and recreate collection before indexing
+- `collection` (str, optional): Target collection name
+
+**Returns:** Subprocess result with indexing status.
+
+### search_tests_for()
+
+Find test files related to a query. Presets common test file globs.
+
+**Parameters:**
+- `query` (str or list[str], required): Search query
+- `limit` (int, optional): Max results
+- `include_snippet` (bool, optional): Include code snippets
+- `language` (str, optional): Filter by language
+
+**Returns:** Same shape as `repo_search()`.
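+
+Example request (the query and filter values are illustrative):
+
+```json
+{
+  "query": "watcher debounce",
+  "limit": 5,
+  "language": "python",
+  "include_snippet": true
+}
+```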
+
+### search_config_for()
+
+Find configuration files related to a query. Presets config file globs (YAML/JSON/TOML, etc.).
+
+**Parameters:** Same as `search_tests_for()`.
+
+**Returns:** Same shape as `repo_search()`.
+
+### search_callers_for()
+
+Heuristic search for callers/usages of a symbol.
+
+**Parameters:**
+- `query` (str, required): Symbol name to find callers for
+- `limit` (int, optional): Max results
+- `language` (str, optional): Filter by language
+
+**Returns:** Same shape as `repo_search()`.
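+
+Example request (the symbol name is illustrative):
+
+```json
+{
+  "query": "ensure_collection",
+  "limit": 10,
+  "language": "python"
+}
+```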
+
+### search_importers_for()
+
+Find files likely importing or referencing a module/symbol.
+
+**Parameters:** Same as `search_callers_for()`.
+
+**Returns:** Same shape as `repo_search()`.
+
+### change_history_for_path()
+
+Summarize recent change metadata for a file path from the index.
+
+**Parameters:**
+- `path` (str, required): Relative path under /work
+- `collection` (str, optional): Target collection
+- `max_points` (int, optional): Cap on scanned points
+
+**Returns:**
+```json
+{
+ "ok": true,
+ "summary": {
+ "path": "scripts/ctx.py",
+ "last_modified": "2025-01-15T14:22:00"
+ }
+}
+```
+
+### collection_map()
+
+Return collection↔repo mappings with optional Qdrant payload samples.
+
+**Parameters:**
+- `search_root` (str, optional): Directory to scan
+- `collection` (str, optional): Filter by collection
+- `repo_name` (str, optional): Filter by repo
+- `include_samples` (bool, optional): Include payload samples
+- `limit` (int, optional): Max entries
+
+**Returns:** Mapping of collections to repositories.
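+
+Example request (parameter values are illustrative):
+
+```json
+{
+  "repo_name": "workspace",
+  "include_samples": true,
+  "limit": 20
+}
+```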
+
+### set_session_defaults() (Indexer)
+
+Set default collection for subsequent calls on the same session.
+
+**Parameters:**
+- `collection` (str, optional): Default collection name
+- `session` (str, optional): Session token for cross-connection reuse
+
+**Returns:**
+```json
+{
+ "ok": true,
+ "session": "abc123",
+ "defaults": {"collection": "codebase"},
+ "applied": "connection"
+}
+```
+
## Error Handling
All API methods follow consistent error handling patterns:
diff --git a/docs/MEMORY_GUIDE.md b/docs/MEMORY_GUIDE.md
new file mode 100644
index 00000000..c0904335
--- /dev/null
+++ b/docs/MEMORY_GUIDE.md
@@ -0,0 +1,171 @@
+# Memory Usage Guide
+
+Best practices for using Context Engine's memory system effectively.
+
+**Documentation:** [README](../README.md) · [Configuration](CONFIGURATION.md) · [IDE Clients](IDE_CLIENTS.md) · [MCP API](MCP_API.md) · [ctx CLI](CTX_CLI.md) · [Memory Guide](MEMORY_GUIDE.md) · [Architecture](ARCHITECTURE.md) · [Multi-Repo](MULTI_REPO_COLLECTIONS.md) · [Kubernetes](../deploy/kubernetes/README.md) · [VS Code Extension](vscode-extension.md) · [Troubleshooting](TROUBLESHOOTING.md) · [Development](DEVELOPMENT.md)
+
+---
+
+**On this page:**
+- [When to Use Memories vs Code Search](#when-to-use-memories-vs-code-search)
+- [Recommended Metadata Schema](#recommended-metadata-schema)
+- [Example Operations](#example-operations)
+- [Memory Blending](#memory-blending)
+- [Collection Naming](#collection-naming)
+
+---
+
+## When to Use Memories vs Code Search
+
+| Use Memories For | Use Code Search For |
+|------------------|---------------------|
+| Conventions, runbooks, decisions | APIs, functions, classes |
+| Links, known issues, FAQs | Configuration files |
+| "How we do X here" notes | Cross-file relationships |
+| Team wiki-style content | Anything you'd grep for |
+
+**Blend both** for tasks like "how to run E2E tests" where instructions (memory) reference scripts in the repo (code).
+
+---
+
+## Recommended Metadata Schema
+
+Memory entries are stored as points in Qdrant with a consistent payload:
+
+| Key | Type | Description |
+|-----|------|-------------|
+| `kind` | string | **Required.** Always "memory" to enable filtering/blending |
+| `topic` | string | Short category (e.g., "dev-env", "release-process") |
+| `tags` | list[str] | Searchable tags (e.g., ["qdrant", "indexing", "prod"]) |
+| `source` | string | Origin (e.g., "chat", "manual", "tool", "issue-123") |
+| `author` | string | Who added it (username or email) |
+| `created_at` | string | ISO8601 timestamp (UTC) |
+| `expires_at` | string | ISO8601 timestamp if memory should be pruned later |
+| `repo` | string | Optional repo identifier for shared instances |
+| `link` | string | Optional URL to docs, tickets, or dashboards |
+| `priority` | float | 0.0-1.0 weight for ranking when blending |
+
+**Tips:**
+- Keep values small (short strings, small lists)
+- Put details in the `information` text, not payload
+- Use lowercase snake_case keys
+- For secrets/PII: store references or vault paths, never plaintext
+
+---
+
+## Example Operations
+
+### Store a Memory
+
+Via MCP Memory server tool `store`:
+
+```json
+{
+ "information": "Run full reset: INDEX_MICRO_CHUNKS=1 MAX_MICRO_CHUNKS_PER_FILE=200 make reset-dev",
+ "metadata": {
+ "kind": "memory",
+ "topic": "dev-env",
+ "tags": ["make", "reset"],
+ "source": "chat"
+ }
+}
+```
+
+### Find Memories
+
+Via MCP Memory server tool `find`:
+
+```json
+{
+ "query": "reset-dev",
+ "limit": 5
+}
+```
+
+### Blend Memories into Code Search
+
+Via Indexer MCP `context_search`:
+
+```json
+{
+ "query": "async file watcher",
+ "include_memories": true,
+ "limit": 5,
+ "include_snippet": true
+}
+```
+
+---
+
+## Query Tips
+
+- Use precise queries (2-5 tokens)
+- Add synonyms if needed; the server supports multiple phrasings
+- Combine `topic`/`tags` in your memory text to make them easier to find
+
+---
+
+## Enable Memory Blending
+
+1. Ensure the Memory MCP is running on :8000 (default in compose)
+
+2. Enable SSE memory blending on the Indexer MCP by setting these env vars:
+
+```yaml
+services:
+ mcp_indexer:
+ environment:
+ - MEMORY_SSE_ENABLED=true
+ - MEMORY_MCP_URL=http://mcp:8000/sse
+ - MEMORY_MCP_TIMEOUT=6
+```
+
+3. Restart the indexer:
+
+```bash
+docker compose up -d mcp_indexer
+```
+
+4. Validate with `context_search`:
+
+```json
+{
+ "query": "your test memory text",
+ "include_memories": true,
+ "limit": 5
+}
+```
+
+Expected: non-zero results with blended items; memory hits will have `metadata.kind = "memory"`.
+
+---
+
+## Collection Naming Strategies
+
+Different hash lengths for different workspace types:
+
+**Local Workspaces:** `repo-name-8charhash`
+- Example: `Anesidara-e8d0f5fc`
+- Used by local indexer/watcher
+- Assumes unique repo names within workspace
+
+**Remote Uploads:** `folder-name-16charhash-8charhash`
+- Example: `testupload2-04e680d5939dd035-b8b8d4cc`
+- Collision avoidance for duplicate folder names
+- 16-char hash identifies workspace, 8-char hash identifies collection
+
+---
+
+## Operational Notes
+
+- Collection name comes from `COLLECTION_NAME` (see .env)
+- This stack defaults to a single collection for both code and memories
+- Filtering uses `metadata.kind` to distinguish memory from code
+- Consider pruning expired memories by filtering `expires_at < now`
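+
+The `expires_at` pruning maps to a Qdrant delete-by-filter call. A minimal sketch of the request body for `POST /collections/{collection}/points/delete`, assuming `metadata.expires_at` has a datetime payload index (the timestamp below is a placeholder; adjust field paths to your schema):
+
+```json
+{
+  "filter": {
+    "must": [
+      { "key": "metadata.kind", "match": { "value": "memory" } },
+      { "key": "metadata.expires_at", "range": { "lt": "2025-06-01T00:00:00Z" } }
+    ]
+  }
+}
+```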
+
+---
+
+## Backup and Migration
+
+For production-grade backup/migration strategies, see the official Qdrant documentation for snapshots and export/import. For local development, rely on Docker volumes and reindexing when needed.
+
diff --git a/docs/MULTI_REPO_COLLECTIONS.md b/docs/MULTI_REPO_COLLECTIONS.md
index e43a5d60..991d2cac 100644
--- a/docs/MULTI_REPO_COLLECTIONS.md
+++ b/docs/MULTI_REPO_COLLECTIONS.md
@@ -1,5 +1,18 @@
# Multi-Repository Collection Architecture
+**Documentation:** [README](../README.md) · [Configuration](CONFIGURATION.md) · [IDE Clients](IDE_CLIENTS.md) · [MCP API](MCP_API.md) · [ctx CLI](CTX_CLI.md) · [Memory Guide](MEMORY_GUIDE.md) · [Architecture](ARCHITECTURE.md) · [Multi-Repo](MULTI_REPO_COLLECTIONS.md) · [Kubernetes](../deploy/kubernetes/README.md) · [VS Code Extension](vscode-extension.md) · [Troubleshooting](TROUBLESHOOTING.md) · [Development](DEVELOPMENT.md)
+
+---
+
+**On this page:**
+- [Overview](#overview)
+- [Architecture Principles](#architecture-principles)
+- [Indexing Multiple Repositories](#indexing-multiple-repositories)
+- [Filtering by Repository](#filtering-by-repository)
+- [Remote Deployment](#remote-deployment)
+
+---
+
## Overview
Context Engine supports first-class multi-repository operation through a unified collection architecture. This enables:
diff --git a/docs/TROUBLESHOOTING.md b/docs/TROUBLESHOOTING.md
new file mode 100644
index 00000000..34913d72
--- /dev/null
+++ b/docs/TROUBLESHOOTING.md
@@ -0,0 +1,161 @@
+# Troubleshooting Guide
+
+Common issues and solutions for Context Engine.
+
+**Documentation:** [README](../README.md) · [Configuration](CONFIGURATION.md) · [IDE Clients](IDE_CLIENTS.md) · [MCP API](MCP_API.md) · [ctx CLI](CTX_CLI.md) · [Memory Guide](MEMORY_GUIDE.md) · [Architecture](ARCHITECTURE.md) · [Multi-Repo](MULTI_REPO_COLLECTIONS.md) · [Kubernetes](../deploy/kubernetes/README.md) · [VS Code Extension](vscode-extension.md) · [Troubleshooting](TROUBLESHOOTING.md) · [Development](DEVELOPMENT.md)
+
+---
+
+**On this page:**
+- [Collection Health & Cache Sync](#collection-health--cache-sync)
+- [Common Issues](#common-issues)
+- [Connectivity Issues](#connectivity-issues)
+- [Verify Endpoints](#verify-endpoints)
+- [Expected HTTP Behaviors](#expected-http-behaviors)
+- [Operational Safeguards](#operational-safeguards)
+- [Debug Logging](#debug-logging)
+- [Getting Help](#getting-help)
+
+---
+
+## Collection Health & Cache Sync
+
+The stack includes automatic health checks that detect and fix cache/collection sync issues.
+
+### Check collection health
+```bash
+python scripts/collection_health.py --workspace . --collection codebase
+```
+
+### Auto-heal cache issues
+```bash
+python scripts/collection_health.py --workspace . --collection codebase --auto-heal
+```
+
+### What it detects
+- Empty collection with cached files (cache thinks files are indexed but they're not)
+- Significant mismatch between cached files and actual collection contents
+- Missing metadata in collection points
+
+### When to use
+- After manually deleting collections
+- If searches return no results despite indexing
+- After Qdrant crashes or data loss
+- When switching between collection names
+
+### Automatic healing
+- Health checks run automatically on watcher and indexer startup
+- Cache is cleared when sync issues are detected
+- Files are reindexed on next run
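
Roughly, the mismatch check boils down to comparing what the cache believes is indexed against what the collection actually holds. This is a hypothetical sketch for illustration, not the code in `scripts/collection_health.py`:

```python
def detect_cache_mismatch(cached_files: int, collection_points: int,
                          tolerance: float = 0.5) -> bool:
    """Return True when the local cache and the collection disagree badly."""
    if cached_files > 0 and collection_points == 0:
        # Cache thinks files are indexed, but the collection is empty.
        return True
    if cached_files == 0:
        return False
    # Flag a significant drift between cached files and stored points.
    drift = abs(cached_files - collection_points) / cached_files
    return drift > tolerance

# On a detected mismatch, the cache is cleared so files reindex on the next run.
```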
+
+---
+
+## Common Issues
+
+### Tree-sitter not found or parser errors
+Tree-sitter parsing is optional. If you set `USE_TREE_SITTER=1` and see parser errors, unset it or install the tree-sitter dependencies, then reindex.
+
+### Tokenizer missing for micro-chunks
+Run `make tokenizer` or set `TOKENIZER_JSON` to a valid tokenizer.json. Without a tokenizer, the indexer falls back to line-based chunking.
+
+### SSE "Invalid session ID" when POSTing /messages directly
+Expected if you didn't initiate an SSE session first. Use an MCP client (e.g., mcp-remote) to handle the handshake.
+
+### llama.cpp platform warning on Apple Silicon
+Prefer the native path (`scripts/gpu_toggle.sh gpu`). If you stick with Docker, add `platform: linux/amd64` to the service or ignore the warning during local dev.
+
+### Indexing feels stuck on very large files
+Cap per-file chunk counts with `MAX_MICRO_CHUNKS_PER_FILE=200` during dev runs.
+
+### Watcher timeouts (-9) or Qdrant "ResponseHandlingException: timed out"
+Set watcher-safe defaults to reduce payload size and add headroom during upserts:
+
+```ini
+QDRANT_TIMEOUT=60
+MAX_MICRO_CHUNKS_PER_FILE=200
+INDEX_UPSERT_BATCH=128
+INDEX_UPSERT_RETRIES=5
+INDEX_UPSERT_BACKOFF=0.5
+WATCH_DEBOUNCE_SECS=1.5
+```
+
+If issues persist, try lowering `INDEX_UPSERT_BATCH` to 96 or raising `QDRANT_TIMEOUT` to 90.
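
To show how these knobs interact, here is an illustrative retry loop (not the indexer's actual implementation): batches of `INDEX_UPSERT_BATCH` points are retried up to `INDEX_UPSERT_RETRIES` times, with exponential backoff starting at `INDEX_UPSERT_BACKOFF` seconds.

```python
import os
import time

BATCH = int(os.getenv("INDEX_UPSERT_BATCH", "128"))
RETRIES = int(os.getenv("INDEX_UPSERT_RETRIES", "5"))
BACKOFF = float(os.getenv("INDEX_UPSERT_BACKOFF", "0.5"))

def upsert_with_retries(points, send):
    """Send points in batches, retrying each batch with exponential backoff."""
    for start in range(0, len(points), BATCH):
        batch = points[start:start + BATCH]
        for attempt in range(RETRIES):
            try:
                send(batch)  # e.g. a Qdrant upsert call
                break
            except TimeoutError:
                if attempt == RETRIES - 1:
                    raise  # retries exhausted; surface the timeout
                time.sleep(BACKOFF * (2 ** attempt))
```

Smaller batches mean smaller payloads per request (less likely to hit `QDRANT_TIMEOUT`), at the cost of more round trips.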
+
+---
+
+## Connectivity Issues
+
+### MCP servers can't reach Qdrant
+Confirm both containers are up: `make ps`.
+
+### SSE port collides
+Change `FASTMCP_PORT` in `.env` and the mapped port in `docker-compose.yml`.
+
+### Searches return no results
+Check collection health (see above).
+
+### Tool descriptions out of date
+Restart: `make restart`.
+
+---
+
+## Verify Endpoints
+
+```bash
+# Qdrant DB
+curl -sSf http://localhost:6333/readyz >/dev/null && echo "Qdrant OK"
+
+# Decoder (llama.cpp sidecar)
+curl -s http://localhost:8080/health
+
+# SSE endpoints (Memory, Indexer)
+curl -sI http://localhost:8000/sse | head -n1
+curl -sI http://localhost:8001/sse | head -n1
+
+# RMCP endpoints (HTTP JSON-RPC)
+curl -sI http://localhost:8002/mcp | head -n1
+curl -sI http://localhost:8003/mcp | head -n1
+```
+
+---
+
+## Expected HTTP Behaviors
+
+- **GET /mcp returns 400**: normal; the RMCP endpoint accepts only POSTed JSON-RPC
+- **SSE requires a session handshake**: a raw POST to `/messages` without one will error (expected)
+
+---
+
+## Operational Safeguards
+
+| Setting | Purpose | Default |
+|---------|---------|---------|
+| TOKENIZER_JSON | Tokenizer for micro-chunking | models/tokenizer.json |
+| MAX_MICRO_CHUNKS_PER_FILE | Prevent runaway chunk counts | 2000 |
+| QDRANT_TIMEOUT | HTTP timeout for MCP Qdrant calls | 20s |
+| MEMORY_AUTODETECT | Auto-detect memory collection | 1 |
+| MEMORY_COLLECTION_TTL_SECS | Cache TTL for collection detection | 300s |
+
+**Schema repair:** `ensure_collection` now repairs missing named vectors (lex, mini when REFRAG_MODE=1) on existing collections.
+
+---
+
+## Debug Logging
+
+Enable debug environment variables for detailed logging:
+
+```bash
+export DEBUG_CONTEXT_ANSWER=1
+export HYBRID_DEBUG=1
+export CACHE_DEBUG=1
+
+# Restart services so they pick up the new values
+docker compose restart
+```
+
+---
+
+## Getting Help
+
+1. Check this troubleshooting guide
+2. Review logs: `docker compose logs mcp_indexer`
+3. Verify health: `make health`
+4. Check Qdrant status: `make qdrant-status`
+
diff --git a/docs/vscode-extension.md b/docs/vscode-extension.md
index ec86cc16..b83cf772 100644
--- a/docs/vscode-extension.md
+++ b/docs/vscode-extension.md
@@ -1,8 +1,29 @@
-Context Engine Uploader VS Code Extension
-=========================================
+# VS Code Extension
-Build Prerequisites
--------------------
+Context Engine Uploader extension for automatic workspace sync and Prompt+ integration.
+
+**Documentation:** [README](../README.md) · [Configuration](CONFIGURATION.md) · [IDE Clients](IDE_CLIENTS.md) · [MCP API](MCP_API.md) · [ctx CLI](CTX_CLI.md) · [Memory Guide](MEMORY_GUIDE.md) · [Architecture](ARCHITECTURE.md) · [Multi-Repo](MULTI_REPO_COLLECTIONS.md) · [Kubernetes](../deploy/kubernetes/README.md) · [VS Code Extension](vscode-extension.md) · [Troubleshooting](TROUBLESHOOTING.md) · [Development](DEVELOPMENT.md)
+
+---
+
+**On this page:**
+- [Features](#features)
+- [Installation](#installation)
+- [Configuration](#configuration)
+- [Commands](#commands-and-lifecycle)
+
+---
+
+## Features
+
+- **Auto-sync**: Force sync on startup + watch mode keeps your workspace indexed
+- **Prompt+ button**: Status bar button to enhance selected text with unicorn mode
+- **Output channel**: Real-time logs for force-sync and watch operations
+- **GPU decoder support**: Configure llama.cpp, Ollama, or GLM as decoder backend
+
+## Installation
+
+### Build Prerequisites
- Node.js 18+ and npm
- Python 3 available on PATH for runtime testing
- VS Code Extension Manager `vsce` (`npm install -g @vscode/vsce`) or run via `npx`