Context-Engine is a plug-and-play MCP retrieval stack that unifies code indexing, hybrid search, and optional llama.cpp decoding so product teams can ship context-aware agents in minutes, not weeks.
Key differentiators
- One-command bring-up delivers dual SSE/RMCP endpoints, seeded Qdrant, and live watch/reindex loops for fast local validation.
- ReFRAG-inspired micro-chunking, token budgeting, and gate-first filtering surface precise spans while keeping prompts lean.
- Shared memory/indexer schema and reranker tooling make it easy to mix dense, lexical, and semantic signals without bespoke glue code.
- NEW: Performance optimizations including connection pooling, intelligent caching, request deduplication, and async subprocess management that cut redundant calls and smooth spikes under load.
- Operational playbooks (prune, warm, health, cache) plus rich tests give teams confidence to take the stack from laptop to production.
Built for
- AI platform and IDE tooling teams that need an MCP-compliant context layer without rebuilding indexing, embeddings, or retrieval heuristics.
- DevEx and documentation groups standing up internal assistants that must ingest large or fast-changing codebases with minimal babysitting.
Solves
- Slow agent onboarding caused by fractured infra—ship a consistent stack for memory, search, and decoding under one config.
- Context drift in monorepos—automatic micro-chunking and watcher-driven reindexing keep embeddings aligned with reality.
- Fragmented client compatibility—serve both legacy SSE and modern HTTP RMCP clients from the same deployment.
- NEW: Performance relief via intelligent caching, connection pooling, and async I/O patterns that eliminate redundant processing.
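The caching and deduplication internals aren't shown in this README; as a rough illustration of the request-deduplication idea, here is a minimal async sketch (the `deduplicated` helper and `fetch` callable are hypothetical, not part of the stack's API):

```python
# Hypothetical sketch: coalesce concurrent identical requests onto one task
# so a burst of the same query hits the backend only once.
import asyncio

_inflight: dict[str, asyncio.Task] = {}

async def deduplicated(key: str, fetch):
    """Return the in-flight task for `key`, creating it on first call."""
    task = _inflight.get(key)
    if task is None:
        task = asyncio.create_task(fetch())
        _inflight[key] = task
        task.add_done_callback(lambda _t: _inflight.pop(key, None))
    return await task

async def main():
    calls = 0

    async def fetch():
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.1)  # stand-in for a search/Qdrant round trip
        return "result"

    # Two concurrent identical requests share one backend call.
    await asyncio.gather(deduplicated("q:caching", fetch),
                         deduplicated("q:caching", fetch))
    print(calls)  # 1

asyncio.run(main())
```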
This gets you from zero to “search works” in under five minutes.
- Prereqs
- Docker + Docker Compose
- make (optional but recommended)
- Node/npm if you want to use mcp-remote (optional)
- One command (recommended):

```bash
# Provisions tokenizer.json, downloads a tiny llama.cpp model, reindexes, and brings all services up
INDEX_MICRO_CHUNKS=1 MAX_MICRO_CHUNKS_PER_FILE=200 make reset-dev-dual
```

- Or index a specific host path directly:

```bash
# Provisions the context-engine for rapid development
HOST_INDEX_PATH=. COLLECTION_NAME=codebase docker compose run --rm indexer --root /work --recreate --no-skip-unchanged
```

- Default ports: Memory MCP :8000 (:8002 RMCP), Indexer MCP :8001 (:8003 RMCP), Qdrant :6333, llama.cpp :8080
Seamless Setup Note:
- The stack uses a single unified `codebase` collection by default; all your code goes into one collection for seamless cross-repo search
- No per-workspace fragmentation; search across everything at once
- Health checks auto-detect and fix cache/collection sync issues
- Just run `make reset-dev-dual` on any machine and it works™
- Legacy SSE only (default):
  - Ports: 8000 (/sse), 8001 (/sse)
  - Command: `INDEX_MICRO_CHUNKS=1 MAX_MICRO_CHUNKS_PER_FILE=200 make reset-dev`
- RMCP (Codex) only:
  - Ports: 8002 (/mcp), 8003 (/mcp)
  - Command: `INDEX_MICRO_CHUNKS=1 MAX_MICRO_CHUNKS_PER_FILE=200 make reset-dev-codex`
- Dual compatibility (SSE + RMCP together):
  - Ports: 8000/8001 (/sse) and 8002/8003 (/mcp)
  - Command: `INDEX_MICRO_CHUNKS=1 MAX_MICRO_CHUNKS_PER_FILE=200 make reset-dev-dual`
Default Setup:
- The repository includes `.env.example` with sensible defaults for local development
- On first run, copy it to `.env`: `cp .env.example .env`
- The `make reset-dev*` targets will use your `.env` settings automatically
Key Configuration Files:
- `.env` — your local environment variables (gitignored, safe to customize)
- `.env.example` — template with documented defaults (committed to repo)
- `docker-compose.yml` — service definitions that read from `.env`
Recommended Customizations:

- Enable micro-chunking (better retrieval quality):

```bash
INDEX_MICRO_CHUNKS=1
MAX_MICRO_CHUNKS_PER_FILE=200
```

- Enable the decoder for Q&A (context_answer tool):

```bash
REFRAG_DECODER=1         # Enable decoder (default: 1)
REFRAG_RUNTIME=llamacpp  # Use llama.cpp (default) or glm
```

- GPU acceleration (Apple Silicon Metal):

```bash
# Option A: Use the toggle script (recommended)
scripts/gpu_toggle.sh gpu
scripts/gpu_toggle.sh start

# Option B: Manual .env settings
USE_GPU_DECODER=1
LLAMACPP_URL=http://host.docker.internal:8081
LLAMACPP_GPU_LAYERS=32   # or -1 for all layers
```

- Alternative: GLM API (instead of local llama.cpp):

```bash
REFRAG_RUNTIME=glm
GLM_API_KEY=your-api-key-here
GLM_MODEL=glm-4.6   # Optional, defaults to glm-4.6
```

- Collection name (unified by default):

```bash
COLLECTION_NAME=codebase  # Default: single unified collection for all code
# Only change this if you need isolated collections per project
```
After changing .env:
- Restart services: `docker compose restart mcp_indexer mcp_indexer_http`
- For indexing changes: `make reindex` or `make reindex-hard`
- For decoder changes: `docker compose up -d --force-recreate llamacpp` (or restart the native server)
- Default tiny model: Granite 4.0 Micro (Q4_K_M GGUF)
- Change the model by overriding Make vars (downloads to ./models/model.gguf):

```bash
LLAMACPP_MODEL_URL="https://huggingface.co/ORG/MODEL/resolve/main/model.gguf" \
INDEX_MICRO_CHUNKS=1 MAX_MICRO_CHUNKS_PER_FILE=200 make reset-dev-dual
```

- Want GPU acceleration? Set `LLAMACPP_USE_GPU=1` (optionally `LLAMACPP_GPU_LAYERS=-1`) in your `.env` before `docker compose up`, or simply run `scripts/gpu_toggle.sh gpu` (described below) to flip the switch for you.
- Embeddings: set `EMBEDDING_MODEL` in `.env` and reindex (`make reindex`)
Decoder env toggles (set in .env and managed automatically by scripts/gpu_toggle.sh):
| Variable | Description | Typical values |
|---|---|---|
| `USE_GPU_DECODER` | Feature flag for native Metal decoder | `0` (docker), `1` (native) |
| `LLAMACPP_URL` | Decoder endpoint containers should use | `http://llamacpp:8080` or `http://host.docker.internal:8081` |
| `LLAMACPP_GPU_LAYERS` | Number of layers to offload to GPU (`-1` = all) | `0`, `32`, `-1` |
Alternative (compose only):

```bash
HOST_INDEX_PATH="$(pwd)" FASTMCP_INDEXER_PORT=8001 docker compose up -d qdrant mcp mcp_indexer indexer watcher
```

- Bring the stack up with the reset target that matches your client (`make reset-dev`, `make reset-dev-codex`, or `make reset-dev-dual`).
- When you need a clean ingest (after large edits or when the `qdrant_status` tool / `make qdrant-status` reports zero points), run `make reindex-hard`. This clears `.codebase/cache.json` before recreating the collection so unchanged files cannot be skipped.
- Confirm collection health with `make qdrant-status` (calls the MCP router to print counts and timestamps).
- Iterate using search helpers such as `make hybrid ARGS="--query 'async file watcher'"` or invoke the MCP tools directly from your client.
On Apple Silicon you can run the llama.cpp decoder natively with Metal while keeping the rest of the stack in Docker:
- Install the Metal-enabled llama.cpp binary (e.g. `brew install llama.cpp`).
- Flip to GPU mode and start the native server:

```bash
scripts/gpu_toggle.sh gpu
scripts/gpu_toggle.sh start   # launches llama-server on localhost:8081
docker compose up -d --force-recreate mcp_indexer mcp_indexer_http
docker compose stop llamacpp  # optional once the native server is healthy
```

  The toggle updates `.env` to point at `http://host.docker.internal:8081` so containers reach the host process.
- Run `scripts/gpu_toggle.sh status` to confirm the native server is healthy. All MCP `context_answer` calls will now use the Metal-backed decoder.

Want the original dockerised decoder (CPU-only or x86 GPU fallback)? Swap back with:

```bash
scripts/gpu_toggle.sh docker
docker compose up -d --force-recreate mcp_indexer mcp_indexer_http llamacpp
```

This re-enables the llamacpp container and resets `.env` to `http://llamacpp:8080`.
- reset-dev: SSE stack on 8000/8001; seeds Qdrant, downloads tokenizer + tiny llama.cpp model, reindexes, brings up memory + indexer + watcher
- reset-dev-codex: RMCP stack on 8002/8003; same seeding + bring-up for Codex/Qodo
- reset-dev-dual: SSE + RMCP together (8000/8001 and 8002/8003)
- up / down / logs / ps: Docker Compose lifecycle helpers
- index / reindex / reindex-hard: Index current repo; `reindex` recreates the collection; `reindex-hard` also clears the local cache so unchanged files are re-uploaded
- index-here / index-path: Index an arbitrary host path without cloning into this repo
- watch: Watch-and-reindex on file changes
- warm / health: Warm caches and run health checks
- hybrid / rerank: Example hybrid search + reranker helper
- setup-reranker / rerank-local / quantize-reranker: Manage ONNX reranker assets and local runs
- prune / prune-path: Remove stale points (missing files or hash mismatch)
- llama-model / tokenizer: Fetch tiny GGUF model and tokenizer.json
- qdrant-status / qdrant-list / qdrant-prune / qdrant-index-root: Convenience wrappers that route through the MCP bridge to inspect or maintain collections
A thin CLI that retrieves code context and rewrites your input into a better, context-aware prompt using the local LLM decoder. Works with both questions and commands/instructions. By default it prints ONLY the improved prompt.
Examples:

```bash
# Questions: Enhanced with specific details and multiple aspects
scripts/ctx.py "What is ReFRAG?"
# Output: Two detailed question paragraphs with file/line references

# Commands: Enhanced with concrete targets and implementation details
scripts/ctx.py "Refactor ctx.py"
# Output: Two detailed instruction paragraphs with specific steps

# Unicorn mode: staged 2–3 pass enhancement for best results
scripts/ctx.py "Refactor ctx.py" --unicorn

# Via Make target (default improved prompt only)
make ctx Q="Explain the caching logic to me in detail"

# Filter by language/path or adjust tokens
make ctx Q="Hybrid search details" ARGS="--language python --under scripts/ --limit 2 --rewrite-max-tokens 200"
```

Include compact code snippets in the retrieved context for richer rewrites (trades a bit of speed for quality):

```bash
# Enable detail mode (adds short snippets) - works with questions
scripts/ctx.py "Explain the caching logic" --detail

# Detail mode with commands - gets more specific implementation details
scripts/ctx.py "Add error handling to ctx.py" --detail

# Adjust snippet size if needed (default is 1 line when --detail is used)
make ctx Q="Explain hybrid search" ARGS="--detail --context-lines 2"
```

Notes:
- Default behavior is header-only (fastest); `--detail` adds short snippets.
- If `--detail` is set and `--context-lines` remains at its default (0), ctx.py automatically uses 1 line to keep snippets concise. Override with `--context-lines N`.
- Detail mode is optimized for speed: it automatically clamps to at most 4 results and 1 result per file.
Use `--unicorn` for the highest-quality prompt enhancement with a staged 2–3 pass approach:

```bash
# Unicorn mode with commands - produces exceptional, detailed instructions
scripts/ctx.py "refactor ctx.py" --unicorn

# Unicorn mode with questions - produces highly intelligent, multi-faceted questions
scripts/ctx.py "what is ReFRAG and how does it work?" --unicorn

# Works with all filters
scripts/ctx.py "add error handling" --unicorn --language python
```

How it works:
Unicorn mode uses multiple LLM passes with progressively richer code context:
- Pass 1 (Draft): Retrieves rich code snippets (8 lines of context per match) to understand the codebase and sharpen the intent
- Pass 2 (Refine): Retrieves even richer snippets (12 lines of context) based on the draft to ground the prompt with concrete code behaviors
- Pass 3 (Polish): Optional cleanup pass that runs only if the output appears generic or incomplete
Key features:
- Code-grounded: References actual code behaviors and patterns from your codebase, not file paths or line numbers
- No hallucinations: Only uses real code from your indexed repository - never invents references
- Multi-paragraph output: Produces detailed, comprehensive prompts that explore multiple aspects
- Works with both questions and commands: Enhances any type of prompt
When to use:
- Normal mode: Quick, everyday prompts (fastest)
- --detail: Richer context without multi-pass overhead (balanced)
- --unicorn: When you need the absolute best prompt quality (highest quality)
All modes stream tokens as they arrive for instant feedback:

```bash
# Streaming is enabled by default - see output appear immediately
scripts/ctx.py "refactor ctx.py" --unicorn
```

To disable streaming (wait for the full response), set `"streaming": false` in `~/.ctx_config.json`.
Automatically falls back to context_search with memories when repo search returns no hits:

```bash
# If no code matches, ctx.py will search design docs and ADRs
scripts/ctx.py "What is our authentication strategy?"
```

This ensures you get relevant context even when the query doesn't match code directly.
Automatically adjusts limit and context_lines based on query characteristics:
- Short/vague queries → More context for richer grounding
- Queries with file/function names → Lighter settings for speed
```bash
# Short query → auto-increases context
scripts/ctx.py "caching"

# Specific query → optimized for speed
scripts/ctx.py "refactor fetch_context function in ctx.py"
```

The enhanced `_needs_polish()` heuristic automatically triggers a third polish pass when:
- Output is too short (< 180 chars)
- Contains generic/vague language
- Missing concrete code references
- Lacks proper paragraph structure
This happens transparently in `--unicorn` mode; no user action is needed. A sketch of such a heuristic follows.
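As a rough illustration only (the real heuristic lives in scripts/ctx.py and may differ), the four criteria above could be encoded like this:

```python
import re

def needs_polish(output: str) -> bool:
    """Sketch of the polish trigger described above (assumed thresholds)."""
    if len(output) < 180:                        # too short
        return True
    vague = ("as needed", "appropriately", "various things", "and so on")
    if any(p in output.lower() for p in vague):  # generic/vague language
        return True
    # no concrete code references (calls, .py paths, or path separators)
    if not re.search(r"[A-Za-z_]\w*\(\)|\.py\b|/", output):
        return True
    if "\n\n" not in output:                     # lacks paragraph structure
        return True
    return False

print(needs_polish("Refactor the code as needed."))  # True
```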
Create `~/.ctx_config.json` to customize prompt enhancement behavior:

```json
{
  "always_include_tests": true,
  "prefer_bullet_commands": false,
  "extra_instructions": "Always consider error handling and edge cases",
  "streaming": true
}
```

Available preferences:
- `always_include_tests`: Add testing considerations to all prompts
- `prefer_bullet_commands`: Format commands as bullet points
- `extra_instructions`: Custom instructions added to every rewrite
- `streaming`: Enable/disable streaming output (default: true)

See ctx_config.example.json for a template.
GPU Acceleration (Apple Silicon): For faster prompt rewriting, use the native Metal-accelerated decoder:
```bash
# 1. Set USE_GPU_DECODER=1 in your .env file (already set by default)
# 2. Start the native llama.cpp server with Metal GPU
scripts/gpu_toggle.sh start

# Now ctx.py will automatically use the GPU decoder on port 8081
make ctx Q="Explain the caching logic to me in detail"

# Stop the native GPU server
scripts/gpu_toggle.sh stop

# To use the Docker decoder instead, set USE_GPU_DECODER=0 in .env and restart:
docker compose up -d llamacpp
```

Notes:
- Defaults to the Indexer HTTP RMCP endpoint at http://localhost:8003/mcp (override with MCP_INDEXER_URL)
- Decoder endpoint: automatically detects GPU mode via the USE_GPU_DECODER env var (set by gpu_toggle.sh)
  - Docker decoder (default): http://localhost:8080/completion
  - GPU decoder (after gpu_toggle.sh gpu): http://localhost:8081/completion
- See also: `make ctx`
You can index any local folder by mounting it at /work. Three easy ways:
- Make target: index a specific path

```bash
make index-path REPO_PATH=/abs/path/to/other/repo [RECREATE=1] [REPO_NAME=name] [COLLECTION=name]
```

  - RECREATE=1 drops and recreates the collection before indexing
  - Defaults: REPO_NAME and COLLECTION fall back to the folder name

- Make target: index the current working directory

```bash
cd /abs/path/to/other/repo
make -C /Users/user/Desktop/Context-Engine index-here [RECREATE=1] [REPO_NAME=name] [COLLECTION=name]
```

- Raw docker compose (one-shot ingest without Make)

```bash
docker compose run --rm \
  -v /abs/path/to/other/repo:/work \
  indexer --root /work [--recreate]
```

Notes:
- No need to bind-mount this repository; the images bake /app/scripts and set WORK_ROOTS="/work,/app" so utilities import correctly.
- MCP clients can connect to the running servers and operate on whichever folder is mounted at /work.
- Roo (SSE/RMCP): supports both SSE and RMCP connections; see config examples below
- Cline (SSE/RMCP): supports both SSE and RMCP connections; see config examples below
- Windsurf (SSE/RMCP): supports both SSE and RMCP connections; see config examples below
- Zed (SSE): uses mcp-remote bridge via command/args; see config below
- Kiro (SSE): uses mcp-remote bridge via command/args; see config below
- Qodo (RMCP): connects directly to HTTP endpoints; add each tool individually
- OpenAI Codex (RMCP): TOML config for memory/indexer URLs
- Augment (SSE): simple JSON configs for both servers
- AmpCode (SSE): simple URLs for both legacy SSE endpoints
- Claude Code CLI (SSE): simple JSON configs for both servers
- Verify endpoints:

```bash
# Qdrant DB
curl -sSf http://localhost:6333/readyz >/dev/null && echo "Qdrant OK"

# Decoder (llama.cpp sidecar)
curl -s http://localhost:8080/health

# SSE endpoints (Memory, Indexer)
curl -sI http://localhost:8000/sse | head -n1
curl -sI http://localhost:8001/sse | head -n1

# RMCP endpoints (HTTP JSON-RPC)
curl -sI http://localhost:8002/mcp | head -n1
curl -sI http://localhost:8003/mcp | head -n1
```

Core
- COLLECTION_NAME: Qdrant collection to use (defaults to repo name if unset in some flows)
- REPO_NAME: Logical name for the indexed repo; stored in payload for filtering
- HOST_INDEX_PATH: Absolute host path to index (mounted to /work in containers)
Indexing / micro-chunks
- INDEX_MICRO_CHUNKS: 1 to enable micro‑chunking; off falls back to line chunks
- MAX_MICRO_CHUNKS_PER_FILE: Cap micro‑chunks per file (e.g., 200 default)
- TOKENIZER_URL, TOKENIZER_PATH: Hugging Face tokenizer.json URL and local path
- USE_TREE_SITTER: 1 to enable tree-sitter parsing (optional; off by default)
Watcher
- WATCH_DEBOUNCE_SECS: Debounce between change events (e.g., 1.5)
- INDEX_UPSERT_BATCH / INDEX_UPSERT_RETRIES / INDEX_UPSERT_BACKOFF: Upsert tuning
- QDRANT_TIMEOUT: Request timeout in seconds for upserts/queries (e.g., 60–90)
- MCP_TOOL_TIMEOUT_SECS: Max duration for long-running MCP tools (index/prune); default 3600s
Reranker
- RERANKER_ONNX_PATH, RERANKER_TOKENIZER_PATH: Paths for local ONNX cross‑encoder
- RERANKER_ENABLED: 1/true to enable, 0/false to disable; default is enabled in server
- Timeouts/failures automatically fall back to hybrid results
Decoder (llama.cpp / GLM)
- REFRAG_DECODER: 1 to enable decoder for context_answer; 0 to disable (default: 1)
- REFRAG_RUNTIME: llamacpp or glm (default: llamacpp)
- LLAMACPP_URL: llama.cpp server endpoint (default: http://llamacpp:8080 or http://host.docker.internal:8081 for GPU)
- LLAMACPP_TIMEOUT_SEC: Decoder request timeout in seconds (default: 300)
- DECODER_MAX_TOKENS: Max tokens for decoder responses (default: 4000)
- REFRAG_DECODER_MODE: prompt or soft (default: prompt; soft requires patched llama.cpp)
- GLM_API_KEY: API key for GLM provider (required when REFRAG_RUNTIME=glm)
- GLM_MODEL: GLM model name (default: glm-4.6)
- USE_GPU_DECODER: 1 for native Metal decoder on host, 0 for Docker (managed by gpu_toggle.sh)
- LLAMACPP_GPU_LAYERS: Number of layers to offload to GPU, -1 for all (default: 32)
ReFRAG (micro-chunking and retrieval)
- REFRAG_MODE: 1 to enable micro-chunking and span budgeting (default: 1)
- REFRAG_GATE_FIRST: 1 to enable mini-vector gating before dense search (default: 1)
- REFRAG_CANDIDATES: Number of candidates for gate-first filtering (default: 200)
- MICRO_BUDGET_TOKENS: Global token budget for context_answer spans (default: 512)
- MICRO_OUT_MAX_SPANS: Max number of spans to return per query (default: 3)
Ports
- FASTMCP_PORT (SSE/RMCP): Override Memory MCP ports (defaults: 8000/8002)
- FASTMCP_INDEXER_PORT (SSE/RMCP): Override Indexer MCP ports (defaults: 8001/8003)
| Name | Description | Default |
|---|---|---|
| COLLECTION_NAME | Qdrant collection name (unified across all repos) | codebase |
| REPO_NAME | Logical repo tag stored in payload for filtering | auto-detect from git/folder |
| HOST_INDEX_PATH | Host path mounted at /work in containers | current repo (.) |
| QDRANT_URL | Qdrant base URL | container: http://qdrant:6333; local: http://localhost:6333 |
| INDEX_MICRO_CHUNKS | Enable token-based micro-chunking | 0 (off) |
| HYBRID_EXPAND | Enable heuristic multi-query expansion | 0 (off) |
| MAX_MICRO_CHUNKS_PER_FILE | Cap micro-chunks per file | 200 |
| TOKENIZER_URL | HF tokenizer.json URL (for Make download) | n/a (use Make target) |
| TOKENIZER_PATH | Local path where tokenizer is saved (Make) | models/tokenizer.json |
| TOKENIZER_JSON | Runtime path for tokenizer (indexer) | models/tokenizer.json |
| USE_TREE_SITTER | Enable tree-sitter parsing (py/js/ts) | 0 (off) |
| WATCH_DEBOUNCE_SECS | Debounce between FS events (watcher) | 1.5 |
| INDEX_UPSERT_BATCH | Upsert batch size (watcher) | 128 |
| INDEX_UPSERT_RETRIES | Retry count (watcher) | 5 |
| MCP_TOOL_TIMEOUT_SECS | Max duration for long-running MCP tools | 3600 |
| INDEX_UPSERT_BACKOFF | Seconds between retries (watcher) | 0.5 |
| QDRANT_TIMEOUT | HTTP timeout seconds | watcher: 60; search: 20 |
| RERANKER_ONNX_PATH | Local ONNX cross-encoder model path | unset (see make setup-reranker) |
| RERANKER_TOKENIZER_PATH | Tokenizer path for reranker | unset |
| RERANKER_ENABLED | Enable reranker by default | 1 (enabled) |
| FASTMCP_PORT | Memory MCP server port (SSE/RMCP) | 8000 (container-internal) |
| FASTMCP_INDEXER_PORT | Indexer MCP server port (SSE/RMCP) | 8001 (container-internal) |
| FASTMCP_HTTP_PORT | Memory RMCP host port mapping | 8002 |
| FASTMCP_INDEXER_HTTP_PORT | Indexer RMCP host port mapping | 8003 |
| FASTMCP_HEALTH_PORT | Health port (memory/indexer) | memory: 18000; indexer: 18001 |
| LLM_EXPAND_MAX | Max alternate queries generated via LLM | 0 |
| REFRAG_DECODER | Enable decoder for context_answer | 1 (enabled) |
| REFRAG_RUNTIME | Decoder backend: llamacpp or glm | llamacpp |
| LLAMACPP_URL | llama.cpp server endpoint | http://llamacpp:8080 or http://host.docker.internal:8081 |
| LLAMACPP_TIMEOUT_SEC | Decoder request timeout | 300 |
| DECODER_MAX_TOKENS | Max tokens for decoder responses | 4000 |
| GLM_API_KEY | API key for GLM provider | unset |
| GLM_MODEL | GLM model name | glm-4.6 |
| USE_GPU_DECODER | Native Metal decoder (1) vs Docker (0) | 0 (docker) |
| REFRAG_MODE | Enable micro-chunking and span budgeting | 1 (enabled) |
| REFRAG_GATE_FIRST | Enable mini-vector gating | 1 (enabled) |
| REFRAG_CANDIDATES | Candidates for gate-first filtering | 200 |
| MICRO_BUDGET_TOKENS | Token budget for context_answer | 512 |
Local (recommended)
- Python 3.11+
- Create venv and install deps:

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

- Run the full suite:

```bash
pytest -q
```

- Run a single file or test:

```bash
pytest tests/test_ingest_micro_chunks.py -q
pytest tests/test_php_support.py::test_imports -q
```

- Tips:
  - RERANKER_ENABLED=0 can speed up some tests locally; functionality is still validated via the hybrid fallback.
  - Some integration tests may start ephemeral containers via testcontainers; ensure Docker is running.

Inside Docker (optional, ad-hoc)
- You can run tests in the indexer image by overriding the entrypoint:

```bash
docker compose run --rm --entrypoint pytest mcp-indexer -q
```

Note: the provided dev images focus on runtime; a local venv is faster for iterative testing.
- Python, JavaScript/TypeScript, Go, Java, Rust, Shell, Terraform, PowerShell, YAML, C#, PHP
- Handles delete and move: removes/migrates points to avoid stale entries
- Live reloads ignore patterns: changes to .qdrantignore are applied without restart
- path_glob matches against relative paths (e.g., src/**/*.py), not absolute /work paths
- If upserts time out, lower INDEX_UPSERT_BATCH (e.g., 96) or raise QDRANT_TIMEOUT (e.g., 90)
- For very large files, reduce MAX_MICRO_CHUNKS_PER_FILE (e.g., 200) during dev
- GET /mcp may return 400 (normal): the RMCP endpoint is POST-only for JSON-RPC
- SSE requires a session handshake; raw POST /messages without it will error (expected)
```bash
curl -sSf http://localhost:6333/readyz >/dev/null && echo "Qdrant OK"
curl -sI http://localhost:8000/sse | head -n1
curl -sI http://localhost:8001/sse | head -n1
```

- Single command to index + search:

```bash
# Fresh index of your repo and a quick hybrid example
make reindex-hard
make qdrant-status
make hybrid ARGS="--query 'async file watcher' --limit 5 --include-snippet"
```

- Example MCP client configurations
Kiro (SSE):
Create .kiro/settings/mcp.json in your workspace:
```json
{
  "mcpServers": {
    "qdrant-indexer": { "command": "npx", "args": ["mcp-remote", "http://localhost:8001/sse", "--transport", "sse-only"] },
    "memory": { "command": "npx", "args": ["mcp-remote", "http://localhost:8000/sse", "--transport", "sse-only"] }
  }
}
```

Zed (SSE):
Add to your Zed settings.json (accessed via Command Palette → "Settings: Open Settings (JSON)"):

```json
{
  /// The name of your MCP server
  "qdrant-indexer": {
    /// The command which runs the MCP server
    "command": "npx",
    /// The arguments to pass to the MCP server
    "args": [
      "mcp-remote",
      "http://localhost:8001/sse",
      "--transport",
      "sse-only"
    ],
    /// The environment variables to set
    "env": {}
  }
}
```

Notes:
- Zed expects MCP servers at the root level of settings.json
- Uses command/args (stdio); mcp-remote bridges to remote SSE endpoints
- If npx prompts, add `-y` right after npx: `"command": "npx", "args": ["-y", "mcp-remote", ...]`
- Alternative: use a direct HTTP connection if mcp-remote has issues: `{ "qdrant-indexer": { "type": "http", "url": "http://localhost:8001/sse" } }`
- For Qodo (RMCP) clients, see "Qodo Integration (RMCP config)" below for the direct url-based snippet.
- Common troubleshooting
  - Tree-sitter not found or parser errors:
    - The feature is optional. If you set USE_TREE_SITTER=1 and see errors, unset it or install the tree-sitter deps, then reindex.
  - Tokenizer missing for micro-chunks:
    - Run `make tokenizer` or set TOKENIZER_JSON to a valid tokenizer.json; otherwise the indexer falls back to line-based chunking.
  - SSE "Invalid session ID" when POSTing /messages directly:
    - Expected if you didn't initiate an SSE session first. Use an MCP client (e.g., mcp-remote) to handle the handshake.
  - llama.cpp platform warning on Apple Silicon:
    - Prefer the native path above (`scripts/gpu_toggle.sh gpu`). If you stick with Docker, add `platform: linux/amd64` to the service or ignore the warning during local dev.
  - Indexing feels stuck on very large files:
    - Use MAX_MICRO_CHUNKS_PER_FILE=200 during dev runs.
  - Watcher timeouts (-9) or Qdrant "ResponseHandlingException: timed out":
    - Set watcher-safe defaults to reduce payload size and add headroom during upserts:

```bash
# Watcher-safe defaults (compose already applies these to the watcher service)
QDRANT_TIMEOUT=60
MAX_MICRO_CHUNKS_PER_FILE=200
INDEX_UPSERT_BATCH=128
INDEX_UPSERT_RETRIES=5
INDEX_UPSERT_BACKOFF=0.5
WATCH_DEBOUNCE_SECS=1.5
```

    - If issues persist, try lowering INDEX_UPSERT_BATCH to 96 or raising QDRANT_TIMEOUT to 90.
ReFRAG background: https://arxiv.org/abs/2509.01092
Endpoints
| Component | URL |
|---|---|
| Memory MCP | http://localhost:8000/sse |
| Indexer MCP | http://localhost:8001/sse |
| Qdrant DB | http://localhost:6333 |
- Memory HTTP (RMCP): http://localhost:8002/mcp
- Indexer HTTP (RMCP): http://localhost:8003/mcp
OpenAI Codex config (RMCP client):

```toml
experimental_use_rmcp_client = true

[mcp_servers.memory_http]
url = "http://127.0.0.1:8002/mcp"

[mcp_servers.qdrant_indexer_http]
url = "http://127.0.0.1:8003/mcp"
```

Add this to your workspace-level Kiro config at .kiro/settings/mcp.json (restart Kiro after saving):
```json
{
  "mcpServers": {
    "qdrant-indexer": { "command": "npx", "args": ["mcp-remote", "http://localhost:8001/sse", "--transport", "sse-only"] },
    "memory": { "command": "npx", "args": ["mcp-remote", "http://localhost:8000/sse", "--transport", "sse-only"] }
  }
}
```

Notes:
- Kiro expects command/args (stdio); `mcp-remote` bridges to remote SSE endpoints.
- If `npx` prompts in your environment, add `-y` right after `npx`.
- Workspace config overrides user-level config (`~/.kiro/settings/mcp.json`).

Troubleshooting:
- Error: "Enabled MCP Server must specify a command, ignoring."
  - Fix: Use the `command`/`args` form above; do not use `type: url` in Kiro.
- ImportError: `deps: No module named 'scripts'` when calling `memory_store` on the indexer MCP
  - Fix applied: the server now adds `/work` and `/app` to `sys.path`. Restart `mcp_indexer`.
Memory MCP (8000 SSE, 8002 RMCP):
- store(information, metadata?, collection?) — write a memory entry into the default collection (dual vectors: dense + lexical)
- find(query, limit=5, collection?, top_k?) — hybrid memory search over memory-like entries
Indexer/Search MCP (8001 SSE, 8003 RMCP):
- repo_search — hybrid code search (dense + lexical + optional reranker)
- context_search — search that can also blend memory results (include_memories)
- context_answer — natural-language Q&A with retrieval + local LLM (llama.cpp or GLM)
- code_search — alias of repo_search
- repo_search_compat — permissive wrapper that normalizes q/text/queries/top_k payloads
- context_answer_compat — permissive wrapper for context_answer with lenient argument handling
- expand_query(query, max_new?) — LLM-assisted query expansion (generates 1-2 alternates)
- qdrant_index_root — index /work (mounted repo root) with safe defaults
- qdrant_index(subdir?, recreate?, collection?) — index a subdir or recreate collection
- qdrant_prune — remove points for missing files or file_hash mismatch
- qdrant_list — list Qdrant collections
- qdrant_status — collection counts and recent ingestion timestamps
- workspace_info(workspace_path?) — read .codebase/state.json and resolve default collection
- list_workspaces(search_root?) — scan for multiple workspaces in multi-repo environments
- memory_store — convenience memory store from the indexer (uses default collection)
- search_tests_for — intent wrapper for test files
- search_config_for — intent wrapper for likely config files
- search_callers_for — intent wrapper for probable callers/usages
- search_importers_for — intent wrapper for files importing a module/symbol
- change_history_for_path(path) — summarize recent changes using stored metadata
- collection_map — return collection↔repo mappings
- default_collection — set the collection to use for the session
Notes:
- Most search tools accept filters like language, under, path_glob, kind, symbol, ext.
- Reranker enabled by default; timeouts fall back to hybrid results.
- context_answer requires decoder enabled (REFRAG_DECODER=1) with llama.cpp or GLM backend.
Add this to your Qodo MCP settings to target the RMCP (HTTP) endpoints:

```json
{
  "mcpServers": {
    "memory": { "url": "http://localhost:8002/mcp" },
    "qdrant-indexer": { "url": "http://localhost:8003/mcp" }
  }
}
```

Note: Qodo can talk to the RMCP endpoints directly, so no mcp-remote wrapper is required.
- Agents connect via MCP over SSE:
- Memory MCP: http://localhost:8000/sse
- Indexer MCP: http://localhost:8001/sse
- Both MCP servers talk to Qdrant inside Docker at http://qdrant:6333 (DB HTTP API)
- Supporting jobs (indexer, watcher, init_payload) write to/read from Qdrant directly
```mermaid
flowchart LR
  subgraph Host/IDE
    A[IDE Agents]
  end
  subgraph Docker Network
    B(Memory MCP :8000)
    C(MCP Indexer :8001)
    D[Qdrant DB :6333]
    G[[llama.cpp Decoder :8080]]
    E[(One-shot Indexer)]
    F[(Watcher)]
  end
  A -- SSE /sse --> B
  A -- SSE /sse --> C
  B -- HTTP 6333 --> D
  C -- HTTP 6333 --> D
  E -- HTTP 6333 --> D
  F -- HTTP 6333 --> D
  C -. HTTP 8080 .-> G
  classDef opt stroke-dasharray: 5 5
  class G opt
```
Start Qdrant, the Memory MCP (8000), the Indexer MCP (8001), and run a fresh index of your current repo:
```bash
HOST_INDEX_PATH="$(pwd)" FASTMCP_INDEXER_PORT=8001 docker compose up -d qdrant mcp mcp_indexer indexer watcher
```

Then wire your MCP-aware IDE/tooling to:
- Memory MCP: http://localhost:8000/sse
- Indexer MCP: http://localhost:8001/sse
Tip: add watcher to the command if you want live reindex-on-save.
- URL: http://localhost:8000/sse
- Tools: `store`, `find`
- Env (used by the indexer to blend memory): `MEMORY_SSE_ENABLED=true`, `MEMORY_MCP_URL=http://mcp:8000/sse`, `MEMORY_MCP_TIMEOUT=6`

IDE/Agent config (recommended):

```json
{
  "mcpServers": {
    "memory": { "type": "sse", "url": "http://localhost:8000/sse", "disabled": false },
    "qdrant-indexer": { "type": "sse", "url": "http://localhost:8001/sse", "disabled": false }
  }
}
```

Blended search:
- Use memories when the information isn’t in your repository or is transient/user-authored: conventions, runbooks, decisions, links, known issues, FAQs, “how we do X here”.
- Use code search for facts that live in the repo: APIs, functions/classes, configuration, and cross-file relationships.
- Blend both for tasks like “how to run E2E tests” where instructions (memory) reference scripts in the repo (code).
- Rule of thumb: if you’d write it in a team wiki or ticket comment, store it as a memory; if you’d grep for it, use code search.
We store memory entries as points in Qdrant with a small, consistent payload. Recommended keys:
- kind: "memory" (string) – required. Enables filtering and blending.
- topic: short category string (e.g., "dev-env", "release-process").
- tags: list of strings (e.g., ["qdrant", "indexing", "prod"]).
- source: where this came from (e.g., "chat", "manual", "tool", "issue-123").
- author: who added it (e.g., username or email).
- created_at: ISO8601 timestamp (UTC).
- expires_at: ISO8601 timestamp if this memory should be pruned later.
- repo: optional repo identifier if sharing a Qdrant instance across repos.
- link: optional URL to docs, tickets, or dashboards.
- priority: 0.0–1.0 weight that clients can use to bias ranking when blending.
Notes:
- Keep values small (short strings, small lists). Don't store large blobs in payload; put details in the `information` text.
- Use lowercase snake_case keys for consistency.
- For secrets/PII: do not store plaintext. Store references or vault paths instead.
Store a memory (via the MCP Memory server tool `store`; use your MCP client):

```json
{
  "information": "Run full reset: INDEX_MICRO_CHUNKS=1 MAX_MICRO_CHUNKS_PER_FILE=200 make reset-dev",
  "metadata": {
    "kind": "memory",
    "topic": "dev-env",
    "tags": ["make", "reset"],
    "source": "chat"
  }
}
```

Find memories (via the MCP Memory server tool `find`):

```json
{
  "query": "reset-dev",
  "limit": 5
}
```

Blend memories into code search (Indexer MCP `context_search`):

```json
{
  "query": "async file watcher",
  "include_memories": true,
  "limit": 5,
  "include_snippet": true
}
```

Tips:
- Use precise queries (2–5 tokens). Add a couple of synonyms if needed; the server supports multiple phrasings.
- Include `topic`/`tags` terms in your memory text to make them easier to find (they also live in payload for filtering).
For production-grade backup/migration strategies, see the official Qdrant documentation for snapshots and export/import. For local development, we recommend relying on Docker volumes and reindexing when needed.
Operational notes:
- The collection name comes from `COLLECTION_NAME` (see .env). This stack defaults to a single collection for both code and memories; filtering uses `metadata.kind`.
- If you switch to a dedicated memory collection, update the MCP Memory server and the Indexer's memory-blending env to point at it.
- Consider pruning expired memories by filtering `expires_at < now`.
- Call `context_search` on :8001 (SSE) or :8003 (RMCP) with `{ "include_memories": true }` to return both memory and code results.
Different hash lengths are used for different workspace types:
Local Workspaces: `repo-name-8charhash`
- Example: `Anesidara-e8d0f5fc`
- Used by the local indexer/watcher
- Assumes unique repo names within the workspace

Remote Uploads: `folder-name-16charhash-8charhash`
- Example: `testupload2-04e680d5939dd035-b8b8d4cc`
- Collision avoidance for duplicate folder names across different codebases
- The 16-char hash identifies the workspace; the 8-char hash identifies the collection
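For intuition only, a local collection name of the form `repo-name-8charhash` could be derived roughly like this (the actual hash input and algorithm are not documented here and are assumptions):

```python
import hashlib

def local_collection_name(repo_name: str, workspace_path: str) -> str:
    """Sketch: repo name plus the first 8 hex chars of a workspace hash."""
    digest = hashlib.sha256(workspace_path.encode("utf-8")).hexdigest()
    return f"{repo_name}-{digest[:8]}"

print(local_collection_name("Anesidara", "/Users/me/Anesidara"))
# e.g. Anesidara-3f6c0d9a (hash value depends on the input path)
```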
- Ensure the Memory MCP is running on :8000 (default in compose).
- Enable SSE memory blending on the Indexer MCP by setting these env vars for the mcp_indexer service (docker-compose.yml):

```yaml
services:
  mcp_indexer:
    environment:
      - MEMORY_SSE_ENABLED=true
      - MEMORY_MCP_URL=http://mcp:8000/sse
      - MEMORY_MCP_TIMEOUT=6
```

- Restart the indexer service: `docker compose up -d mcp_indexer`
- Validate by calling context_search with include_memories=true for a query that matches a stored memory:

```json
{
  "query": "your test memory text",
  "include_memories": true,
  "limit": 5
}
```

Expected: non-zero results with blended items; memory hits will have memory-like payloads (e.g., metadata.kind = "memory").
- Idempotent + incremental indexing out of the box:
  - Skips unchanged files automatically using a file content hash stored in payload (metadata.file_hash); see the sketch after this section
  - De-duplicates per-file points by deleting prior entries for the same path before insert
  - Payload indexes are auto-created on first run (metadata.language, metadata.path_prefix, metadata.repo, metadata.kind, metadata.symbol, metadata.symbol_path, metadata.imports, metadata.calls)
- Commands:
  - Full rebuild: `make reindex`
  - Fast incremental: `make index` (skips unchanged files)
  - Health check: `make health` (verifies collection vector name/dim, HNSW, and filtered queries with kind/symbol)
  - Hybrid search: `make hybrid` (dense + lexical bump with RRF)
- Bootstrap all services + index + checks: `make bootstrap`
- Discover commands: `make help` lists all targets and descriptions
- Ingest Git history: `make history` (messages + file lists)
  - If the repo has no local commits yet, the history ingester will shallow-fetch from the remote (default: origin) and use its HEAD. Configure with `--remote` and `--fetch-depth`.
- Local reranker (ONNX): `make rerank-local` (set RERANKER_ONNX_PATH and RERANKER_TOKENIZER_PATH)
- Setup ONNX reranker quickly: `make setup-reranker ONNX_URL=... TOKENIZER_URL=...` (updates .env paths)
- Enable Tree-sitter parsing (more accurate symbols/scopes): set `USE_TREE_SITTER=1` in `.env`, then reindex
- Flags (advanced):
  - Disable de-duplication: `docker compose run --rm indexer --root /work --no-dedupe`
  - Disable unchanged skipping: `docker compose run --rm indexer --root /work --no-skip-unchanged`

Notes:
- The named vector remains aligned with the MCP server (fast-bge-base-en-v1.5). If you change EMBEDDING_MODEL, run `make reindex` to recreate the collection.
- For very large repos, consider running `make index` on a schedule (or pre-commit) to keep Qdrant warm without full reingestion.
The stack uses a single unified codebase collection by default, making multi-repo search seamless:
Index another repo into the same collection:
```bash
# From your qdrant directory
make index-here HOST_INDEX_PATH=/path/to/other/repo REPO_NAME=other-repo

# Or with full control:
HOST_INDEX_PATH=/path/to/other/repo \
COLLECTION_NAME=codebase \
REPO_NAME=other-repo \
docker compose run --rm indexer --root /work
```

What happens:
- Files from the other repo get indexed into the unified `codebase` collection
- Each file is tagged with `metadata.repo = "other-repo"` for filtering
- Search across all repos by default, or filter by a specific repo

Search examples:

```bash
# Search across all indexed repos
make hybrid QUERY="authentication logic"

# Filter by specific repo
python scripts/hybrid_search.py \
  --query "authentication logic" \
  --repo other-repo

# Filter by repo + language
python scripts/hybrid_search.py \
  --query "authentication logic" \
  --repo other-repo \
  --language python
```

Benefits:
- One collection = unified search across all your code
- No fragmentation or collection-management overhead
- Filter by repo when you need isolation
- All repos share the same vector space for better semantic search
- Run a fused query with several phrasings and metadata-aware boosts: `make rerank`
- Customize:
  - Add more `--query` flags
  - Prefer language: `--language python`
  - Prefer under path: `--under /work/scripts`
- Reindex changed files on save (runs until Ctrl+C): `make watch`
- Collection creation is tuned for higher recall: `m=16`, `ef_construct=256`.
- If you change embeddings, run `make reindex` to recreate the collection with the tuned HNSW settings.
- Preload the embedding model and warm Qdrant's HNSW search path to reduce first-query latency and improve recall: `make warm`

Or, since this stack already exposes SSE, you can configure the client to use http://localhost:8000/sse directly (recommended for Cursor/Windsurf).
Most MCP clients let you pass structured tool arguments. The Indexer/search MCP supports applying server-side filters in repo_search/context_search when these keys are present:
- `language`: value matches `metadata.language`
- `path_prefix`: value matches `metadata.path_prefix` (e.g., `/work/src`)
- `kind`: value matches `metadata.kind` (e.g., `function`, `class`, `method`)

Tip: combine multiple query phrasings and apply these filters for best precision on large codebases.
We added a dockerized indexer that chunks code, embeds with BAAI/bge-base-en-v1.5, and stores metadata (path, path_prefix, language, start_line, end_line, code) in Qdrant. This boosts recall and relevance for the MCP tools.
```bash
# Index current workspace (does not drop data)
make index

# Full reindex (drops existing points in the collection)
make reindex
```
### Companion MCP: Index/Prune/List (Option B)
A second MCP server runs alongside the search MCP and exposes tools:
- qdrant-list: list collections
- qdrant-index: index the mounted path (/work or subdir)
- qdrant-prune: prune stale points for the mounted path
Configuration
- FASTMCP_INDEXER_PORT (default 8001)
- HOST_INDEX_PATH bind-mounts the target repo into /work (read-only)
Add to your agent as a separate MCP endpoint (SSE):
- URL: http://localhost:8001/sse
Example calls (semantics vary by client):
- qdrant-index with args {"subdir":"scripts","recreate":true}
### MCP client configuration examples

Roo (SSE/RMCP):

```json
{
  "mcpServers": {
    "memory": { "type": "sse", "url": "http://localhost:8000/sse", "disabled": false },
    "qdrant-indexer": { "type": "sse", "url": "http://localhost:8001/sse", "disabled": false }
  }
}
```

Cline (SSE/RMCP):

```json
{
  "mcpServers": {
    "memory": { "type": "sse", "url": "http://localhost:8000/sse", "disabled": false },
    "qdrant-indexer": { "type": "sse", "url": "http://localhost:8001/sse", "disabled": false }
  }
}
```

Windsurf (SSE/RMCP):

```json
{
  "mcpServers": {
    "memory": { "type": "sse", "url": "http://localhost:8000/sse", "disabled": false },
    "qdrant-indexer": { "type": "sse", "url": "http://localhost:8001/sse", "disabled": false }
  }
}
```

Windsurf/Cursor (stdio for search + SSE for indexer):

```json
{
  "mcpServers": {
    "qdrant": {
      "command": "uvx",
      "args": ["mcp-server-qdrant"],
      "env": {
        "QDRANT_URL": "http://localhost:6333",
        "COLLECTION_NAME": "my-collection",
        "EMBEDDING_MODEL": "BAAI/bge-base-en-v1.5"
      },
      "disabled": false
    }
  }
}
```

Augment (SSE for both servers – recommended):

```json
{
  "mcpServers": {
    "memory": { "type": "sse", "url": "http://localhost:8000/sse", "disabled": false },
    "qdrant-indexer": { "type": "sse", "url": "http://localhost:8001/sse", "disabled": false }
  }
}
```

Qodo (RMCP; add each tool individually):
Note: In Qodo, you must add each MCP tool separately through the UI, not as a single JSON config.
For each tool, use this format:

Tool 1 - memory:

```json
{
  "memory": { "url": "http://localhost:8002/mcp" }
}
```

Tool 2 - qdrant-indexer:

```json
{
  "qdrant-indexer": { "url": "http://localhost:8003/mcp" }
}
```

- Do not send null values to MCP tools. Omit the field or pass an empty string "" instead.
- qdrant-index examples:
  - `{"subdir":"","recreate":false,"collection":"my-collection","repo_name":"workspace"}`
  - `{"subdir":"scripts","recreate":true}`
- For indexing the repo root with no params, use the zero-arg tool `qdrant_index_root` (new) or call `qdrant-index` with `subdir: ""`.
- repo_search: run code search without filters or config.
  - Structured fields supported (parity with DSL): language, under, kind, symbol, ext, not_, case, path_regex, path_glob, not_glob
  - Response shaping: compact (bool) returns only path/start_line/end_line
  - Smart default: compact=true when query is an array with multiple queries (unless explicitly set)
  - If include_snippet is true, compact is forced off so snippet fields are returned
  - Glob fields accept a single string or an array; you can also pass a comma-separated string, which will be split
  - Query parsing: accepts query or queries; JSON arrays, JSON-stringified arrays, and comma-separated strings; also supports q/text aliases
  - Parity note: path_glob/not_glob list handling works in both modes — in-process and subprocess — with OR semantics for path_glob and reject-on-any for not_glob.
  - Examples:
    - `{"query": "semantic chunking"}`
    - `{"query": ["function to split code", "overlapping chunks"], "limit": 15, "per_path": 3}`
    - `{"query": "watcher debounce", "language": "python", "under": "scripts/", "include_snippet": true, "context_lines": 2}`
    - `{"query": "parser", "ext": "ts", "path_regex": "/services/.+", "compact": true}`
    - `{"query": "adapter", "path_glob": ["/src/", "/pkg/"], "not_glob": "/tests/"}`
  - Returns structured results: score, path, symbol, start_line, end_line, and optional snippet; or the compact form.
- code_search: alias of repo_search (same args) for easier discovery in some clients.
- qdrant_status: return collection size and last index times (safe, read-only).
  - Example: `{"collection": "my-collection"}`
Verification:
- You should see tools from both servers (e.g., `store`, `find`, `repo_search`, `code_search`, `context_search`, `qdrant_list`, `qdrant_index`, `qdrant_prune`, `qdrant_status`).
- Call `qdrant_list` to confirm Qdrant connectivity.
- Call `qdrant_index` with args like `{ "subdir": "scripts", "recreate": true }` to (re)index the mounted repo.
- Call `context_search` with `{ "include_memories": true }` to blend memory+code (requires enabling MEMORY_SSE_ENABLED on the indexer service).
- Call `qdrant_list` with no args.
- Call `qdrant_prune` with no args.
Notes:
- The indexer reads env from `.env` (QDRANT_URL, COLLECTION_NAME, EMBEDDING_MODEL).
- Default chunking: ~120 lines with 20-line overlap.
- Skips typical build/venv directories.
- Populates `metadata.kind`, `metadata.symbol`, and `metadata.symbol_path` for Python/JS/TS/Go/Java/Rust/Terraform (best-effort), per chunk.
- Uses the same collection as the MCP server.
- The indexer supports a `.qdrantignore` file at the repo root (similar to `.gitignore`). Use it to exclude directories/files from indexing.
- Sensible defaults are excluded automatically (overridable): `/models`, `/node_modules`, `/dist`, `/build`, `/.venv`, `/venv`, `/__pycache__`, `/.git`, and files matching `*.onnx`, `*.bin`, `*.safetensors`, `tokenizer.json`, `*.whl`, `*.tar.gz`.
- Override via env or flags:
  - Env: `QDRANT_DEFAULT_EXCLUDES=0` to disable defaults; `QDRANT_IGNORE_FILE=.myignore`; `QDRANT_EXCLUDES='tokenizer.json,*.onnx,/third_party'`
  - CLI examples: `docker compose run --rm indexer --root /work --ignore-file .qdrantignore`; `docker compose run --rm indexer --root /work --no-default-excludes --exclude '/vendor' --exclude '*.bin'`
- Chunking and batching are tunable via env or flags:
  - `INDEX_CHUNK_LINES` (default 120), `INDEX_CHUNK_OVERLAP` (default 20)
  - `INDEX_BATCH_SIZE` (default 64)
  - `INDEX_PROGRESS_EVERY` (default 200 files; 0 disables)

If files were deleted or significantly changed outside the indexer, remove stale points safely: `make prune`

- CLI equivalents: `--chunk-lines`, `--chunk-overlap`, `--batch-size`, `--progress-every`.
- Recommendations:
  - Small repos (<100 files): chunk 80–120, overlap 16–24, batch-size 32–64
  - Medium (100s–1k files): chunk 120–160, overlap ~20, batch-size 64–128
  - Large monorepos (1k+): start with defaults; consider `INDEX_PROGRESS_EVERY=200` for visibility and `INDEX_BATCH_SIZE=128` if RAM allows
ReFRAG-lite is enabled in this repo and can be toggled via env. It provides:
- Token-level micro-chunking at ingest (tiny k-token windows with stride)
- Compact vector gating and optional gate-first candidate restriction
- Span compaction and a global token budget at search time
Enable and tune:

```bash
# Enable compressed retrieval with micro-chunks
REFRAG_MODE=1
INDEX_MICRO_CHUNKS=1

# Micro windowing
MICRO_CHUNK_TOKENS=16
MICRO_CHUNK_STRIDE=8

# Output shaping and budget
MICRO_OUT_MAX_SPANS=3
MICRO_MERGE_LINES=4
MICRO_BUDGET_TOKENS=512
MICRO_TOKENS_PER_LINE=32

# Optional: gate-first using mini vectors to prefilter dense search
REFRAG_GATE_FIRST=0
REFRAG_CANDIDATES=200
```

Reindex after changing chunking:

```bash
# Recreate collection (safe for local dev)
docker compose exec mcp_indexer python -c "from scripts.mcp_indexer_server import qdrant_index_root; qdrant_index_root(recreate=True)"
```

What results look like (context_search / code_search return shape):

```json
{
  "score": 0.9234,
  "path": "scripts/ingest_code.py",
  "start_line": 120,
  "end_line": 148,
  "span_budgeted": true,
  "budget_tokens_used": 224,
  "components": { "dense": 0.78, "lex": 0.35, "mini": 0.81 },
  "why": ["dense", "mini"]
}
```

Notes:
- span_budgeted=true indicates adjacent micro hits were merged and counted toward the global token budget.
- Tune MICRO_* to control prompt footprint. Increase MICRO_MERGE_LINES to merge looser spans; reduce MICRO_OUT_MAX_SPANS for more file diversity.
- Gate-first reduces dense search compute on large collections; keep off for tiny repos.
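For intuition, the k-token window/stride scheme configured above (MICRO_CHUNK_TOKENS, MICRO_CHUNK_STRIDE, capped by MAX_MICRO_CHUNKS_PER_FILE) behaves like this sketch; the real ingester uses the HF tokenizer.json rather than whitespace splitting:

```python
def micro_chunks(tokens: list[str], k: int = 16, stride: int = 8,
                 max_chunks: int = 200) -> list[list[str]]:
    """Overlapping k-token windows with the given stride, capped per file."""
    out: list[list[str]] = []
    for start in range(0, max(1, len(tokens) - k + 1), stride):
        out.append(tokens[start:start + k])
        if len(out) >= max_chunks:  # MAX_MICRO_CHUNKS_PER_FILE-style cap
            break
    return out

tokens = "def hybrid_search ( query , limit ) : return fuse ( dense , lex )".split()
for window in micro_chunks(tokens, k=6, stride=3):
    print(window)
```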
This stack ships a feature-flagged decoder integration path via a llama.cpp sidecar. It is production-safe by default (off) and can run in a fallback “prompt” mode that uses a compressed textual context. A future “soft” mode will inject projected chunk embeddings into a patched llama.cpp server.
```mermaid
flowchart LR
  %% Retrieval side
  Q[Query] --> R[Hybrid search + span budgeting]
  R --> S[Selected micro-spans]

  %% Projection (φ) and modes
  S -->|project via φ| P[(Soft embeddings)]
  S -. prompt compress .-> C[Compressed prompt]

  %% Decoder service
  subgraph Decoder
    G[[llama.cpp :8080]]
  end

  %% Mode routing
  P -->|soft mode| G
  C -->|prompt mode| G

  %% Output
  G --> O[Completion]

  %% Notes
  classDef opt stroke-dasharray: 5 5
  class C opt
```
Enable (safe default is off):

```bash
REFRAG_DECODER=1
REFRAG_RUNTIME=llamacpp
LLAMACPP_URL=http://llamacpp:8080
REFRAG_DECODER_MODE=prompt   # prompt|soft (soft requires patched llama.cpp)
REFRAG_ENCODER_MODEL=BAAI/bge-base-en-v1.5
REFRAG_PHI_PATH=/work/models/refrag_phi_768_to_dmodel.json
```

Bring up the llama.cpp sidecar (optional):

```bash
docker compose up -d llamacpp
```

Make-based provisioning (recommended):

```bash
# downloads a tiny GGUF to ./models/model.gguf (override URL via LLAMACPP_MODEL_URL)
make llamacpp-up

# or just fetch the model without starting the service
make llama-model
```

Optional: bake the model into the image (no host volume required):

```bash
# builds an image that includes the model specified by MODEL_URL
make llamacpp-build-image LLAMACPP_MODEL_URL=https://huggingface.co/.../tiny.gguf

# then in docker-compose.yml, either remove the ./models volume for llamacpp
# or override the service to use image: context-llamacpp:tiny
```

Programmatic use:

```python
from scripts.refrag_llamacpp import LlamaCppRefragClient

c = LlamaCppRefragClient()  # uses LLAMACPP_URL
text = c.generate_with_soft_embeddings("Question: ...\n", soft_embeddings=None, max_tokens=128)
```

Notes:
- φ file format: JSON 2D array with shape (d_in, d_model). See scripts/refrag_phi.py. Set REFRAG_PHI_PATH to your JSON file.
- In prompt mode, the client calls /completion on the llama.cpp server with a compressed prompt.
- In soft mode, the client requires a patched server that accepts soft embeddings; the flag ensures nothing breaks when it is absent.
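To make the φ step concrete, here is a sketch of loading such a JSON matrix and projecting span embeddings into the decoder space (scripts/refrag_phi.py is the authoritative implementation; the 2048-dim decoder width below is a stand-in assumption, while the (d_in, d_model) format follows the note above):

```python
import json
import numpy as np

def load_phi(path: str) -> np.ndarray:
    """Load phi as a (d_in, d_model) float32 matrix from the JSON 2D array."""
    with open(path) as f:
        return np.asarray(json.load(f), dtype=np.float32)

def project(span_embeddings: np.ndarray, phi: np.ndarray) -> np.ndarray:
    """Map (n, d_in) retrieval embeddings to (n, d_model) soft embeddings."""
    return span_embeddings @ phi

# Stand-ins: 768-dim bge-base embeddings, hypothetical 2048-dim decoder width
phi = np.zeros((768, 2048), dtype=np.float32)
spans = np.zeros((3, 768), dtype=np.float32)
print(project(spans, phi).shape)  # (3, 2048)
```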
Instead of running llama.cpp locally, you can use the GLM API (ZhipuAI) as your decoder backend:
Setup:

```bash
# In .env
REFRAG_DECODER=1
REFRAG_RUNTIME=glm       # Switch from llamacpp to glm
GLM_API_KEY=your-api-key # Required
GLM_MODEL=glm-4.6        # Optional, defaults to glm-4.6
```

How it works:
- Uses the OpenAI SDK with `base_url="https://api.z.ai/api/paas/v4/"`
- Supports prompt mode only (soft embeddings are ignored)
- Handles GLM-4.6's reasoning mode (`reasoning_content` field)
- Drop-in replacement for llama.cpp: same interface, no code changes needed

Switch back to llama.cpp:

```bash
REFRAG_RUNTIME=llamacpp
```

The GLM provider is implemented in scripts/refrag_glm.py and automatically selected when REFRAG_RUNTIME=glm.
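Under the hood this amounts to an OpenAI-compatible chat call against the GLM base URL; a minimal sketch (scripts/refrag_glm.py is the real implementation, including its prompt handling):

```python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["GLM_API_KEY"],
    base_url="https://api.z.ai/api/paas/v4/",  # GLM endpoint from above
)

resp = client.chat.completions.create(
    model=os.environ.get("GLM_MODEL", "glm-4.6"),
    messages=[{"role": "user", "content": "Summarize: hybrid search fuses dense and lexical scores."}],
    max_tokens=128,
)
msg = resp.choices[0].message
# GLM-4.6 may surface chain-of-thought in a separate reasoning_content field.
print(getattr(msg, "reasoning_content", None))
print(msg.content)
```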
The context_answer MCP tool answers natural-language questions using retrieval + a decoder sidecar.
- Inputs (most relevant): `query`, `limit`, `per_path`, `budget_tokens`, `include_snippet`, `collection`, `language`, `path_glob`/`not_glob`
- Outputs:
  - `answer` (string)
  - `citations`: `[ { path, start_line, end_line, container_path? }, ... ]`
  - `query`: list of query strings actually used
  - `used`: `{ "gate_first": true|false, "refrag": true|false }`

Pipeline
- Hybrid search (gate-first): uses MINI-vector gating when `REFRAG_GATE_FIRST=1` to prefilter candidates, then runs dense+lexical fusion
- Micro-span budgeting: merges adjacent micro hits and applies a global token budget (`REFRAG_MODE=1`, `MICRO_BUDGET_TOKENS`, `MICRO_OUT_MAX_SPANS`)
- Prompt assembly: builds compact context blocks and a "Sources" footer
- Decoder call: when `REFRAG_DECODER=1`, calls the configured runtime (`REFRAG_RUNTIME=llamacpp` or `glm`) to synthesize the final answer
- Return: answer + citations + usage flags; errors keep citations for debugging

Environment toggles
- Retrieval: `REFRAG_MODE=1`, `REFRAG_GATE_FIRST=1`, `REFRAG_CANDIDATES=200`
- Budgeting/output: `MICRO_BUDGET_TOKENS`, `MICRO_OUT_MAX_SPANS`
- Decoder: `REFRAG_DECODER=1`, `LLAMACPP_URL=http://localhost:8080`

Fallbacks and safety
- If gate-first yields 0 items and no strict language filter is set, the tool automatically retries without gating
- If the decoder call fails, the response contains `{ "error": "..." }` plus `citations`, so you can still inspect sources
Quick health + example

```bash
# Decoder health (llama.cpp sidecar)
curl -s http://localhost:8080/health

# Qdrant
curl -sSf http://localhost:6333/readyz >/dev/null && echo "Qdrant OK"
```

```python
# Minimal local call (uses the running MCP indexer server code)
import os, asyncio

os.environ.update(
    QDRANT_URL="http://localhost:6333",
    COLLECTION_NAME="my-collection",
    REFRAG_MODE="1", REFRAG_GATE_FIRST="1",
    REFRAG_DECODER="1", LLAMACPP_URL="http://localhost:8080",
)

from scripts import mcp_indexer_server as srv

async def t():
    out = await srv.context_answer(query="How does hybrid search work?", limit=5)
    print(out["used"], len(out.get("citations", [])), len(out.get("answer", "")))

asyncio.run(t())
```

Implementation
- See `scripts/mcp_indexer_server.py` (the `context_answer` tool) for the full pipeline, env knobs, and debug flags (`DEBUG_CONTEXT_ANSWER=1`).
- The indexer creates payload indexes for efficient filtering.
- When querying (via an MCP client or scripts), you can filter by:
  - `metadata.language` (e.g., python, typescript, javascript, go, rust)
  - `metadata.path_prefix` (e.g., `/work/src`)
  - `metadata.kind` (e.g., function, class, method)
- Example: with the provided reranker script you can do:

```bash
make rerank ARGS="--language python --under /work/scripts"
```
### Operational safeguards and troubleshooting
- Tokenizer for micro-chunking: set TOKENIZER_JSON to a valid tokenizer.json path (default: models/tokenizer.json). If missing, the indexer falls back to line-based chunking.
- Cap micro-chunks per file: MAX_MICRO_CHUNKS_PER_FILE (default 2000) to prevent runaway chunk counts on very large files.
- Qdrant client timeout: QDRANT_TIMEOUT (seconds, default 20) applies to all MCP Qdrant calls.
- Memory auto-detect caching: MEMORY_AUTODETECT=1 by default with MEMORY_COLLECTION_TTL_SECS (default 300s) to avoid repeatedly sampling all collections.
- Schema repair: ensure_collection now repairs missing named vectors (lex, and mini when REFRAG_MODE=1) on existing collections.
- A direct Qdrant filter example is shown below; most MCP clients allow passing tool args that map to server-side filters. If your client supports adding structured args to `qdrant-find`, prefer these filters to reduce noise.

We create payload indexes to accelerate filtered searches:
- `metadata.language` (keyword)
- `metadata.path_prefix` (keyword)
- `metadata.repo` (keyword)
- `metadata.kind` (keyword)
- `metadata.symbol` (keyword)
- `metadata.symbol_path` (keyword)
- `metadata.imports` (keyword)
- `metadata.calls` (keyword)
- `metadata.file_hash` (keyword)
- `metadata.ingested_at` (keyword)
- Git history fields available in payload: `commit_id`, `author_name`, `authored_date`, `message`, `files`
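If you ever need the low-level equivalent outside the indexer, creating keyword payload indexes with qdrant-client looks roughly like this (the collection name is assumed to be the default `codebase`):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# Keyword indexes matching the fields listed above (subset shown)
for field in ("metadata.language", "metadata.path_prefix", "metadata.repo",
              "metadata.kind", "metadata.symbol", "metadata.symbol_path"):
    client.create_payload_index(
        collection_name="codebase",
        field_name=field,
        field_schema=models.PayloadSchemaType.KEYWORD,
    )
```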
Payload indexes enable fast server-side filters (e.g., language, path_prefix, kind, symbol). Prefer using the MCP tools repo_search/context_search with filter arguments rather than raw Qdrant REST/Python snippets. See the Qdrant documentation if you need low-level API examples.
- Use precise intent + language: "python chunking function for Qdrant indexing"
- Add path hints when you know the area: "under scripts or ingestion code"
- Try 2–3 alternative phrasings (multi-query) and pick the consensus
- Prefer results where `metadata.language` matches your target file
- For navigation, prefer results where `metadata.path_prefix` matches your directory

Client tips:
- MCP tools: issue multiple finds with variant phrasings and re-rank by score + metadata match
- Direct Qdrant: use `vector={name: ..., vector: ...}` with the named vector above
- Data persists in the `qdrant_storage` Docker volume.
- The MCP server uses SSE transport and will auto-create the collection if it doesn't exist.
- Only FastEmbed models are supported at this time.
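A direct named-vector query, for reference; the vector name follows the fast-bge-base-en-v1.5 naming noted above, the 768-dim zero vector is a stand-in for a real embedded query, and the payload shape is assumed from the metadata.* fields:

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

hits = client.search(
    collection_name="codebase",
    query_vector=models.NamedVector(
        name="fast-bge-base-en-v1.5",
        vector=[0.0] * 768,  # stand-in: embed your query with the same model
    ),
    query_filter=models.Filter(must=[
        models.FieldCondition(key="metadata.language",
                              match=models.MatchValue(value="python")),
    ]),
    limit=5,
)
for h in hits:
    print(h.score, h.payload.get("metadata", {}).get("path"))
```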
The stack includes automatic health checks that detect and fix cache/collection sync issues:
Check collection health:

```bash
python scripts/collection_health.py --workspace . --collection codebase
```

Auto-heal cache issues:

```bash
python scripts/collection_health.py --workspace . --collection codebase --auto-heal
```

What it detects:
- Empty collection with cached files (cache thinks files are indexed but they're not)
- Significant mismatch between cached files and actual collection contents
- Missing metadata in collection points
When to use:
- After manually deleting collections
- If searches return no results despite indexing
- After Qdrant crashes or data loss
- When switching between collection names
Automatic healing:
- Health checks run automatically on watcher and indexer startup
- Cache is cleared when sync issues are detected
- Files are reindexed on next run
- If the MCP servers can't reach Qdrant, confirm both containers are up: `make ps`.
- If the SSE port collides, change `FASTMCP_PORT` in `.env` and the mapped port in `docker-compose.yml`.
- If you customize tool descriptions, restart: `make restart`.
- If searches return no results, check collection health (see above).