1,536 changes: 118 additions & 1,418 deletions README.md

Large diffs are not rendered by default.

4 changes: 4 additions & 0 deletions deploy/kubernetes/README.md
@@ -1,5 +1,9 @@
# Kubernetes Deployment Guide

**Documentation:** [README](../../README.md) · [Configuration](../../docs/CONFIGURATION.md) · [IDE Clients](../../docs/IDE_CLIENTS.md) · [MCP API](../../docs/MCP_API.md) · [ctx CLI](../../docs/CTX_CLI.md) · [Memory Guide](../../docs/MEMORY_GUIDE.md) · [Architecture](../../docs/ARCHITECTURE.md) · [Multi-Repo](../../docs/MULTI_REPO_COLLECTIONS.md) · Kubernetes · [VS Code Extension](../../docs/vscode-extension.md) · [Troubleshooting](../../docs/TROUBLESHOOTING.md) · [Development](../../docs/DEVELOPMENT.md)

---

## Overview

This directory contains Kubernetes manifests for deploying Context Engine on a remote cluster using **Kustomize**. This enables:
13 changes: 13 additions & 0 deletions docs/ARCHITECTURE.md
@@ -1,5 +1,18 @@
# Context Engine Architecture

**Documentation:** [README](../README.md) · [Configuration](CONFIGURATION.md) · [IDE Clients](IDE_CLIENTS.md) · [MCP API](MCP_API.md) · [ctx CLI](CTX_CLI.md) · [Memory Guide](MEMORY_GUIDE.md) · [Architecture](ARCHITECTURE.md) · [Multi-Repo](MULTI_REPO_COLLECTIONS.md) · [Kubernetes](../deploy/kubernetes/README.md) · [VS Code Extension](vscode-extension.md) · [Troubleshooting](TROUBLESHOOTING.md) · [Development](DEVELOPMENT.md)

---

**On this page:**
- [Overview](#overview)
- [Core Principles](#core-principles)
- [System Architecture](#system-architecture)
- [Data Flow](#data-flow)
- [ReFRAG Pipeline](#refrag-pipeline)

---

## Overview

Context Engine is a production-ready MCP (Model Context Protocol) retrieval stack that unifies code indexing, hybrid search, and optional LLM decoding. It enables teams to ship context-aware AI agents by providing sophisticated semantic and lexical search capabilities with dual-transport compatibility.
161 changes: 161 additions & 0 deletions docs/CONFIGURATION.md
@@ -0,0 +1,161 @@
# Configuration Reference

Complete environment variable reference for Context Engine.

**Documentation:** [README](../README.md) · [Configuration](CONFIGURATION.md) · [IDE Clients](IDE_CLIENTS.md) · [MCP API](MCP_API.md) · [ctx CLI](CTX_CLI.md) · [Memory Guide](MEMORY_GUIDE.md) · [Architecture](ARCHITECTURE.md) · [Multi-Repo](MULTI_REPO_COLLECTIONS.md) · [Kubernetes](../deploy/kubernetes/README.md) · [VS Code Extension](vscode-extension.md) · [Troubleshooting](TROUBLESHOOTING.md) · [Development](DEVELOPMENT.md)

---

**On this page:**
- [Core Settings](#core-settings)
- [Indexing & Micro-Chunks](#indexing--micro-chunks)
- [Watcher Settings](#watcher-settings)
- [Reranker](#reranker)
- [Decoder (llama.cpp / GLM)](#decoder-llamacpp--glm)
- [ReFRAG (Micro-Chunking & Retrieval)](#refrag-micro-chunking--retrieval)
- [Ports](#ports)
- [Search & Expansion](#search--expansion)
- [Memory Blending](#memory-blending)

---

## Core Settings

| Name | Description | Default |
|------|-------------|---------|
| COLLECTION_NAME | Qdrant collection name (unified across all repos) | codebase |
| REPO_NAME | Logical repo tag stored in payload for filtering | auto-detect from git/folder |
| HOST_INDEX_PATH | Host path mounted at /work in containers | current repo (.) |
| QDRANT_URL | Qdrant base URL | container: http://qdrant:6333; local: http://localhost:6333 |
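
A minimal sketch of exporting the core settings before a local run; the repo name and paths are illustrative, not defaults you must use:

```shell
# Illustrative core settings; adjust to your environment.
export COLLECTION_NAME=codebase          # unified Qdrant collection
export REPO_NAME=my-service              # hypothetical repo tag stored in payloads
export HOST_INDEX_PATH="$PWD"            # host path mounted at /work in containers
export QDRANT_URL=http://localhost:6333  # local (non-container) Qdrant endpoint
```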

## Indexing & Micro-Chunks

| Name | Description | Default |
|------|-------------|---------|
| INDEX_MICRO_CHUNKS | Enable token-based micro-chunking | 0 (off) |
| MAX_MICRO_CHUNKS_PER_FILE | Cap micro-chunks per file | 200 |
| TOKENIZER_URL | HF tokenizer.json URL (for Make download) | n/a |
| TOKENIZER_PATH | Local path where tokenizer is saved (Make) | models/tokenizer.json |
| TOKENIZER_JSON | Runtime path for tokenizer (indexer) | models/tokenizer.json |
| USE_TREE_SITTER | Enable tree-sitter parsing (py/js/ts) | 0 (off) |
| INDEX_CHUNK_LINES | Lines per chunk (non-micro mode) | 120 |
| INDEX_CHUNK_OVERLAP | Overlap lines between chunks | 20 |
| INDEX_BATCH_SIZE | Upsert batch size | 64 |
| INDEX_PROGRESS_EVERY | Log progress every N files | 200 |

## Watcher Settings

| Name | Description | Default |
|------|-------------|---------|
| WATCH_DEBOUNCE_SECS | Debounce between FS events | 1.5 |
| INDEX_UPSERT_BATCH | Upsert batch size (watcher) | 128 |
| INDEX_UPSERT_RETRIES | Retry count | 5 |
| INDEX_UPSERT_BACKOFF | Seconds between retries | 0.5 |
| QDRANT_TIMEOUT | HTTP timeout seconds | watcher: 60; search: 20 |
| MCP_TOOL_TIMEOUT_SECS | Max duration for long-running MCP tools | 3600 |
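
As one hypothetical tuning example (these values are not defaults), a slow or remote Qdrant instance can be accommodated by debouncing harder and batching larger:

```shell
# Hypothetical tuning for a slow or remote Qdrant instance.
export WATCH_DEBOUNCE_SECS=3      # coalesce bursts of filesystem events
export INDEX_UPSERT_BATCH=256     # larger batches, fewer round trips
export INDEX_UPSERT_RETRIES=8     # tolerate transient failures
export INDEX_UPSERT_BACKOFF=1.0   # seconds between retries
export QDRANT_TIMEOUT=120         # generous HTTP timeout for big upserts
```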

## Reranker

| Name | Description | Default |
|------|-------------|---------|
| RERANKER_ONNX_PATH | Local ONNX cross-encoder model path | unset |
| RERANKER_TOKENIZER_PATH | Tokenizer path for reranker | unset |
| RERANKER_ENABLED | Enable reranker by default | 1 (enabled) |

## Decoder (llama.cpp / GLM)

| Name | Description | Default |
|------|-------------|---------|
| REFRAG_DECODER | Enable decoder for context_answer | 1 (enabled) |
| REFRAG_RUNTIME | Decoder backend: llamacpp or glm | llamacpp |
| LLAMACPP_URL | llama.cpp server endpoint | http://llamacpp:8080 or http://host.docker.internal:8081 |
| LLAMACPP_TIMEOUT_SEC | Decoder request timeout | 300 |
| DECODER_MAX_TOKENS | Max tokens for decoder responses | 4000 |
| REFRAG_DECODER_MODE | prompt or soft (soft requires patched llama.cpp) | prompt |
| GLM_API_KEY | API key for GLM provider | unset |
| GLM_MODEL | GLM model name | glm-4.6 |
| USE_GPU_DECODER | Native Metal decoder (1) vs Docker (0) | 0 (docker) |
| LLAMACPP_GPU_LAYERS | Number of layers to offload to GPU, -1 for all | 32 |
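
For example, switching the `context_answer` decoder from the default llama.cpp backend to GLM is a matter of flipping the runtime and supplying credentials (the key below is a placeholder):

```shell
# Sketch: use the GLM provider instead of llama.cpp for context_answer.
export REFRAG_DECODER=1            # keep the decoder enabled
export REFRAG_RUNTIME=glm          # switch backend from llamacpp to glm
export GLM_API_KEY=sk-example-key  # placeholder, not a real key
export GLM_MODEL=glm-4.6           # default GLM model name
```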

## ReFRAG (Micro-Chunking & Retrieval)

| Name | Description | Default |
|------|-------------|---------|
| REFRAG_MODE | Enable micro-chunking and span budgeting | 1 (enabled) |
| REFRAG_GATE_FIRST | Enable mini-vector gating | 1 (enabled) |
| REFRAG_CANDIDATES | Candidates for gate-first filtering | 200 |
| MICRO_BUDGET_TOKENS | Token budget for context_answer | 512 |
| MICRO_OUT_MAX_SPANS | Max spans returned per query | 3 |
| MICRO_CHUNK_TOKENS | Tokens per micro-chunk window | 16 |
| MICRO_CHUNK_STRIDE | Stride between windows | 8 |
| MICRO_MERGE_LINES | Lines to merge adjacent spans | 4 |
| MICRO_TOKENS_PER_LINE | Estimated tokens per line | 32 |
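
With the defaults above, a sliding window of `MICRO_CHUNK_TOKENS=16` advanced by `MICRO_CHUNK_STRIDE=8` produces roughly `(tokens - window) / stride + 1` windows per file. A quick back-of-the-envelope check for a hypothetical ~800-token file:

```shell
# Estimate micro-chunk windows for a file of ~800 tokens (illustrative).
tokens=800
window=16   # MICRO_CHUNK_TOKENS
stride=8    # MICRO_CHUNK_STRIDE
windows=$(( (tokens - window) / stride + 1 ))
echo "$windows"   # 99
```

This is why `MAX_MICRO_CHUNKS_PER_FILE` (default 200) exists: large files would otherwise emit hundreds of overlapping windows.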

## Ports

| Name | Description | Default |
|------|-------------|---------|
| FASTMCP_PORT | Memory MCP server port (SSE) | 8000 |
| FASTMCP_INDEXER_PORT | Indexer MCP server port (SSE) | 8001 |
| FASTMCP_HTTP_PORT | Memory RMCP host port mapping | 8002 |
| FASTMCP_INDEXER_HTTP_PORT | Indexer RMCP host port mapping | 8003 |
| FASTMCP_HEALTH_PORT | Health port (memory/indexer) | memory: 18000; indexer: 18001 |

## Search & Expansion

| Name | Description | Default |
|------|-------------|---------|
| HYBRID_EXPAND | Enable heuristic multi-query expansion | 0 (off) |
| LLM_EXPAND_MAX | Max alternate queries via LLM | 0 |

## Memory Blending

| Name | Description | Default |
|------|-------------|---------|
| MEMORY_SSE_ENABLED | Enable SSE memory blending | false |
| MEMORY_MCP_URL | Memory MCP endpoint for blending | http://mcp:8000/sse |
| MEMORY_MCP_TIMEOUT | Timeout for memory queries | 6 |
| MEMORY_AUTODETECT | Auto-detect memory collection | 1 |
| MEMORY_COLLECTION_TTL_SECS | Cache TTL for collection detection | 300 |
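
Since `MEMORY_SSE_ENABLED` defaults to off, a minimal enable looks like the following sketch (the endpoint shown is the documented default for the Docker network):

```shell
# Sketch: enable SSE memory blending against the default memory MCP endpoint.
export MEMORY_SSE_ENABLED=true
export MEMORY_MCP_URL=http://mcp:8000/sse  # in-network memory MCP endpoint
export MEMORY_MCP_TIMEOUT=6                # seconds per memory query
export MEMORY_AUTODETECT=1                 # auto-detect the memory collection
```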

---

## Exclusions (.qdrantignore)

The indexer supports a `.qdrantignore` file at the repo root (similar to `.gitignore`).

**Default exclusions** (overridable):
- `/models`, `/node_modules`, `/dist`, `/build`
- `/.venv`, `/venv`, `/__pycache__`, `/.git`
- `*.onnx`, `*.bin`, `*.safetensors`, `tokenizer.json`, `*.whl`, `*.tar.gz`

**Override via env or flags:**
```bash
# Disable defaults
QDRANT_DEFAULT_EXCLUDES=0

# Custom ignore file
QDRANT_IGNORE_FILE=.myignore

# Additional excludes
QDRANT_EXCLUDES='tokenizer.json,*.onnx,/third_party'
```

**CLI examples:**
```bash
docker compose run --rm indexer --root /work --ignore-file .qdrantignore
docker compose run --rm indexer --root /work --no-default-excludes --exclude '/vendor' --exclude '*.bin'
```

---

## Scaling Recommendations

| Repo Size | Chunk Lines | Overlap | Batch Size |
|-----------|------------|---------|------------|
| Small (<100 files) | 80-120 | 16-24 | 32-64 |
| Medium (100s-1k files) | 120-160 | ~20 | 64-128 |
| Large (1k+ files) | 120 (default) | 20 | 128+ |

For large monorepos, keep `INDEX_PROGRESS_EVERY` at its default of 200 (or lower it) so long indexing runs emit regular progress logs.
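
Pulling the large-repo row of the table into a concrete sketch (illustrative values, per the recommendations above):

```shell
# Hypothetical settings for a large monorepo (1k+ files).
export INDEX_CHUNK_LINES=120      # default chunk size works well at scale
export INDEX_CHUNK_OVERLAP=20     # default overlap
export INDEX_BATCH_SIZE=128       # larger upsert batches for throughput
export INDEX_PROGRESS_EVERY=200   # periodic progress logs for long runs
```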

166 changes: 166 additions & 0 deletions docs/CTX_CLI.md
@@ -0,0 +1,166 @@
# ctx.py - Prompt Enhancer CLI

A thin CLI that retrieves code context and rewrites your input into a better, context-aware prompt using the local LLM decoder. Works with both questions and commands/instructions.

**Documentation:** [README](../README.md) · [Configuration](CONFIGURATION.md) · [IDE Clients](IDE_CLIENTS.md) · [MCP API](MCP_API.md) · [ctx CLI](CTX_CLI.md) · [Memory Guide](MEMORY_GUIDE.md) · [Architecture](ARCHITECTURE.md) · [Multi-Repo](MULTI_REPO_COLLECTIONS.md) · [Kubernetes](../deploy/kubernetes/README.md) · [VS Code Extension](vscode-extension.md) · [Troubleshooting](TROUBLESHOOTING.md) · [Development](DEVELOPMENT.md)

---

**On this page:**
- [Basic Usage](#basic-usage)
- [Detail Mode](#detail-mode)
- [Unicorn Mode](#unicorn-mode)
- [Advanced Features](#advanced-features)
- [GPU Acceleration](#gpu-acceleration)
- [Configuration](#configuration)

---

## Basic Usage

```bash
# Questions: Enhanced with specific details and multiple aspects
scripts/ctx.py "What is ReFRAG?"

# Commands: Enhanced with concrete targets and implementation details
scripts/ctx.py "Refactor ctx.py"

# Via Make target
make ctx Q="Explain the caching logic to me in detail"

# Filter by language/path or adjust tokens
make ctx Q="Hybrid search details" ARGS="--language python --under scripts/ --limit 2 --rewrite-max-tokens 200"
```

## Detail Mode

Include compact code snippets in the retrieved context for richer rewrites (trades speed for quality):

```bash
# Enable detail mode (adds short snippets)
scripts/ctx.py "Explain the caching logic" --detail

# Detail mode with commands
scripts/ctx.py "Add error handling to ctx.py" --detail

# Adjust snippet size (default is 1 line when --detail is used)
make ctx Q="Explain hybrid search" ARGS="--detail --context-lines 2"
```

**Notes:**
- Default behavior is header-only (fastest). `--detail` adds short snippets.
- Detail mode is optimized for speed: it automatically clamps output to at most 4 results overall and 1 result per file.

## Unicorn Mode

Use `--unicorn` for the highest-quality prompt enhancement via a staged two- to three-pass approach:

```bash
# Unicorn mode with commands
scripts/ctx.py "refactor ctx.py" --unicorn

# Unicorn mode with questions
scripts/ctx.py "what is ReFRAG and how does it work?" --unicorn

# Works with all filters
scripts/ctx.py "add error handling" --unicorn --language python
```

**How it works:**

1. **Pass 1 (Draft)**: Retrieves rich code snippets (8 lines of context) to understand the codebase
2. **Pass 2 (Refine)**: Retrieves even richer snippets (12 lines) to ground the prompt with concrete code
3. **Pass 3 (Polish)**: Optional cleanup pass if output appears generic or incomplete

**Key features:**
- **Code-grounded**: References actual code behaviors and patterns
- **No hallucinations**: Only uses real code from your indexed repository
- **Multi-paragraph output**: Produces detailed, comprehensive prompts
- **Works with both questions and commands**

**When to use:**
- **Normal mode**: Quick, everyday prompts (fastest)
- **--detail**: Richer context without multi-pass overhead (balanced)
- **--unicorn**: When you need the absolute best prompt quality

## Advanced Features

### Streaming Output (Default)

All modes stream tokens as they arrive for instant feedback:

```bash
scripts/ctx.py "refactor ctx.py" --unicorn
```

To disable streaming, set `"streaming": false` in `~/.ctx_config.json`.

### Memory Blending

Automatically falls back to `context_search` with memories when repo search returns no hits:

```bash
# If no code matches, ctx.py will search design docs and ADRs
scripts/ctx.py "What is our authentication strategy?"
```

### Adaptive Context Sizing

Automatically adjusts `limit` and `context_lines` based on query characteristics:
- **Short/vague queries** → More context for richer grounding
- **Queries with file/function names** → Lighter settings for speed

### Automatic Quality Assurance

Enhanced `_needs_polish()` heuristic triggers a third polish pass when:
- Output is too short (< 180 chars)
- Contains generic/vague language
- Missing concrete code references
- Lacks proper paragraph structure

### Personalized Templates

Create `~/.ctx_config.json` to customize behavior:

```json
{
"always_include_tests": true,
"prefer_bullet_commands": false,
"extra_instructions": "Always consider error handling and edge cases",
"streaming": true
}
```

**Available preferences:**
- `always_include_tests`: Add testing considerations to all prompts
- `prefer_bullet_commands`: Format commands as bullet points
- `extra_instructions`: Custom instructions added to every rewrite
- `streaming`: Enable/disable streaming output (default: true)

See `ctx_config.example.json` for a template.

## GPU Acceleration

For faster prompt rewriting, use the native Metal-accelerated decoder:

```bash
# Start the native llama.cpp server with Metal GPU
scripts/gpu_toggle.sh start

# Now ctx.py will automatically use the GPU decoder on port 8081
make ctx Q="Explain the caching logic"

# Stop the native GPU server
scripts/gpu_toggle.sh stop
```

## Configuration

| Setting | Description | Default |
|---------|-------------|---------|
| MCP_INDEXER_URL | Indexer HTTP RMCP endpoint | http://localhost:8003/mcp |
| USE_GPU_DECODER | Auto-detect GPU mode | 0 |
| LLAMACPP_URL | Docker decoder endpoint | http://localhost:8080 |

GPU decoder (after `scripts/gpu_toggle.sh start`): http://localhost:8081/completion

13 changes: 13 additions & 0 deletions docs/DEVELOPMENT.md
@@ -2,6 +2,19 @@

This guide covers setting up a development environment, understanding the codebase structure, and contributing to Context Engine.

**Documentation:** [README](../README.md) · [Configuration](CONFIGURATION.md) · [IDE Clients](IDE_CLIENTS.md) · [MCP API](MCP_API.md) · [ctx CLI](CTX_CLI.md) · [Memory Guide](MEMORY_GUIDE.md) · [Architecture](ARCHITECTURE.md) · [Multi-Repo](MULTI_REPO_COLLECTIONS.md) · [Kubernetes](../deploy/kubernetes/README.md) · [VS Code Extension](vscode-extension.md) · [Troubleshooting](TROUBLESHOOTING.md) · [Development](DEVELOPMENT.md)

---

**On this page:**
- [Prerequisites](#prerequisites)
- [Quick Start](#quick-start)
- [Project Structure](#project-structure)
- [Testing](#testing)
- [Docker Development](#docker-development)

---

## Prerequisites

### Required Software