1,536 changes: 118 additions & 1,418 deletions README.md

Large diffs are not rendered by default.

4 changes: 4 additions & 0 deletions deploy/kubernetes/README.md
@@ -1,5 +1,9 @@
# Kubernetes Deployment Guide

**Documentation:** [README](../../README.md) · [Configuration](../../docs/CONFIGURATION.md) · [IDE Clients](../../docs/IDE_CLIENTS.md) · [MCP API](../../docs/MCP_API.md) · [ctx CLI](../../docs/CTX_CLI.md) · [Memory Guide](../../docs/MEMORY_GUIDE.md) · [Architecture](../../docs/ARCHITECTURE.md) · [Multi-Repo](../../docs/MULTI_REPO_COLLECTIONS.md) · Kubernetes · [VS Code Extension](../../docs/vscode-extension.md) · [Troubleshooting](../../docs/TROUBLESHOOTING.md) · [Development](../../docs/DEVELOPMENT.md)

---

## Overview

This directory contains Kubernetes manifests for deploying Context Engine on a remote cluster using **Kustomize**. This enables:
13 changes: 13 additions & 0 deletions docs/ARCHITECTURE.md
@@ -1,5 +1,18 @@
# Context Engine Architecture

**Documentation:** [README](../README.md) · [Configuration](CONFIGURATION.md) · [IDE Clients](IDE_CLIENTS.md) · [MCP API](MCP_API.md) · [ctx CLI](CTX_CLI.md) · [Memory Guide](MEMORY_GUIDE.md) · [Architecture](ARCHITECTURE.md) · [Multi-Repo](MULTI_REPO_COLLECTIONS.md) · [Kubernetes](../deploy/kubernetes/README.md) · [VS Code Extension](vscode-extension.md) · [Troubleshooting](TROUBLESHOOTING.md) · [Development](DEVELOPMENT.md)

---

**On this page:**
- [Overview](#overview)
- [Core Principles](#core-principles)
- [System Architecture](#system-architecture)
- [Data Flow](#data-flow)
- [ReFRAG Pipeline](#refrag-pipeline)

---

## Overview

Context Engine is a production-ready MCP (Model Context Protocol) retrieval stack that unifies code indexing, hybrid search, and optional LLM decoding. It enables teams to ship context-aware AI agents by providing sophisticated semantic and lexical search capabilities with dual-transport compatibility.
161 changes: 161 additions & 0 deletions docs/CONFIGURATION.md
@@ -0,0 +1,161 @@
# Configuration Reference

Complete environment variable reference for Context Engine.

**Documentation:** [README](../README.md) · [Configuration](CONFIGURATION.md) · [IDE Clients](IDE_CLIENTS.md) · [MCP API](MCP_API.md) · [ctx CLI](CTX_CLI.md) · [Memory Guide](MEMORY_GUIDE.md) · [Architecture](ARCHITECTURE.md) · [Multi-Repo](MULTI_REPO_COLLECTIONS.md) · [Kubernetes](../deploy/kubernetes/README.md) · [VS Code Extension](vscode-extension.md) · [Troubleshooting](TROUBLESHOOTING.md) · [Development](DEVELOPMENT.md)

---

**On this page:**
- [Core Settings](#core-settings)
- [Indexing & Micro-Chunks](#indexing--micro-chunks)
- [Watcher Settings](#watcher-settings)
- [Reranker](#reranker)
- [Decoder (llama.cpp / GLM)](#decoder-llamacpp--glm)
- [ReFRAG (Micro-Chunking & Retrieval)](#refrag-micro-chunking--retrieval)
- [Ports](#ports)
- [Search & Expansion](#search--expansion)
- [Memory Blending](#memory-blending)

---

## Core Settings

| Name | Description | Default |
|------|-------------|---------|
| COLLECTION_NAME | Qdrant collection name (unified across all repos) | codebase |
| REPO_NAME | Logical repo tag stored in payload for filtering | auto-detect from git/folder |
| HOST_INDEX_PATH | Host path mounted at /work in containers | current repo (.) |
| QDRANT_URL | Qdrant base URL | container: http://qdrant:6333; local: http://localhost:6333 |
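
A minimal sketch of exporting the core settings before a local run; the repo name and paths are illustrative, not defaults you must use:

```shell
# Illustrative core settings; adjust to your environment.
export COLLECTION_NAME=codebase          # unified Qdrant collection
export REPO_NAME=my-service              # hypothetical repo tag stored in payloads
export HOST_INDEX_PATH="$PWD"            # host path mounted at /work in containers
export QDRANT_URL=http://localhost:6333  # local (non-container) Qdrant endpoint
```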

## Indexing & Micro-Chunks

| Name | Description | Default |
|------|-------------|---------|
| INDEX_MICRO_CHUNKS | Enable token-based micro-chunking | 0 (off) |
| MAX_MICRO_CHUNKS_PER_FILE | Cap micro-chunks per file | 200 |
| TOKENIZER_URL | HF tokenizer.json URL (for Make download) | n/a |
| TOKENIZER_PATH | Local path where tokenizer is saved (Make) | models/tokenizer.json |
| TOKENIZER_JSON | Runtime path for tokenizer (indexer) | models/tokenizer.json |
| USE_TREE_SITTER | Enable tree-sitter parsing (py/js/ts) | 0 (off) |
| INDEX_CHUNK_LINES | Lines per chunk (non-micro mode) | 120 |
| INDEX_CHUNK_OVERLAP | Overlap lines between chunks | 20 |
| INDEX_BATCH_SIZE | Upsert batch size | 64 |
| INDEX_PROGRESS_EVERY | Log progress every N files | 200 |

## Watcher Settings

| Name | Description | Default |
|------|-------------|---------|
| WATCH_DEBOUNCE_SECS | Debounce between FS events | 1.5 |
| INDEX_UPSERT_BATCH | Upsert batch size (watcher) | 128 |
| INDEX_UPSERT_RETRIES | Retry count | 5 |
| INDEX_UPSERT_BACKOFF | Seconds between retries | 0.5 |
| QDRANT_TIMEOUT | HTTP timeout seconds | watcher: 60; search: 20 |
| MCP_TOOL_TIMEOUT_SECS | Max duration for long-running MCP tools | 3600 |
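
As one hypothetical tuning example (these values are not defaults), a slow or remote Qdrant instance can be accommodated by debouncing harder and batching larger:

```shell
# Hypothetical tuning for a slow or remote Qdrant instance.
export WATCH_DEBOUNCE_SECS=3      # coalesce bursts of filesystem events
export INDEX_UPSERT_BATCH=256     # larger batches, fewer round trips
export INDEX_UPSERT_RETRIES=8     # tolerate transient failures
export INDEX_UPSERT_BACKOFF=1.0   # seconds between retries
export QDRANT_TIMEOUT=120         # generous HTTP timeout for big upserts
```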

## Reranker

| Name | Description | Default |
|------|-------------|---------|
| RERANKER_ONNX_PATH | Local ONNX cross-encoder model path | unset |
| RERANKER_TOKENIZER_PATH | Tokenizer path for reranker | unset |
| RERANKER_ENABLED | Enable reranker by default | 1 (enabled) |

## Decoder (llama.cpp / GLM)

| Name | Description | Default |
|------|-------------|---------|
| REFRAG_DECODER | Enable decoder for context_answer | 1 (enabled) |
| REFRAG_RUNTIME | Decoder backend: llamacpp or glm | llamacpp |
| LLAMACPP_URL | llama.cpp server endpoint | http://llamacpp:8080 or http://host.docker.internal:8081 |
| LLAMACPP_TIMEOUT_SEC | Decoder request timeout | 300 |
| DECODER_MAX_TOKENS | Max tokens for decoder responses | 4000 |
| REFRAG_DECODER_MODE | prompt or soft (soft requires patched llama.cpp) | prompt |
| GLM_API_KEY | API key for GLM provider | unset |
| GLM_MODEL | GLM model name | glm-4.6 |
| USE_GPU_DECODER | Native Metal decoder (1) vs Docker (0) | 0 (docker) |
| LLAMACPP_GPU_LAYERS | Number of layers to offload to GPU, -1 for all | 32 |
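
For example, switching the `context_answer` decoder from the default llama.cpp backend to GLM is a matter of flipping the runtime and supplying credentials (the key below is a placeholder):

```shell
# Sketch: use the GLM provider instead of llama.cpp for context_answer.
export REFRAG_DECODER=1            # keep the decoder enabled
export REFRAG_RUNTIME=glm          # switch backend from llamacpp to glm
export GLM_API_KEY=sk-example-key  # placeholder, not a real key
export GLM_MODEL=glm-4.6           # default GLM model name
```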

## ReFRAG (Micro-Chunking & Retrieval)

| Name | Description | Default |
|------|-------------|---------|
| REFRAG_MODE | Enable micro-chunking and span budgeting | 1 (enabled) |
| REFRAG_GATE_FIRST | Enable mini-vector gating | 1 (enabled) |
| REFRAG_CANDIDATES | Candidates for gate-first filtering | 200 |
| MICRO_BUDGET_TOKENS | Token budget for context_answer | 512 |
| MICRO_OUT_MAX_SPANS | Max spans returned per query | 3 |
| MICRO_CHUNK_TOKENS | Tokens per micro-chunk window | 16 |
| MICRO_CHUNK_STRIDE | Stride between windows | 8 |
| MICRO_MERGE_LINES | Lines to merge adjacent spans | 4 |
| MICRO_TOKENS_PER_LINE | Estimated tokens per line | 32 |
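
With the defaults above, a sliding window of `MICRO_CHUNK_TOKENS=16` advanced by `MICRO_CHUNK_STRIDE=8` produces roughly `(tokens - window) / stride + 1` windows per file. A quick back-of-the-envelope check for a hypothetical ~800-token file:

```shell
# Estimate micro-chunk windows for a file of ~800 tokens (illustrative).
tokens=800
window=16   # MICRO_CHUNK_TOKENS
stride=8    # MICRO_CHUNK_STRIDE
windows=$(( (tokens - window) / stride + 1 ))
echo "$windows"   # 99
```

This is why `MAX_MICRO_CHUNKS_PER_FILE` (default 200) exists: large files would otherwise emit hundreds of overlapping windows.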

## Ports

| Name | Description | Default |
|------|-------------|---------|
| FASTMCP_PORT | Memory MCP server port (SSE) | 8000 |
| FASTMCP_INDEXER_PORT | Indexer MCP server port (SSE) | 8001 |
| FASTMCP_HTTP_PORT | Memory RMCP host port mapping | 8002 |
| FASTMCP_INDEXER_HTTP_PORT | Indexer RMCP host port mapping | 8003 |
| FASTMCP_HEALTH_PORT | Health port (memory/indexer) | memory: 18000; indexer: 18001 |

## Search & Expansion

| Name | Description | Default |
|------|-------------|---------|
| HYBRID_EXPAND | Enable heuristic multi-query expansion | 0 (off) |
| LLM_EXPAND_MAX | Max alternate queries via LLM | 0 |

## Memory Blending

| Name | Description | Default |
|------|-------------|---------|
| MEMORY_SSE_ENABLED | Enable SSE memory blending | false |
| MEMORY_MCP_URL | Memory MCP endpoint for blending | http://mcp:8000/sse |
| MEMORY_MCP_TIMEOUT | Timeout for memory queries | 6 |
| MEMORY_AUTODETECT | Auto-detect memory collection | 1 |
| MEMORY_COLLECTION_TTL_SECS | Cache TTL for collection detection | 300 |
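
Since `MEMORY_SSE_ENABLED` defaults to off, a minimal enable looks like the following sketch (the endpoint shown is the documented default for the Docker network):

```shell
# Sketch: enable SSE memory blending against the default memory MCP endpoint.
export MEMORY_SSE_ENABLED=true
export MEMORY_MCP_URL=http://mcp:8000/sse  # in-network memory MCP endpoint
export MEMORY_MCP_TIMEOUT=6                # seconds per memory query
export MEMORY_AUTODETECT=1                 # auto-detect the memory collection
```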

---

## Exclusions (.qdrantignore)

The indexer supports a `.qdrantignore` file at the repo root (similar to `.gitignore`).

**Default exclusions** (overridable):
- `/models`, `/node_modules`, `/dist`, `/build`
- `/.venv`, `/venv`, `/__pycache__`, `/.git`
- `*.onnx`, `*.bin`, `*.safetensors`, `tokenizer.json`, `*.whl`, `*.tar.gz`

**Override via env or flags:**
```bash
# Disable defaults
QDRANT_DEFAULT_EXCLUDES=0

# Custom ignore file
QDRANT_IGNORE_FILE=.myignore

# Additional excludes
QDRANT_EXCLUDES='tokenizer.json,*.onnx,/third_party'
```

**CLI examples:**
```bash
docker compose run --rm indexer --root /work --ignore-file .qdrantignore
docker compose run --rm indexer --root /work --no-default-excludes --exclude '/vendor' --exclude '*.bin'
```

---

## Scaling Recommendations

| Repo Size | Chunk Lines | Overlap | Batch Size |
|-----------|------------|---------|------------|
| Small (<100 files) | 80-120 | 16-24 | 32-64 |
| Medium (100s-1k files) | 120-160 | ~20 | 64-128 |
| Large (1k+ files) | 120 (default) | 20 | 128+ |

For large monorepos, keep `INDEX_PROGRESS_EVERY` at its default of 200 (or lower it) so long indexing runs emit regular progress logs.
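
Pulling the large-repo row of the table into a concrete sketch (illustrative values, per the recommendations above):

```shell
# Hypothetical settings for a large monorepo (1k+ files).
export INDEX_CHUNK_LINES=120      # default chunk size works well at scale
export INDEX_CHUNK_OVERLAP=20     # default overlap
export INDEX_BATCH_SIZE=128       # larger upsert batches for throughput
export INDEX_PROGRESS_EVERY=200   # periodic progress logs for long runs
```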

166 changes: 166 additions & 0 deletions docs/CTX_CLI.md
@@ -0,0 +1,166 @@
# ctx.py - Prompt Enhancer CLI

A thin CLI that retrieves code context and rewrites your input into a better, context-aware prompt using the local LLM decoder. Works with both questions and commands/instructions.

**Documentation:** [README](../README.md) · [Configuration](CONFIGURATION.md) · [IDE Clients](IDE_CLIENTS.md) · [MCP API](MCP_API.md) · [ctx CLI](CTX_CLI.md) · [Memory Guide](MEMORY_GUIDE.md) · [Architecture](ARCHITECTURE.md) · [Multi-Repo](MULTI_REPO_COLLECTIONS.md) · [Kubernetes](../deploy/kubernetes/README.md) · [VS Code Extension](vscode-extension.md) · [Troubleshooting](TROUBLESHOOTING.md) · [Development](DEVELOPMENT.md)

---

**On this page:**
- [Basic Usage](#basic-usage)
- [Detail Mode](#detail-mode)
- [Unicorn Mode](#unicorn-mode)
- [Advanced Features](#advanced-features)
- [GPU Acceleration](#gpu-acceleration)
- [Configuration](#configuration)

---

## Basic Usage

```bash
# Questions: Enhanced with specific details and multiple aspects
scripts/ctx.py "What is ReFRAG?"

# Commands: Enhanced with concrete targets and implementation details
scripts/ctx.py "Refactor ctx.py"

# Via Make target
make ctx Q="Explain the caching logic to me in detail"

# Filter by language/path or adjust tokens
make ctx Q="Hybrid search details" ARGS="--language python --under scripts/ --limit 2 --rewrite-max-tokens 200"
```

## Detail Mode

Include compact code snippets in the retrieved context for richer rewrites (trades speed for quality):

```bash
# Enable detail mode (adds short snippets)
scripts/ctx.py "Explain the caching logic" --detail

# Detail mode with commands
scripts/ctx.py "Add error handling to ctx.py" --detail

# Adjust snippet size (default is 1 line when --detail is used)
make ctx Q="Explain hybrid search" ARGS="--detail --context-lines 2"
```

**Notes:**
- Default behavior is header-only (fastest). `--detail` adds short snippets.
- Detail mode is optimized for speed: it automatically clamps output to at most 4 results overall and 1 result per file.

## Unicorn Mode

Use `--unicorn` for the highest-quality prompt enhancement via a staged two- to three-pass approach:

```bash
# Unicorn mode with commands
scripts/ctx.py "refactor ctx.py" --unicorn

# Unicorn mode with questions
scripts/ctx.py "what is ReFRAG and how does it work?" --unicorn

# Works with all filters
scripts/ctx.py "add error handling" --unicorn --language python
```

**How it works:**

1. **Pass 1 (Draft)**: Retrieves rich code snippets (8 lines of context) to understand the codebase
2. **Pass 2 (Refine)**: Retrieves even richer snippets (12 lines) to ground the prompt with concrete code
3. **Pass 3 (Polish)**: Optional cleanup pass if output appears generic or incomplete

**Key features:**
- **Code-grounded**: References actual code behaviors and patterns
- **No hallucinations**: Only uses real code from your indexed repository
- **Multi-paragraph output**: Produces detailed, comprehensive prompts
- **Works with both questions and commands**

**When to use:**
- **Normal mode**: Quick, everyday prompts (fastest)
- **--detail**: Richer context without multi-pass overhead (balanced)
- **--unicorn**: When you need the absolute best prompt quality

## Advanced Features

### Streaming Output (Default)

All modes stream tokens as they arrive for instant feedback:

```bash
scripts/ctx.py "refactor ctx.py" --unicorn
```

To disable streaming, set `"streaming": false` in `~/.ctx_config.json`.

### Memory Blending

Automatically falls back to `context_search` with memories when repo search returns no hits:

```bash
# If no code matches, ctx.py will search design docs and ADRs
scripts/ctx.py "What is our authentication strategy?"
```

### Adaptive Context Sizing

Automatically adjusts `limit` and `context_lines` based on query characteristics:
- **Short/vague queries** → More context for richer grounding
- **Queries with file/function names** → Lighter settings for speed

### Automatic Quality Assurance

Enhanced `_needs_polish()` heuristic triggers a third polish pass when:
- Output is too short (< 180 chars)
- Contains generic/vague language
- Missing concrete code references
- Lacks proper paragraph structure

### Personalized Templates

Create `~/.ctx_config.json` to customize behavior:

```json
{
"always_include_tests": true,
"prefer_bullet_commands": false,
"extra_instructions": "Always consider error handling and edge cases",
"streaming": true
}
```

**Available preferences:**
- `always_include_tests`: Add testing considerations to all prompts
- `prefer_bullet_commands`: Format commands as bullet points
- `extra_instructions`: Custom instructions added to every rewrite
- `streaming`: Enable/disable streaming output (default: true)

See `ctx_config.example.json` for a template.

## GPU Acceleration

For faster prompt rewriting, use the native Metal-accelerated decoder:

```bash
# Start the native llama.cpp server with Metal GPU
scripts/gpu_toggle.sh start

# Now ctx.py will automatically use the GPU decoder on port 8081
make ctx Q="Explain the caching logic"

# Stop the native GPU server
scripts/gpu_toggle.sh stop
```

## Configuration

| Setting | Description | Default |
|---------|-------------|---------|
| MCP_INDEXER_URL | Indexer HTTP RMCP endpoint | http://localhost:8003/mcp |
| USE_GPU_DECODER | Auto-detect GPU mode | 0 |
| LLAMACPP_URL | Docker decoder endpoint | http://localhost:8080 |

GPU decoder (after `scripts/gpu_toggle.sh start`): http://localhost:8081/completion

13 changes: 13 additions & 0 deletions docs/DEVELOPMENT.md
@@ -2,6 +2,19 @@

This guide covers setting up a development environment, understanding the codebase structure, and contributing to Context Engine.

**Documentation:** [README](../README.md) · [Configuration](CONFIGURATION.md) · [IDE Clients](IDE_CLIENTS.md) · [MCP API](MCP_API.md) · [ctx CLI](CTX_CLI.md) · [Memory Guide](MEMORY_GUIDE.md) · [Architecture](ARCHITECTURE.md) · [Multi-Repo](MULTI_REPO_COLLECTIONS.md) · [Kubernetes](../deploy/kubernetes/README.md) · [VS Code Extension](vscode-extension.md) · [Troubleshooting](TROUBLESHOOTING.md) · [Development](DEVELOPMENT.md)

---

**On this page:**
- [Prerequisites](#prerequisites)
- [Quick Start](#quick-start)
- [Project Structure](#project-structure)
- [Testing](#testing)
- [Docker Development](#docker-development)

---

## Prerequisites

### Required Software