A local RAG system that indexes documents from filesystem folders, builds a hybrid search index (dense vectors + keyword), and exposes retrieval via MCP to Claude Desktop and Claude Code. Everything runs locally -- a single Python process plus a Qdrant Docker container. No cloud infrastructure required.
- Hybrid search -- dense vector (BGE-M3) + keyword search with RRF fusion
- ONNX cross-encoder reranking for high-precision results
- MCP integration -- works as a tool for Claude Desktop and Claude Code, with server instructions, enriched tool descriptions, and built-in prompts (research, discover, catch-up)
- Multi-format parsing -- PDF, DOCX, TXT, MD via Docling with OCR support
- Filesystem watching -- automatic re-indexing when documents change
- Document summarization -- shells out to any LLM CLI tool (claude, etc.); see the sketch after this list
- Semantic chunking (opt-in) -- sentence-embedding boundary detection using simplified Max-Min algorithm with BGE-M3, as an alternative to fixed-size chunking
- Auto-generated questions -- LLM generates 3 questions per chunk at index time, enriching dense vectors and keyword search for better recall
- Fully local -- all models run on-device, no API keys needed for core search
- Python 3.11+
- Docker (for Qdrant)
- 4GB+ RAM for embedding and reranker models
- macOS or Linux
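Both summarization and question generation reuse the same LLM shell-out mechanism. Here is a minimal sketch of that pattern, assuming a `claude`-style CLI that reads the prompt on stdin and prints its answer to stdout; the function name and defaults are illustrative, mirroring the `[summarization]` config keys shown later:

```python
import subprocess

def run_llm_cli(
    prompt: str,
    command: str = "claude",
    args: list[str] | None = None,
    timeout_seconds: int = 60,
) -> str:
    """Pipe a prompt to a configured LLM CLI and return its stdout."""
    argv = [command, *(args or ["--print"])]
    result = subprocess.run(
        argv,
        input=prompt,             # prompt goes in on stdin
        capture_output=True,
        text=True,
        timeout=timeout_seconds,  # mirrors [summarization].timeout_seconds
        check=True,               # raise if the CLI exits non-zero
    )
    return result.stdout.strip()
```

Because the contract is just stdin/stdout, any CLI that accepts a prompt and prints a completion can be swapped in via the `[summarization]` config.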
```bash
# 1. Clone and install
git clone https://github.com/dgourlay/local-rag.git
cd local-rag
make setup    # creates venv, installs deps, starts Qdrant

# 2. Activate venv and configure
source .venv/bin/activate
rag init      # interactive wizard: pick folders, detect LLM CLI

# 3. Index and search
rag index     # scans folders, parses, chunks, embeds (downloads models on first run)
rag search "your query here"
```

That's it. Models (BGE-M3 embeddings ~4.3 GB, BGE reranker ~2.1 GB) download automatically on first use.
To pre-download models so the first search is instant:
```bash
make download-models
```

The main use case is as an MCP tool server so your LLM can search your documents.
Option A -- auto-install:
```bash
rag mcp-config --install claude-code
```

Option B -- manual: Add this to your `~/.claude.json` (or project-level `.mcp.json`), replacing the Python path with your venv path:
```json
{
  "mcpServers": {
    "local-rag": {
      "command": "/path/to/local-rag/.venv/bin/python",
      "args": ["-m", "rag.cli", "serve"]
    }
  }
}
```

To find your exact Python path, run `rag mcp-config --print`.
After adding the config, restart Claude Code. The tools will be available immediately -- Claude will automatically use them when you ask about your documents.
Option A -- auto-install:
```bash
rag mcp-config --install claude-desktop
```

This writes to `~/Library/Application Support/Claude/claude_desktop_config.json`.
Option B -- manual: Open Claude Desktop settings, go to Developer > MCP Servers, and add:
```json
"local-rag": {
  "command": "/path/to/local-rag/.venv/bin/python",
  "args": ["-m", "rag.cli", "serve"]
}
```

Restart Claude Desktop after adding the config.
Option A -- auto-install:
```bash
rag mcp-config --install kiro
```

This writes to `~/.kiro/settings/mcp.json` (user-level). For project-level config, use Option B with `.kiro/settings/mcp.json` in your project root.
Option B -- manual: Add to `~/.kiro/settings/mcp.json`:

```json
{
  "mcpServers": {
    "local-rag": {
      "command": "/path/to/local-rag/.venv/bin/python",
      "args": ["-m", "rag.cli", "serve"]
    }
  }
}
```

Option C -- kiro-cli:

```bash
kiro-cli mcp add \
  --name "local-rag" \
  --scope global \
  --command "/path/to/local-rag/.venv/bin/python" \
  --args "-m rag.cli serve"
```

Run `rag mcp-config --print` to get the exact Python path for your venv.
The server registers three prompts, available as slash commands in Claude Code:
| Prompt | Description |
|---|---|
| `/research` | Deep research on a topic -- scouts documents, extracts evidence, synthesizes findings |
| `/discover` | Explore what's in the index -- browse recent documents, discover topics |
| `/catch-up` | Summarize recent changes in a folder or across all indexed documents |
The server also provides ~600 words of instructions that guide the LLM through a recommended scout-then-search-then-drill-down workflow, plus query tips and a list of configured folders.
Once connected, your LLM has access to these tools (with enriched descriptions for better LLM tool selection):
| Tool | Description |
|---|---|
| `search_documents` | Hybrid search with reranking. Accepts `query`, optional `folder_filter`, `date_filter`, `top_k`, `format` ("text" or "json"). Returns cited evidence passages with scores. |
| `quick_search` | Lightweight document-level scan returning document summaries. Faster than `search_documents` for broad queries. |
| `get_document_context` | Get a document overview (summary + sections) by `doc_id`, or a chunk with surrounding context by `chunk_id`. |
| `list_recent_documents` | List recently indexed documents, optionally filtered by folder. |
| `get_sync_status` | Check indexing health: total files, indexed count, errors, per-folder breakdown. |
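Outside of a chat client, you can exercise the same tools programmatically. Below is a minimal sketch using the official `mcp` Python SDK (an extra dependency, not something local-rag itself requires) to spawn the server over stdio and call `search_documents`; adjust the venv path for your machine:

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    # Launch the local-rag MCP server over stdio, using the same
    # command and args as the client configs above.
    server = StdioServerParameters(
        command="/path/to/local-rag/.venv/bin/python",
        args=["-m", "rag.cli", "serve"],
    )
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(
                "search_documents",
                {"query": "gate operations procedures", "top_k": 5},
            )
            # Print the text content blocks returned by the tool.
            for block in result.content:
                if block.type == "text":
                    print(block.text)

asyncio.run(main())
```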
After setup, ask your LLM something like:
"What documents do I have indexed?" (uses
list_recent_documents)"Search my documents for gate operations procedures" (uses
search_documents)
If it responds with content from your documents, the MCP connection is working. If not, check `rag doctor` and verify that the Python path in your MCP config points to the correct venv.
| Command | Description |
|---|---|
| `rag init` | Interactive setup wizard -- configures folders, LLM CLI, Qdrant |
| `rag index` | Full scan and process all documents in configured folders |
| `rag index --reindex` | Purge all index data and re-process everything from scratch |
| `rag index --reindex FILE` | Clear index state for a single file and re-process it |
| `rag serve` | Start the MCP server (stdio transport by default) |
| `rag watch` | Filesystem watcher -- auto-indexes on document changes |
| `rag status` | Dashboard showing document/chunk/error counts, MCP health, liveness |
| `rag doctor` | Health check -- verifies Qdrant, models, folders |
| `rag search "query"` | CLI search for testing (`--debug` for lane/weight details, `--top-k N`) |
| `rag mcp-config` | Print or install MCP config (`--print`, `--install`) |
Config is TOML, resolved in this order (first match wins):
1. `RAG_CONFIG_PATH` environment variable
2. `./config.toml` in the current directory
3. `~/.config/local-rag/config.toml` (default, created by `rag init`)
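As a sketch, the resolution order above amounts to the following (illustrative only; the actual loader lives in `config.py`):

```python
import os
from pathlib import Path

def resolve_config_path() -> Path | None:
    """Return the first config path that exists, mirroring the order above."""
    candidates = [
        os.environ.get("RAG_CONFIG_PATH"),  # 1. env var override
        "./config.toml",                    # 2. current directory
        "~/.config/local-rag/config.toml",  # 3. default location
    ]
    for candidate in candidates:
        if candidate and (path := Path(candidate).expanduser()).is_file():
            return path
    return None
```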
Only `[folders].paths` is required. Everything else has sensible defaults. See `config.example.toml` for all options with defaults.
Minimal config:
```toml
[folders]
paths = ["~/Documents"]
```

Full config with all defaults shown:
```toml
[folders]
paths = ["~/Documents", "~/Work"]
extensions = ["pdf", "docx", "txt", "md"]
ignore = ["**/node_modules", "**/.git", "**/venv", "**/__pycache__"]

[database]
path = "~/.local/share/local-rag/metadata.db"

[qdrant]
url = "http://localhost:6333"
collection = "documents"

[embedding]
model = "BAAI/bge-m3"
dimensions = 1024
batch_size = 32
cache_dir = "~/.cache/local-rag/models"

[reranker]
model_path = "~/.cache/local-rag/models/bge-reranker-v2-m3"
top_k_candidates = 30
top_k_final = 10

[summarization]
enabled = true
command = "claude"
args = ["--print", "--max-tokens", "2048"]
timeout_seconds = 60

[chunking]
strategy = "fixed"          # "fixed" (default) or "semantic"
similarity_threshold = 0.35 # semantic only: boundary detection threshold
max_chunk_tokens = 768      # semantic only: max tokens per chunk

[questions]
enabled = true # auto-generate 3 questions per chunk at index time

[mcp]
transport = "stdio"
host = "127.0.0.1"
port = 8080

[watcher]
poll_interval_seconds = 5
debounce_seconds = 2
use_polling = false
batch_window_seconds = 10
```

A single Python process handles indexing: filesystem watching (watchdog) -> Docling parsing (in a subprocess for memory isolation) -> normalization -> dedup -> chunking (fixed 512-token or opt-in semantic via BGE-M3 sentence embeddings) -> embedding (BGE-M3, 1024-dim) -> auto-question generation (an LLM generates 3 questions per chunk, prepended before embedding) -> summarization (LLM CLI) -> Qdrant indexing. Retrieval runs as: 3-lane prefetch (document summaries, section summaries, chunks) -> RRF fusion with layer weighting -> recency boost -> ONNX cross-encoder reranking -> cited evidence returned to the calling LLM.
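To make the fusion step concrete, here is a minimal sketch of weighted reciprocal rank fusion over the three lanes; the lane weights and the k = 60 damping constant are illustrative assumptions, not the project's tuned values:

```python
from collections import defaultdict

def rrf_fuse(
    lanes: dict[str, list[str]],  # lane name -> result ids ranked best-first
    weights: dict[str, float],    # per-lane layer weight
    k: int = 60,                  # standard RRF damping constant
) -> list[tuple[str, float]]:
    """Fuse ranked lists: score(id) = sum over lanes of w / (k + rank)."""
    scores = defaultdict(float)
    for lane, ranked_ids in lanes.items():
        w = weights.get(lane, 1.0)
        for rank, item_id in enumerate(ranked_ids, start=1):
            scores[item_id] += w / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Example: chunk lane weighted highest, as in a 3-lane prefetch.
fused = rrf_fuse(
    lanes={"documents": ["d1", "d2"], "sections": ["s3", "d1"], "chunks": ["c9", "d1"]},
    weights={"documents": 0.5, "sections": 0.8, "chunks": 1.0},
)
```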
```text
src/rag/
  cli.py              # CLI entry points (click)
  config.py           # TOML config loader
  init.py             # Setup wizard
  types.py            # Pydantic models, enums, type aliases
  protocols.py        # Protocol classes (Embedder, Summarizer, etc.)
  results.py          # Discriminated union Result types
  sync/               # Filesystem scanner + watcher
  pipeline/           # classify -> parse -> normalize -> dedup -> chunk -> embed -> questions -> summarize -> index
  parser/             # Docling (PDF/DOCX) + text fallback (TXT/MD)
  chunker_semantic.py # Semantic chunking (opt-in, Max-Min algorithm)
  retrieval/          # 3-lane prefetch + RRF + layer weighting + reranker + citations
  mcp/                # MCP server (stdio + HTTP) + 5 tools + 3 prompts + server instructions
  prompts.py          # MCP prompts (research, discover, catch-up)
  db/                 # SQLite + Qdrant clients
  migrations/         # SQL schema
tests/                # Unit + e2e tests
```
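A simplified sketch of the boundary detection behind `chunker_semantic.py`: embed sentences, then start a new chunk wherever adjacent-sentence cosine similarity falls below `similarity_threshold`. This shows the general idea, not the exact Max-Min algorithm, and it omits the `max_chunk_tokens` cap; `embed` stands in for BGE-M3 sentence embeddings:

```python
import numpy as np

def semantic_chunks(
    sentences: list[str],
    embed,  # callable: list[str] -> np.ndarray of shape (n, dim)
    similarity_threshold: float = 0.35,
) -> list[list[str]]:
    """Group sentences, starting a new chunk at each semantic boundary."""
    if not sentences:
        return []
    vecs = embed(sentences)
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize
    chunks: list[list[str]] = [[sentences[0]]]
    for i in range(1, len(sentences)):
        similarity = float(vecs[i - 1] @ vecs[i])  # cosine of adjacent pair
        if similarity < similarity_threshold:
            chunks.append([])  # low similarity -> topic boundary
        chunks[-1].append(sentences[i])
    return chunks
```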
```bash
make lint     # ruff check + format check + mypy strict
make test     # unit tests (fast, no Docker)
make test-e2e # end-to-end (requires Qdrant + models)
make test-all # lint + test + test-e2e
make format   # auto-format with ruff
```

local-rag stores ML models, config, and data across several directories (~6.4 GB total, mostly models).
| What | Path | Size |
|---|---|---|
| Embedding model (BGE-M3) | `~/.cache/local-rag/models/models--BAAI--bge-m3/` | ~4.3 GB |
| Reranker model (ONNX) | `~/.cache/local-rag/models/bge-reranker-v2-m3/` | ~2.1 GB |
| SQLite database | `~/.local/share/local-rag/metadata.db` | ~1 MB |
| Config file | `~/.config/local-rag/config.toml` | < 1 KB |
| Qdrant data | Docker volume `local-rag_qdrant_data` | Varies |
```bash
# 1. Stop and remove Qdrant container + data
docker compose down -v

# 2. Remove cached models (~6.4 GB)
rm -rf ~/.cache/local-rag

# 3. Remove database and application data
rm -rf ~/.local/share/local-rag

# 4. Remove config
rm -rf ~/.config/local-rag

# 5. Remove MCP config entries (if installed)
# Edit the relevant file and remove the "local-rag" entry:
#   Claude Code:    ~/.claude.json
#   Claude Desktop: ~/Library/Application Support/Claude/claude_desktop_config.json
#   Kiro:           ~/.kiro/settings/mcp.json

# 6. Uninstall the Python package
pip uninstall local-rag
```

```bash
# Remove models only (~6.4 GB, re-downloads on next use)
rm -rf ~/.cache/local-rag/models

# Re-download when ready
make download-models
```

"No config file found" -- Run `rag init`, or copy `config.example.toml` to `~/.config/local-rag/config.toml` and edit it.
Qdrant connection refused -- Ensure Qdrant is running: `docker compose up -d`
Model download fails -- Check your internet connection, then clear the cache and retry:

```bash
rm -rf ~/.cache/local-rag/models/
rag index # re-downloads everything
```

Search returns no results -- Check `rag status`. If zero documents are indexed, run `rag index`. If documents are indexed, try `rag search "query" --debug` to see what's happening at each stage.
MCP not working in Claude Desktop -- Run `rag mcp-config --print` and verify that the Python path points to your venv. Restart Claude Desktop after config changes.