Intelligent Document RAG with exact citation extraction.
Upload PDFs, DOCX, PPTX, or images → ask questions → get answers with precise citations linked to source pages.
- Chat with exact citations linked to source pages — click a citation to highlight the passage in the PDF viewer
- Citation bounding boxes rendered directly on the PDF page for precise source verification
- Interactive knowledge graph — UMAP projection of document chunks with clustering and similarity-based edges
- Admin panel — user management with role-based access control (viewer, editor, admin)
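To make the click-to-highlight behavior concrete, here is a minimal Python sketch of what a citation payload could carry. The field names (`quote`, `page`, `bbox`) and the normalized-coordinate convention are illustrative assumptions, not DocVault's actual schema.

```python
from dataclasses import dataclass

# Hypothetical citation payload; field names are illustrative only.
@dataclass
class Citation:
    quote: str   # verbatim passage from the source chunk
    page: int    # 1-based page number in the source PDF
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1), normalized 0..1

    def to_pixels(self, page_w: float, page_h: float) -> tuple[float, float, float, float]:
        """Scale the normalized box to the rendered page size for highlighting."""
        x0, y0, x1, y1 = self.bbox
        return (x0 * page_w, y0 * page_h, x1 * page_w, y1 * page_h)

c = Citation(quote="Revenue grew 12%", page=3, bbox=(0.1, 0.2, 0.9, 0.25))
print(c.to_pixels(612, 792))  # scaled to a US Letter page in PDF points
```

Normalized coordinates keep the payload independent of the viewer's zoom level; the frontend only needs the rendered page size to draw the highlight.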
| Area | Scale |
|---|---|
| Backend | 14,300+ lines of Python across 50+ modules |
| Frontend | 7,200+ lines of TypeScript / React |
| Tests | 70 test files (55 backend + 15 frontend) with 15,100+ lines |
| LLM Prompts | 15 prompt templates (none hardcoded in source) |
| Documentation | 2,400+ lines across guides, ADRs, and API reference |
| Infrastructure | 9 Docker configs, 4 deployment modes, 3 Grafana dashboards |
This project was built using AI-assisted development — a spec-driven workflow where a human architect defines the system design and AI agents implement it under review.
How it works:
- Human defines specs — each of the 50 phases has a detailed specification in `.ralph/specs/` covering requirements, architecture decisions, testing criteria, and rollout order
- Agents implement — AI coding agents (Ralph for initial phases, Claude Code for refinement and multi-agent workflows) read the spec and implement code, tests, and documentation
- Human reviews and iterates — Every phase goes through review for correctness, security, and architectural consistency before being marked complete
Agent orchestration artifacts:
- `.ralph/` — Phase specs and development roadmap (50 phases, all completed)
- `.claude/agents/` — Custom specialized subagents (RAG reviewer, security auditor, API consistency checker, test writer)
- `.claude/rules/` — Domain-specific conventions enforced across agent sessions
- `.claude/skills/` — Reusable slash commands for deployment, testing, evaluation, and backups
What this demonstrates:
- Ability to decompose a complex system into 50 well-scoped, sequential phases — each producing a working, testable artifact
- Technical judgment — the human decides architecture (async-first, LiteLLM abstraction, embedding provider protocol, ML service extraction), the agent executes
- Multi-agent orchestration — parallel specialized agents (security audit, RAG review, API consistency, test writing) coordinating on the same codebase via isolated worktrees
| Feature | Summary | Details |
|---|---|---|
| Exact Citations | RAG with source mapping | Answers include verbatim quoted passages with page numbers and bounding box coordinates, rendered as highlights in the PDF viewer |
| Multi-Modal | Figures + vision | Extracts figures from documents and describes them via Ollama vision models (Gemma 3), indexed as searchable chunks |
| Agentic Search | Deep multi-step retrieval | Decomposes complex queries into sub-questions, iterates with self-verification for multi-hop reasoning |
| Knowledge Graph | Visual exploration | Interactive UMAP-based visualization of document chunk relationships with clustering and similarity filtering |
| Semantic Cache | Smart deduplication | Per-user query caching with cosine similarity matching and automatic invalidation on new uploads |
| Multi-Format | PDF, DOCX, PPTX, images | Format-specific parsing with OCR for scanned documents and searchable PDF generation |
| Streaming Chat | Real-time responses | SSE-based streaming with markdown rendering, session management, and export (Markdown/PDF) |
| Guardrails | Safety + quality | Hallucination detection (NLI grounding), prompt injection defense, confidence scoring, input sanitization |
| Air-Gapped | Fully offline | Works entirely offline with local models via Ollama — zero external API calls |
| Observability | Full-stack monitoring | Prometheus + Grafana dashboards, Langfuse LLM tracing, cost tracking, and quality metrics |
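The semantic cache row above can be sketched in a few lines of plain Python. Only the idea (cosine similarity matching of query embeddings) and the 0.92 default from `DOCVAULT_CACHE_THRESHOLD` come from this README; the entry shape and function names are illustrative assumptions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cache_lookup(query_vec: list[float], cache: list[dict], threshold: float = 0.92):
    """Return the cached answer for the closest prior query, if similar enough."""
    best = max(cache, key=lambda e: cosine_similarity(query_vec, e["vec"]), default=None)
    if best is not None and cosine_similarity(query_vec, best["vec"]) >= threshold:
        return best["answer"]
    return None  # cache miss: fall through to the full RAG pipeline

cache = [{"vec": [1.0, 0.0], "answer": "cached answer"}]
print(cache_lookup([0.99, 0.14], cache))  # near-duplicate query: hits the cache
print(cache_lookup([0.0, 1.0], cache))    # orthogonal query: cache miss
```

In the real system the cache is per-user and entries are invalidated when new documents are uploaded, since new content can change the correct answer.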
```mermaid
graph TB
    subgraph Frontend
        UI[React / TypeScript / Vite]
    end
    subgraph Edge
        Caddy[Caddy — TLS + Reverse Proxy]
    end
    subgraph Backend
        API[FastAPI]
        RAG[RAG Engine + Citations]
        Agent[Agentic Retrieval]
        Guard[Guardrails Layer]
        Ingest[Ingestion Pipeline]
    end
    subgraph Storage
        PG[(PostgreSQL)]
        QD[(Qdrant)]
        FS[File Storage]
    end
    subgraph ML["ML Service (optional)"]
        Embed[Local Embeddings]
        OCR[docTR Neural OCR]
        Rerank[Cross-Encoder Reranking]
    end
    subgraph Vision["Vision (optional)"]
        Ollama[Ollama + Gemma 3]
    end
    subgraph Monitoring["Monitoring (optional)"]
        Prom[Prometheus]
        Graf[Grafana]
        Lang[Langfuse V3]
    end
    UI --> Caddy --> API
    API --> RAG --> QD
    API --> Agent --> RAG
    API --> Guard
    API --> Ingest --> PG
    Ingest --> FS
    Ingest --> ML
    Ingest --> Ollama
    RAG --> ML
    API --> Prom --> Graf
    API --> Lang
```
Pipeline: Upload document → detect type (digital/scanned) → parse/OCR → chunk with position metadata → embed → store in Qdrant → query → retrieve → generate answer with citations → verify & validate → stream response with highlights
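The ingestion half of that pipeline can be sketched as composed steps. This is a hedged toy version: `parse_or_ocr` and `embed` are stand-ins for the real PyMuPDF/OCR parsing and LiteLLM or ML-service embeddings, and the fixed-size chunker is far simpler than DocVault's actual chunking.

```python
def ingest(path: str) -> list[dict]:
    """Toy ingestion: parse pages, chunk with position metadata, embed each chunk."""
    pages = parse_or_ocr(path)  # digital parse, with OCR fallback for scans
    chunks = []
    for page_no, text in enumerate(pages, start=1):
        for start in range(0, len(text), 500):  # naive fixed-size chunking
            piece = text[start:start + 500]
            chunks.append({
                "text": piece,
                "page": page_no,        # position metadata that makes citations possible
                "offset": start,
                "vector": embed(piece), # stored in Qdrant in the real system
            })
    return chunks

def parse_or_ocr(path: str) -> list[str]:
    # Stand-in: pretend every document is one digital page of text.
    return ["This is a tiny example document used to illustrate chunking."]

def embed(text: str) -> list[float]:
    # Stand-in embedding: character frequencies, NOT a real embedding model.
    return [text.count(c) / max(len(text), 1) for c in "etaoin"]

chunks = ingest("example.pdf")
print(len(chunks), chunks[0]["page"])
```

Carrying `page` and `offset` through chunking is the key design point: it is what lets the query half of the pipeline map a retrieved chunk back to an exact highlight in the source.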
- Python 3.12+, Node.js 22+, Docker
- An LLM API key (Gemini, Anthropic, or OpenAI) — or Ollama for local models
```bash
git clone <repo-url> && cd docvault
cp .env.dev .env
# Edit .env with your API key (see Configuration section below)

make install           # Backend (Python via uv)
make frontend-install  # Frontend (Node via pnpm)

make dev-up            # Qdrant + PostgreSQL + backend + frontend
```

Open http://localhost:5173 — upload a document and start asking questions.
| Mode | Command | What it runs |
|---|---|---|
| Dev (lightweight) | `make dev-up` | Qdrant + PostgreSQL + native backend/frontend |
| Dev (full-featured) | `make dev-full` | Above + ML service + Ollama vision |
| Production | `make prod` | Full Docker stack with TLS, monitoring, ML, vision |
| Air-gapped | `cp .env.airgapped .env && make prod` | Zero external calls — all inference local |
```bash
# Stop commands
make dev-down       # Stop lightweight dev
make dev-full-down  # Stop full-featured dev
make prod-down      # Stop production
make status         # Show all running services
```

All configuration is via environment variables. Copy `.env.dev` (development) or `.env.production` (production) to `.env`.
```bash
# Gemini (default)
DOCVAULT_LLM_MODEL=gemini/gemini-2.0-flash
GEMINI_API_KEY=your-key

# Claude
DOCVAULT_LLM_MODEL=claude-sonnet-4-20250514
ANTHROPIC_API_KEY=your-key

# OpenAI
DOCVAULT_LLM_MODEL=gpt-4o
OPENAI_API_KEY=your-key

# Local via Ollama (no API key needed)
DOCVAULT_LLM_MODEL=ollama/llama3.1:8b
DOCVAULT_LLM_BASE_URL=http://localhost:11434
```

| Variable | Default | Description |
|---|---|---|
| **LLM** | | |
| `DOCVAULT_LLM_MODEL` | `gemini/gemini-2.0-flash` | LiteLLM model identifier |
| `DOCVAULT_LLM_BASE_URL` | — | Base URL for local models (e.g., Ollama) |
| **Embeddings** | | |
| `DOCVAULT_EMBEDDING_PROVIDER` | `api` | `api` (LiteLLM) or `remote` (ML service) |
| `DOCVAULT_EMBEDDING_MODEL` | `nomic-ai/nomic-embed-text-v1.5` | Embedding model name |
| `DOCVAULT_ML_SERVICE_URL` | — | ML service URL (required when provider=`remote`) |
| `DOCVAULT_ML_SHARED_VOLUME` | `false` | Use shared Docker volume for file transfer |
| `DOCVAULT_EMBEDDING_BATCH_SIZE` | `32` | Batch size for embedding requests |
| `DOCVAULT_EMBEDDING_CONCURRENCY` | `3` | Max concurrent embedding requests |
| **Search** | | |
| `DOCVAULT_SEARCH_MODE` | `hybrid` | `semantic`, `bm25`, or `hybrid` |
| `DOCVAULT_HYBRID_SEMANTIC_WEIGHT` | `0.7` | Weight for semantic vs BM25 in hybrid mode |
| `DOCVAULT_CONFIDENCE_THRESHOLD` | `0.3` | Minimum confidence for search results |
| **OCR & Vision** | | |
| `DOCVAULT_OCR_BACKEND` | `tesseract` | `tesseract` or `docling` (ML service) |
| `DOCVAULT_OCR_CONCURRENCY` | `2` | Max concurrent OCR operations |
| `DOCVAULT_VISION_URL` | — | Ollama URL for vision (e.g., `http://localhost:11434`) |
| `DOCVAULT_VISION_MODEL` | `gemma3:4b` | Vision model for figure description |
| **Cache** | | |
| `DOCVAULT_CACHE_ENABLED` | `true` | Enable semantic query cache |
| `DOCVAULT_CACHE_THRESHOLD` | `0.92` | Cosine similarity threshold for cache hits |
| `DOCVAULT_CACHE_TTL_HOURS` | `24` | Cache entry time-to-live |
| **Security** | | |
| `DOCVAULT_JWT_SECRET` | `dev-secret-...` | JWT signing key — change in production |
| `DOCVAULT_JWT_EXPIRY_SECONDS` | `3600` | JWT token lifetime |
| `DOCVAULT_MAX_UPLOAD_SIZE_MB` | `50` | Maximum upload file size |
| `DOCVAULT_CORS_ORIGINS` | `*` | Allowed CORS origins |
| **Infrastructure** | | |
| `DOCVAULT_DATABASE_URL` | `postgresql://...localhost:5432/docvault` | PostgreSQL connection URL |
| `QDRANT_URL` | `http://localhost:6333` | Qdrant vector database URL |
| `DOCVAULT_UPLOAD_DIR` | `./data/uploads` | File storage directory |
| `DOCVAULT_LOG_LEVEL` | `INFO` | Logging level |
| `DOCVAULT_DEBUG` | `true` | Enable Swagger docs (disable in production) |
| **Monitoring** | | |
| `LANGFUSE_HOST` | — | Langfuse URL (enables LLM tracing) |
| `DOCVAULT_COST_ALERT_DAILY_USD` | `5.00` | Daily LLM cost alert threshold |
| **Concurrency** | | |
| `DOCVAULT_INGESTION_WORKERS` | `2` | Parallel ingestion workers |
| `DOCVAULT_FIGURE_CONCURRENCY` | `3` | Max concurrent figure processing |
| `DOCVAULT_RATE_LIMIT_QUERY` | `30/minute` | Query endpoint rate limit |
| `DOCVAULT_RATE_LIMIT_UPLOAD` | `10/minute` | Upload endpoint rate limit |
| `DOCVAULT_RATE_LIMIT_AUTH` | `5/minute` | Auth endpoint rate limit |
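To illustrate what `DOCVAULT_HYBRID_SEMANTIC_WEIGHT=0.7` means in the hybrid search mode, a simple weighted-sum fusion looks like this. Whether DocVault fuses raw scores or ranks, and how it normalizes BM25 scores, is not specified in this README, so treat this as a sketch of the general technique.

```python
def hybrid_score(semantic: float, bm25: float, semantic_weight: float = 0.7) -> float:
    """Blend a semantic similarity score with a normalized BM25 score.

    Both inputs are assumed to be on a comparable 0..1 scale; raw BM25
    scores are unbounded and would need normalization first.
    """
    return semantic_weight * semantic + (1.0 - semantic_weight) * bm25

# A chunk that matches the query's meaning well but shares few exact terms
# still scores high, because the semantic signal dominates at weight 0.7:
print(hybrid_score(semantic=0.9, bm25=0.2))
```

Raising the weight toward 1.0 makes retrieval behave like pure semantic search; lowering it favors exact keyword matches, which helps for identifiers, part numbers, and rare terms.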
| Component | Technology |
|---|---|
| Backend | Python 3.12, FastAPI, async/await |
| LLM Gateway | LiteLLM (Gemini, Claude, GPT, Ollama, etc.) |
| Embeddings | LiteLLM API or local via ML service |
| Vector DB | Qdrant |
| Database | PostgreSQL 16 (asyncpg) |
| Document Parsing | PyMuPDF, Docling (DOCX/PPTX/images), OCR |
| Frontend | React 19, TypeScript (strict), Vite, TailwindCSS |
| Monitoring | Prometheus, Grafana, Langfuse |
| Vision | Ollama + Gemma 3 4B |
| ML Service | docTR OCR, cross-encoder reranking, local embeddings |
| Reverse Proxy | Caddy (SPA routing, API proxy, TLS) |
| Testing | pytest + pytest-asyncio, Vitest + RTL |
| Linting | ruff + mypy strict, ESLint + Prettier |
| Package Managers | uv (backend), pnpm (frontend) |
All endpoints are served under `/api`. See `docs/api-reference.md` for the full reference.
| Endpoint | Method | Description |
|---|---|---|
| `/api/health` | GET | Application status and configuration |
| `/api/documents/upload` | POST | Upload a document (PDF, DOCX, PPTX, image) |
| `/api/documents` | GET | List documents with pagination |
| `/api/documents/{id}` | DELETE | Delete a document and its chunks |
| `/api/query` | POST | Ask a question (basic or agentic mode) |
| `/api/query/stream` | POST | Streaming query via SSE |
| `/api/sessions` | GET/POST | Chat session management |
| `/api/sessions/{id}` | GET/PATCH/DELETE | Session CRUD + title update |
| `/api/sessions/{id}/generate-title` | POST | LLM-generated session title |
| `/api/sessions/{id}/export` | GET | Export session as Markdown or PDF |
| `/api/feedback` | POST | Submit thumbs up/down feedback |
| `/api/graph` | GET | Knowledge graph data |
| `/api/auth/login` | POST | JWT authentication |
| `/api/observability/traces` | GET | LLM trace listing |
| `/api/observability/metrics` | GET | Aggregated metrics |
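As a hedged example of calling the query endpoint from Python with only the standard library: the JSON field names (`question`, `mode`), the backend port, and the token placeholder are assumptions for illustration, not the documented schema (see `docs/api-reference.md` for the real one).

```python
import json
import urllib.request

def build_query_request(base_url: str, token: str, question: str,
                        agentic: bool = False) -> urllib.request.Request:
    """Construct a POST to /api/query; field names here are hypothetical."""
    payload = {"question": question, "mode": "agentic" if agentic else "basic"}
    return urllib.request.Request(
        f"{base_url}/api/query",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # JWT obtained from /api/auth/login
        },
        method="POST",
    )

req = build_query_request("http://localhost:8000", "your-jwt-token",
                          "What does section 3 say about revenue?")
print(req.full_url, req.get_method())
# Sending it would be: urllib.request.urlopen(req) against a running backend.
```

For streaming answers the same payload would go to `/api/query/stream` and the client would read the response incrementally as server-sent events instead of a single JSON body.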
docvault/
├── backend/src/docvault/ — FastAPI app
│ ├── api/ Routes and middleware
│ ├── core/ Config, LLM client, prompts, database, migrations
│ ├── ingestion/ Parsing, OCR, chunking, embedding, vector store
│ ├── rag/ Retrieval, generation, citation extraction
│ ├── agent/ Agentic multi-step retrieval
│ ├── guardrails/ Sanitization, validation, injection/hallucination detection
│ ├── chat/ Session and message storage
│ ├── auth/ JWT + API key auth, RBAC
│ ├── feedback/ User feedback storage
│ └── prompts/ All LLM prompts as .md files
├── backend/tests/ pytest test suite
├── frontend/src/ React SPA
│ ├── components/ UI components
│ ├── hooks/ Custom React hooks
│ ├── services/ Typed API client
│ └── types/ TypeScript interfaces
├── ml-service/ Optional ML service (embeddings, OCR, reranking)
├── docker/ Dockerfiles and Caddyfiles
├── monitoring/ Grafana dashboards and Prometheus config
├── docs/ Guides, API reference, ADRs
├── scripts/ Backup, migration, utility scripts
├── .ralph/ Phase specs and fix plan
└── Makefile All commands
```bash
# Development
make dev-up            # Start infra + backend + frontend
make dev-down          # Stop everything
make dev-full          # Full-featured: + ML service + vision
make dev-full-down     # Stop full-featured dev

# Production
make prod              # Full Docker stack with TLS + monitoring
make prod-down         # Stop production

# Status
make status            # Show all running services

# Backend
make install           # Install Python deps (uv)
make test              # Run pytest (testcontainers PostgreSQL)
make lint              # Run ruff + mypy strict

# Frontend
make frontend-install  # Install Node deps (pnpm)
make frontend-test     # Run Vitest
make frontend-lint     # Run ESLint + Prettier

# Evaluation
make eval              # Run RAG evaluation pipeline
make eval-compare      # Compare eval configurations

# Add-ons
make monitoring-up     # Prometheus + Grafana + Langfuse
make vision-up         # Ollama + Gemma 3 (GPU required)

# Utilities
make backup            # Backup PostgreSQL + Qdrant + files
make restore BACKUP=path  # Restore from backup
make down-all          # Stop all containers
make clean             # Remove caches and build artifacts
```

Backend tests use pytest with async support against a real PostgreSQL via testcontainers:
```bash
make test           # Backend
make frontend-test  # Frontend (Vitest + React Testing Library)
```

```bash
make lint           # ruff check + ruff format --check + mypy --strict
make frontend-lint  # eslint + prettier --check
```

```bash
make monitoring-up  # Start Prometheus + Grafana + Langfuse
```

- Prometheus — http://localhost:9090
- Grafana — http://localhost:3001 (admin / docvault)
Pre-built dashboards: Overview (latency, cost, throughput), LLM Usage & Cost, RAG Quality & Guardrails.
Detailed guides in docs/:
- Architecture Overview — System design and component interactions
- ML Service — GPU service for OCR, embeddings, reranking
- Quick Start — Clone-to-running guide
- Configuration — Complete env var reference
- API Reference — All endpoints with examples
- Troubleshooting — Common issues and fixes
- Performance Tuning — Optimization guide
- Feature Guides: Agentic Mode · Knowledge Graph · Semantic Cache · Multi-Modal · Feedback · Sharing
- ADRs: LiteLLM · Qdrant · PostgreSQL · Prompts as Files · Caddy · Async-First · ML Service · Embedding Providers
See SECURITY.md for vulnerability reporting, credential management, and production hardening checklist.
See CONTRIBUTING.md for development setup, coding conventions, and PR guidelines.