A document management platform integrating a multi-stage retrieval pipeline with domain-specific guardrails.
Standard RAG tutorials often skip real-world edge cases like domain-specific rules (e.g., enforcing core data structure invariants such as Stack = LIFO and Queue = FIFO).
Goal: Build a RAG system that goes beyond simple vector retrieval by enforcing explicit domain invariants and negative constraints. This project implements a multi-stage pipeline to experiment with how much "guardrailing" is needed to reduce hallucinations in technical domains.
Key Design Decisions:
- Hybrid retrieval to balance semantic recall and exact-match precision
- Explicit post-generation validation to enforce domain invariants
- Hard-coded constraints to study failure modes before generalization
```
┌────────────────────────────────────────────────────────────────────┐
│                              FRONTEND                              │
│           [Auth UI]   [File Manager]   [Chat Interface]            │
└─────────────┬───────────────────────────────────────┬──────────────┘
              │ File Upload                           │ Query
┌─────────────▼───────────────────────────────────────▼──────────────┐
│                       API GATEWAY (FastAPI)                        │
└──────┬───────────────────────┬──────────────────────┬──────────────┘
       │                       │                      │
┌──────▼───────┐        ┌──────▼───────┐       ┌──────▼──────────────┐
│  Ingestion   │        │  Retrieval   │       │   Orchestration     │
│   Service    │        │   Engine     │       │      Layer          │
└──────┬───────┘        └──────┬───────┘       └──────┬──────────────┘
       │                       │                      │
┌──────▼───────────────────────▼──────────────────────▼──────────────┐
│                         PERSISTENCE LAYER                          │
│  [MinIO (Blob)]   [ChromaDB (Vector)]   [PostgreSQL (Relational)] │
└────────────────────────────────────────────────────────────────────┘
```
This system is designed to address common RAG failure modes:
Mechanism: Combines Dense (Vector) retrieval with Sparse (BM25) keyword search using Reciprocal Rank Fusion (RRF). Why: Addresses "keyword blindness" where vector models miss specific acronyms (e.g., "TCP") or exact identifiers.
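For illustration, here is a minimal RRF sketch; the function name, the in-memory ranked lists, and the conventional k = 60 smoothing constant are assumptions of this sketch, not code from this repo.

```python
# Minimal Reciprocal Rank Fusion (RRF) sketch. The helper name and the
# conventional k=60 constant are illustrative, not taken from this repo.
def rrf_fuse(dense_ranking: list[str], sparse_ranking: list[str], k: int = 60) -> list[str]:
    """Fuse two ranked lists of document IDs into a single ranking.

    Each document scores sum(1 / (k + rank)) over every list it appears in,
    so items ranked highly by either retriever float to the top.
    """
    scores: dict[str, float] = {}
    for ranking in (dense_ranking, sparse_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: "tcp.md" is found only by BM25, yet still outranks weak dense hits.
dense = ["intro.md", "vectors.md", "nets.md"]
sparse = ["tcp.md", "nets.md", "intro.md"]
print(rrf_fuse(dense, sparse))
```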
Mechanism: First stage uses a fast Bi-Encoder for candidate generation. Second stage uses a Cross-Encoder (ms-marco-MiniLM-L-6-v2) for high-precision re-ranking. Why: Optimizes the tradeoff between retrieval latency and context precision in local, single-node evaluation.
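A minimal sketch of the second stage using the sentence-transformers CrossEncoder; the model name matches the one above, while the `candidates` list and function shape are illustrative stand-ins for the bi-encoder stage's output.

```python
# Two-stage retrieval sketch using sentence-transformers
# (pip install sentence-transformers). The bi-encoder stage would normally
# be a vector-DB query; here `candidates` stands in for its output.
from sentence_transformers import CrossEncoder

# Stage 2 model: scores (query, passage) pairs jointly for high precision.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 3) -> list[str]:
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```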
Mechanism: Implements Hypothetical Document Embeddings (HyDE) and Multi-Query Expansion; HyDE generates a hypothetical answer whose embedding is used only for retrieval. Why: Heuristic approach to bridging the semantic gap between short queries and detailed document passages.
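A minimal HyDE sketch, assuming a generic `generate` LLM callable and an all-MiniLM-L6-v2 embedder; both are illustrative stand-ins, not this repo's actual components.

```python
# HyDE sketch: embed a hypothetical *answer* instead of the raw query.
# `generate` stands in for any LLM call and is an assumption of this sketch.
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def hyde_query_vector(question: str, generate) -> list[float]:
    # Ask the LLM to draft a plausible (possibly wrong) answer passage...
    hypothetical = generate(f"Write a short passage answering: {question}")
    # ...and use its embedding, which sits closer to real document passages
    # than the short question does. The fake text itself is then discarded.
    return embedder.encode(hypothetical).tolist()
```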
The core contribution is a multi-stage pipeline (implemented as seven explicit stages) designed to reduce hallucination risk by refusing to answer when context is insufficient.
Approach: A regex-based classifier detects content type (EXAM, RESEARCH, LEGAL) to reduce structural drift (a sketch follows below).
- Limitation: Dependent on header keywords; not robust to OCR errors or ambiguous documents.
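A minimal sketch of what such a header-keyword classifier might look like; the regex patterns and the 2,000-character window are invented for illustration, and they show exactly where the brittleness noted above comes from.

```python
# Header-keyword classifier sketch. The patterns are illustrative guesses;
# OCR noise or ambiguous headers break this, per the limitation above.
import re

DOC_TYPE_PATTERNS = {
    "EXAM": re.compile(r"\b(final exam|midterm|question \d+|total marks)\b", re.I),
    "RESEARCH": re.compile(r"\b(abstract|related work|references)\b", re.I),
    "LEGAL": re.compile(r"\b(whereas|hereinafter|governing law)\b", re.I),
}

def classify_document(text: str) -> str:
    head = text[:2000]  # headers usually appear early in the document
    for doc_type, pattern in DOC_TYPE_PATTERNS.items():
        if pattern.search(head):
            return doc_type
    return "GENERIC"
```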
Solution: A semantic routing layer allocates compute dynamically via user-intent classification (SUMMARIZE, ANSWER_QUESTION, COMPARE).
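One plausible way to implement such a router is nearest-prototype matching over embeddings; the prototype phrases and model below are assumptions of this sketch, not this repo's code.

```python
# Semantic intent-routing sketch: compare the query embedding against one
# prototype phrase per intent. Prototypes and model choice are assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
INTENT_PROTOTYPES = {
    "SUMMARIZE": "Give me an overview of this document",
    "ANSWER_QUESTION": "What does the document say about a specific topic?",
    "COMPARE": "What are the differences between these two things?",
}
intent_names = list(INTENT_PROTOTYPES)
prototype_vecs = model.encode(list(INTENT_PROTOTYPES.values()))

def route_intent(query: str) -> str:
    sims = util.cos_sim(model.encode(query), prototype_vecs)[0]
    return intent_names[int(sims.argmax())]
```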
Solution: A simple rule engine checks answers against known facts:
- Data Structures: Verifies definitions (e.g., a Queue must be FIFO).
- Medical/Legal: Refuses to answer if safety keywords are triggered.
Solution: A pre-generation gate rejects retrieval sets whose relevance scores fall below 0.3, refusing to answer when context is insufficient.
Solution: A post-generation pass checks the output against the active Domain Rules using the Answer Validator; if a rule is violated (e.g., "Queue is LIFO"), the answer is discarded. Both checks are sketched below.
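A minimal sketch of both checks; the 0.3 threshold and the Queue/FIFO example come from this document, while the rule-table shape and function names are illustrative assumptions.

```python
# Sketch of the pre-generation gate and post-generation validator described
# above. The 0.3 threshold and the Queue/FIFO rule come from this README;
# the hard-coded rule table and function names are illustrative assumptions.
RELEVANCE_THRESHOLD = 0.3

# Known-false claims per domain; an answer containing one is discarded.
FORBIDDEN_CLAIMS = {
    "data_structures": ["queue is lifo", "stack is fifo"],
}

def passes_gate(retrieval_scores: list[float]) -> bool:
    # Refuse to generate at all if nothing retrieved is relevant enough.
    return bool(retrieval_scores) and max(retrieval_scores) >= RELEVANCE_THRESHOLD

def validate_answer(answer: str, domain: str) -> bool:
    # Discard any answer that states a known-false claim for the domain.
    lowered = answer.lower()
    return not any(claim in lowered for claim in FORBIDDEN_CLAIMS.get(domain, []))
```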
Instead of generic text extraction, the system uses a rules-based layout parser (regex) that attempts to detect multi-column layouts and sections.
- Limitation: Relies on consistent formatting standard (e.g., two-column IEEE style); fails on non-standard PDFs.
Implements "Small to Big" retrieval: retrieves small chunks for vector precision, but feeds the parent context window to the LLM for coherent reasoning.
The system includes a custom Ablation Engine (ablation.py) to measure each component's impact.
- Experimentation: Components can be toggled (e.g., `disable_reranker=True`) to measure impact on Precision@K and latency; a sketch follows this list.
- Metrics: Tracks P95/P99 latency, token usage, and faithfulness.
- Result: Quantifies the value of the re-ranker (a 15% precision lift) against its latency cost.
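A minimal sketch of the toggle-and-measure loop; the config fields mirror the `disable_reranker` flag above, but the pipeline hook and metric bookkeeping are assumptions (the real engine lives in ablation.py).

```python
# Ablation sketch: run the same query set with a component disabled and
# compare metrics. Config fields and the pipeline callable are assumptions.
import time
from dataclasses import dataclass

@dataclass
class PipelineConfig:
    disable_reranker: bool = False
    disable_hyde: bool = False

def run_ablation(pipeline, queries: list[str], config: PipelineConfig) -> dict:
    if not queries:
        raise ValueError("need at least one query")
    latencies = []
    for query in queries:
        start = time.perf_counter()
        pipeline(query, config)  # the pipeline honors the toggles in `config`
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    return {"p95_latency_s": p95, "n": len(queries)}
```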
- Current Design: Ingestion is asynchronous (Celery) with exponential backoff (`@retry_operation`, sketched below) to handle transient distributed failures.
- Tradeoff: Querying is synchronous for user experience, which limits the complexity of the validation chain.
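A sketch of what a backoff decorator in the spirit of `@retry_operation` could look like; the name matches the doc, but this body is an assumption, not the repo's implementation.

```python
# Exponential-backoff retry sketch. Delays grow as base_delay * 2**attempt,
# plus a small random jitter to avoid synchronized retries across workers.
import functools
import random
import time

def retry_operation(max_attempts: int = 5, base_delay: float = 0.5):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts - 1:
                        raise  # exhausted: surface the original error
                    time.sleep(base_delay * (2 ** attempt) + random.random() * 0.1)
        return wrapper
    return decorator
```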
- Current Design: Implements L1 (Memory) + L2 (SQLite) cache strategy with canonical key generation.
- Tradeoff: Cache-coherence complexity increases in exchange for safeguarding expensive LLM/embedding calls; a sketch follows.
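A minimal sketch of the L1/L2 pattern with canonical key generation; the SQLite schema and key recipe are assumptions of this sketch.

```python
# Two-tier cache sketch: an in-memory dict (L1) in front of SQLite (L2),
# keyed by a canonical hash of the call. Schema and key recipe are assumed.
import hashlib
import json
import sqlite3

class TwoTierCache:
    def __init__(self, path: str = "cache.db"):
        self.l1: dict[str, str] = {}
        self.db = sqlite3.connect(path)
        self.db.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)")

    @staticmethod
    def canonical_key(payload: dict) -> str:
        # Sorted-key JSON makes logically equal calls hash identically.
        return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

    def get(self, key: str) -> str | None:
        if key in self.l1:
            return self.l1[key]
        row = self.db.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
        if row:
            self.l1[key] = row[0]  # promote an L2 hit into L1
            return row[0]
        return None

    def set(self, key: str, value: str) -> None:
        self.l1[key] = value
        self.db.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)", (key, value))
        self.db.commit()
```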
- Limitation: The Domain Rule Engine is currently a hard-coded dictionary.
- Scalability Issue: A production evolution would move these rules to a database or a dedicated rule-engine service (e.g., OPA).
| Component | Choice | Rationale |
|---|---|---|
| Backend | FastAPI | Native async support allows handling 20-100+ concurrent connections per worker. |
| Vector DB | ChromaDB | Embedded mode simplifies deployment; supports metadata filtering. |
| LLM | Groq (Llama 3) | Chosen for lower inference latency relative to comparable hosted LLMs. |
| Observability | Custom Metrics | Tracks P95/P99 latency and token usage to persistent store. |
Prerequisites: Docker, Docker Compose, Groq API Key.
```bash
# 1. Configuration
cp .env.example .env
# Set GROQ_API_KEY in .env

# 2. Deployment
docker-compose up --build -d

# 3. Verification
# Backend Health
curl localhost:8000/health
```

- This project intentionally focuses on correctness, validation, and observability in RAG systems.
- It does not attempt to optimize for large scale distributed deployment or regulated production use cases.
- Those concerns are treated as follow-up design questions rather than implementation goals.
- These exclusions are intentional to keep failure modes observable.
This project was built to understand the limits of retrieval-augmented generation systems under production-inspired constraints. It focuses on system design tradeoffs (consistency vs. latency) rather than prompt engineering alone.