A Local, Privacy-Preserving RAG System Designed for Real-World Documents
This project aims to overcome a major limitation in most Retrieval-Augmented Generation (RAG) systems:
RAG systems do not understand document structure — they only process flat text.
PDFs in the real world contain:
- Different formatting styles
- Missing/irregular table of contents
- Long paragraphs without headings
- Page-level context where meaning depends on structure
Our goal was to build a robust structure-aware RAG pipeline that understands PDFs using:
✔ TOC detection (Level-1 / Level-2 / No TOC)
✔ Chapter + section metadata extraction
✔ Paragraph-based chunking
✔ Keyword extraction using RAKE
✔ Local embeddings using Ollama
✔ Hybrid retrieval — metadata + similarity search
✔ Latency measurement for each stage
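As a rough sketch of the TOC-detection idea above, the three cases (Level-1 / Level-2 / No TOC) can be told apart from PyMuPDF's `doc.get_toc()` output. The function name `classify_toc` and the rule itself are illustrative, not the project's exact heuristic:

```python
def classify_toc(entries):
    """Classify a PDF table of contents.

    `entries` mirrors PyMuPDF's doc.get_toc() output:
    [[level, title, page], ...]. The rule below is illustrative.
    """
    if not entries:
        return "no_toc"  # fall back to heading-free, paragraph-only chunking
    max_level = max(level for level, _title, _page in entries)
    return "level1" if max_level == 1 else "level2"


# Hand-made TOC, so the example runs without a PDF:
toc = [[1, "Chapter 1", 1], [2, "1.1 Intro", 2], [1, "Chapter 2", 9]]
print(classify_toc(toc))  # -> level2
print(classify_toc([]))   # -> no_toc
```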
What is achieved now:
✔ PDF parsing with structural awareness
✔ Metadata + keyword-rich embeddings stored in ChromaDB
✔ Working CLI retrieval
✔ A functional API (api/app.py) for querying via FastAPI
✔ Fully local execution — no cloud APIs used
This is a working prototype, tested on CPU, built within a limited time.
More testing and improvements are planned in future updates.
| Module | Purpose |
|---|---|
| PyMuPDF (fitz) | PDF parsing, page extraction, TOC reading |
| RAKE-NLTK | Keyword extraction for boosting retrieval relevance |
| LangChain | Framework to connect vector DB + LLM + retriever |
| ChromaDB | Local vector database to store embeddings & metadata |
| Ollama | Local LLM inference — no API key, fully offline |
| FastAPI | API layer to interact with the RAG system |
| Pydantic | Configuration management (config/settings.py) |
| Custom Loggers | Tracks parsing, embeddings, query latency |
| NumPy / Pandas | Data handling for chunks & processed CSVs |
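The per-stage latency tracking handled by the custom loggers can be sketched with a small context manager. `timed` is a hypothetical helper for illustration, not the project's logger API:

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(stage, log=print):
    """Log the wall-clock latency of one pipeline stage (illustrative helper)."""
    start = time.perf_counter()
    try:
        yield
    finally:
        log(f"{stage}: {time.perf_counter() - start:.3f}s")


with timed("retrieval"):
    time.sleep(0.01)  # stand-in for the real retrieval call
# prints something like "retrieval: 0.010s"
```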
```
conda env create -f environment.yml
conda activate hybrid_bot
pip install -r requirements.txt
```

Then download & install Ollama from:
🔗 https://ollama.ai/download

```
ollama pull mxbai-embed-large   # Embedding model
ollama pull llama3.2:3b         # LLM for generation
```

Before retrieval is possible, we must:
✔ Parse PDF + detect structure
✔ Split into meaningful chunks
✔ Extract metadata & keywords
✔ Store in CSV & JSON formats
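The chunking step above can be sketched as plain paragraph packing. The `max_chars` limit and the blank-line splitting rule are illustrative; the project's real logic lives in `ingestion/`:

```python
def chunk_paragraphs(text, max_chars=800):
    """Split text on blank lines, then pack paragraphs into chunks of at
    most `max_chars` (a hypothetical limit). A single paragraph longer
    than the limit is kept whole rather than split mid-sentence."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph.\n\nSecond paragraph.\n\n" + "x" * 900
print([len(c) for c in chunk_paragraphs(sample)])
```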
Once chunks are extracted, we:
✔ Embed them using mxbai-embed-large
✔ Store them inside ChromaDB
✔ Enable retrieval using LangChain Retriever
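The hybrid idea, blending embedding similarity with keyword/metadata overlap, can be illustrated without ChromaDB. Both `hybrid_score` and the weight `alpha` are made-up names for illustration, not the project's API:

```python
import math


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def hybrid_score(query_vec, chunk_vec, query_terms, chunk_keywords, alpha=0.8):
    """Blend vector similarity with RAKE-style keyword overlap.

    `alpha` is an illustrative weight, not a tuned project value."""
    overlap = len(set(query_terms) & set(chunk_keywords)) / max(len(query_terms), 1)
    return alpha * cosine(query_vec, chunk_vec) + (1 - alpha) * overlap


score = hybrid_score([1.0, 0.0], [1.0, 0.0], ["virtualbox", "network"], ["network"])
print(round(score, 2))  # -> 0.9
```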
Only required if data/chroma_db folder is empty.
If it already exists, you can skip ingestion and directly run the app.
You may also delete the existing data/chroma_db folder to rebuild it using any custom PDF.
| Step | Command | Purpose |
|---|---|---|
| 1️⃣ | `python -m scripts.run_ingestion --pdf data\raw_pdfs\virtualbox_6.pdf --out data\processed_csv\raw_blocks.json --chunk` | Parse PDF & generate metadata |
| 2️⃣ | `python -m scripts.build_chroma_db --input data\processed_csv\raw_blocks_chunked.json` | Build embeddings + local DB |
⚠ Note: query_chroma_db.py is only for testing.
Once the PDF has been ingested and the ChromaDB is built, you can run the application using either API mode or UI mode:
| Mode | Command | Description |
|---|---|---|
| FastAPI | `uvicorn api.app:app --reload` | Starts the REST API for querying the RAG system |
| Streamlit UI | `streamlit run ui/app.py` | Launches a minimal front-end interface (optional) |
To run both at the same time, start each command in a separate terminal.
Once both are running:
- FastAPI will be available at → http://localhost:8000
- Streamlit UI will open automatically in your browser
Hybrid-RAG-Bot/
│
├── ingestion/ ← PDF parsing + TOC detection + chunking
├── embeddings/ ← Embedding + ChromaDB builder
├── rag/ ← Retrieval + LLM pipeline
├── api/ ← FastAPI interface (basic)
├── scripts/ ← RUN THESE FIRST (pipeline scripts)
├── app_logging/ ← Modular logging system
├── config/settings.py ← Central configuration
├── data/ ← Output CSVs + vector DB
└── README.md
(Verified via project_snapshot.txt)
| Issue | Reason |
|---|---|
| Limited testing | Developed only on CPU (time-limited) |
| Threshold values | Tuned manually ("eyeballing") → needs testing |
| Only one API endpoint exists | Due to project deadline |
| CLI retrieval logs incomplete | API-based logging recommended |
🚀 Planned features:
- Add more API endpoints (upload PDF, rebuild DB, test queries)
- Automate ingestion & embedding — no CLI required
- Improve PDF generalization across formats
- Add Streamlit UI for user-friendly front-end
- Confidence scoring + metadata filtering
- Load balancing for large PDFs
Author: Allwin Kingstan
📧 tallwinkingstan@gmail.com
🔗 GitHub: https://github.com/Kingstan070