🚀 Hybrid RAG Support Bot

Structure-Aware Retrieval-Augmented Generation for PDFs

A Local, Privacy-Preserving RAG System Designed for Real-World Documents


📌 Introduction

This project aims to overcome a major limitation in most Retrieval-Augmented Generation (RAG) systems:

RAG systems do not understand document structure — they only process flat text.

PDFs in the real world contain:

  • Different formatting styles
  • Missing/irregular table of contents
  • Long paragraphs without headings
  • Page-level context where meaning depends on structure

Our goal was to build a robust structure-aware RAG pipeline that understands PDFs using:
✔ TOC detection (Level-1 / Level-2 / No TOC)
✔ Chapter + section metadata extraction
✔ Paragraph-based chunking
✔ Keyword extraction using RAKE
✔ Local embeddings using Ollama
✔ Hybrid retrieval — metadata + similarity search
✔ Latency measurement for each stage
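The per-stage latency measurement mentioned above can be sketched with a simple decorator. This is a minimal illustration, not the project's actual logging code; the `timed` helper and the stage name are hypothetical:

```python
import time
from functools import wraps

def timed(stage):
    """Log how long a pipeline stage takes (hypothetical helper)."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            elapsed = time.perf_counter() - start
            print(f"[latency] {stage}: {elapsed:.3f}s")
            return result
        return wrapper
    return decorator

@timed("chunking")
def chunk_pages(pages):
    # Placeholder for the real paragraph-based chunker.
    return [p.strip() for p in pages]
```

In the real pipeline, the elapsed time would go through the modular logging system in app_logging/ rather than `print`.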

What is achieved now:
✔ PDF parsing with structural awareness
✔ Metadata + keyword-rich embeddings stored in ChromaDB
✔ Working CLI retrieval
✔ A functional API (api/app.py) for querying via FastAPI
✔ Fully local execution — no cloud APIs used

This is a working prototype, tested on CPU, built within a limited time.
More testing and improvements are planned in future updates.
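As an illustration of the TOC detection idea (Level-1 / Level-2 / No TOC): PyMuPDF's `Document.get_toc()` returns `[level, title, page]` triples, so a minimal classifier over that output might look like this (a sketch only; the project's real detection logic may differ):

```python
def classify_toc(toc):
    """Classify a PyMuPDF-style TOC: a list of [level, title, page] triples."""
    if not toc:
        return "no_toc"
    levels = {entry[0] for entry in toc}
    # Any entry deeper than level 1 means we have chapter + section structure.
    return "level2" if max(levels) >= 2 else "level1"
```

The classification then decides whether chunks can be tagged with both chapter and section metadata, chapter only, or neither.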


🧠 Core Modules Used (and WHY)

| Module | Purpose |
| --- | --- |
| PyMuPDF (fitz) | PDF parsing, page extraction, TOC reading |
| RAKE-NLTK | Keyword extraction for boosting retrieval relevance |
| LangChain | Framework to connect vector DB + LLM + retriever |
| ChromaDB | Local vector database to store embeddings & metadata |
| Ollama | Local LLM inference (no API key, fully offline) |
| FastAPI | API layer to interact with the RAG system |
| Pydantic | Configuration management (config/settings.py) |
| Custom loggers | Track parsing, embedding, and query latency |
| NumPy / Pandas | Data handling for chunks & processed CSVs |

⚙️ Setup Instructions

🧪 1️⃣ Create Conda Environment

```bash
conda env create -f environment.yml
conda activate hybrid_bot
```

📦 2️⃣ Install Python Dependencies

```bash
pip install -r requirements.txt
```

🤖 3️⃣ Install Ollama (Required)

Download & install from:
🔗 https://ollama.ai/download

📥 4️⃣ Pull Models (Embedding + LLM)

```bash
ollama pull mxbai-embed-large     # Embedding model
ollama pull llama3.2:3b           # LLM for generation
```

▶ Usage — How the System Works

📌 Why Do We Ingest the PDF?

Before retrieval is possible, we must:
✔ Parse PDF + detect structure
✔ Split into meaningful chunks
✔ Extract metadata & keywords
✔ Store in CSV & JSON formats
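The paragraph-based chunking step can be sketched as a stand-alone function. This is a simplified version; the real chunker in ingestion/ also attaches chapter/section metadata to each chunk:

```python
def chunk_paragraphs(text, max_chars=1000):
    """Split text on blank lines, then pack paragraphs into size-bounded chunks."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Chunking on paragraph boundaries (rather than fixed character windows) keeps each chunk semantically coherent, which matters for retrieval quality.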

📌 Why Do We Build ChromaDB?

Once chunks are extracted, we:
✔ Embed them using mxbai-embed-large
✔ Store them inside ChromaDB
✔ Enable retrieval using LangChain Retriever
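The hybrid retrieval idea (metadata filtering combined with similarity search) can be illustrated in miniature. This stand-alone sketch uses toy vectors and a plain cosine score rather than the actual ChromaDB/LangChain retriever; the `store` layout and the `chapter` filter are illustrative assumptions:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def hybrid_retrieve(query_vec, store, chapter=None, top_k=2):
    """Filter by metadata first, then rank the survivors by similarity."""
    candidates = [
        doc for doc in store
        if chapter is None or doc["meta"].get("chapter") == chapter
    ]
    candidates.sort(key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    return candidates[:top_k]
```

ChromaDB supports the same pattern natively via a `where` metadata filter on queries, which is what the LangChain retriever drives under the hood.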


🚀 Run These Scripts (In Correct Order)

Only required if data/chroma_db folder is empty. If it already exists, you can skip ingestion and directly run the app. You may also delete the existing data/chroma_db folder to rebuild it using any custom PDF.

| Step | Command | Purpose |
| --- | --- | --- |
| 1️⃣ | `python -m scripts.run_ingestion --pdf data\raw_pdfs\virtualbox_6.pdf --out data\processed_csv\raw_blocks.json --chunk` | Parse the PDF & generate metadata |
| 2️⃣ | `python -m scripts.build_chroma_db --input data\processed_csv\raw_blocks_chunked.json` | Build embeddings + local DB |

Note: query_chroma_db.py is only for testing.


🖥️ Run the Application (API + UI)

Once the PDF has been ingested and the ChromaDB is built, you can run the application using either API mode or UI mode:

| Mode | Command | Description |
| --- | --- | --- |
| FastAPI | `uvicorn api.app:app --reload` | Starts the REST API for querying the RAG system |
| Streamlit UI | `streamlit run ui/app.py` | Launches a minimal front-end interface (optional) |

To run both at the same time, launch each command in its own terminal.

Once both are running:

  • FastAPI will be available at → http://localhost:8000
  • Streamlit UI will open automatically in your browser

📁 Project Structure

Hybrid-RAG-Bot/
│
├── ingestion/              ← PDF parsing + TOC detection + chunking
├── embeddings/             ← Embedding + ChromaDB builder
├── rag/                    ← Retrieval + LLM pipeline
├── api/                    ← FastAPI interface (basic)
├── scripts/                ← RUN THESE FIRST (pipeline scripts)
├── app_logging/            ← Modular logging system
├── config/settings.py      ← Central configuration
├── data/                   ← Output CSVs + vector DB
└── README.md

(Verified via project_snapshot.txt)


⚠ Known Issues / Limitations

| Issue | Reason |
| --- | --- |
| Limited testing | Developed only on CPU (time-limited) |
| Threshold values | Tuned manually ("eyeballing"); needs systematic testing |
| Only one API endpoint exists | Due to the project deadline |
| CLI retrieval logs incomplete | API-based logging recommended instead |

🔧 Future Fixes & Improvements

🚀 Planned features:

  • Add more API endpoints (upload PDF, rebuild DB, test queries)
  • Automate ingestion & embedding — no CLI required
  • Improve PDF generalization across formats
  • Add Streamlit UI for user-friendly front-end
  • Confidence scoring + metadata filtering
  • Load balancing for large PDFs

📬 Contact

Author: Allwin Kingstan
📧 tallwinkingstan@gmail.com

🔗 GitHub: https://github.com/Kingstan070

About

A hybrid RAG system that ingests technical PDFs, detects chapters using dynamic TOC parsing (including edge cases), chunks content intelligently, extracts keywords, and prepares metadata for filtering. Built with modular pipelines and local LLMs via Ollama, optimized for performance, and designed to minimize hallucination.
