A highly efficient AI tool for legal professionals to rapidly search, sort, and synthesize knowledge from massive document sets. This project demonstrates an advanced Retrieval-Augmented Generation (RAG) architecture that uses Groq for real-time, low-latency inference, orchestrated with LangChain.
This application moves beyond basic RAG by incorporating agentic principles and robust data management.
- 🧠 Agentic Retrieval (Multi-Query): Uses the Groq LLM to decompose a complex user question into multiple sub-queries (Query Division), so that retrieval from the VectorDB covers more of the relevant context and the final answer is more accurate.
- 🎯 Contextual Compression: Implements an `LLMChainExtractor` (a form of re-ranking) to filter out irrelevant information retrieved by the Multi-Query step, ensuring the final Groq model only sees the most pertinent chunks.
- 📄 Hybrid PDF Parsing: Uses PyPDF and PDFMiner for efficient text extraction from digital PDFs, and falls back to Tesseract OCR for scanned or handwritten documents.
- 🔒 Persistence & Lifecycle Management: Data is stored securely and locally in ChromaDB. A 7-Day Automatic Buffer manages the data lifecycle by deleting old documents in a separate, scheduled background process.
- ⚡ Low-Latency Synthesis: Leverages the Groq API (`llama-3.1-8b-instant`) for blazing-fast answer generation.
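
The Multi-Query and Contextual Compression steps above can be sketched in plain Python. This is a minimal illustration, not the project's `search.py`: the three callables (`generate_subqueries`, `search`, `is_relevant`) are hypothetical stand-ins for the Groq LLM, the ChromaDB retriever, and the LLM-based extractor, and a simple keep-or-drop filter stands in for passage extraction. Including the original question among the queries is also a choice made here.

```python
from typing import Callable, Iterable


def multi_query_retrieve(
    question: str,
    generate_subqueries: Callable[[str], Iterable[str]],  # stand-in for the LLM
    search: Callable[[str], Iterable[str]],               # stand-in for the vector store
    is_relevant: Callable[[str, str], bool],              # stand-in for the compressor
) -> list[str]:
    """Fan a question out into sub-queries, merge the hits, then filter them."""
    seen: set[str] = set()
    merged: list[str] = []
    # Query with the original question plus every generated sub-query.
    for query in [question, *generate_subqueries(question)]:
        for chunk in search(query):
            if chunk not in seen:  # de-duplicate chunks returned by several queries
                seen.add(chunk)
                merged.append(chunk)
    # Compression step: keep only chunks judged relevant to the original question.
    return [chunk for chunk in merged if is_relevant(question, chunk)]


if __name__ == "__main__":
    # Toy data so the sketch runs without an LLM or a database.
    corpus = {
        "termination clause": ["Clause 12: termination for cause.", "Clause 3: definitions."],
        "notice period": ["Clause 12: termination for cause.", "Clause 14: 30-day notice."],
    }
    chunks = multi_query_retrieve(
        "How can the contract be terminated, and with what notice?",
        generate_subqueries=lambda q: ["termination clause", "notice period"],
        search=lambda q: corpus.get(q, []),
        is_relevant=lambda q, chunk: "definitions" not in chunk,
    )
    print(chunks)
```

De-duplicating before the relevance filter keeps the number of (potentially LLM-backed) relevance checks proportional to the number of distinct chunks rather than the number of sub-queries.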
The application is structured into two main, independently running processes for maximum resilience:
| File / Component | Purpose | Functionality |
|---|---|---|
| `app.py` | Frontend | Streamlit UI for file upload and querying. |
| `ingest.py` | Ingestion Pipeline | Handles file reading, encryption, chunking, embedding, and storage. |
| `search.py` | Agentic RAG Engine | Contains the Multi-Query Retriever, Contextual Compression, and the Groq LLM chain. |
| `scheduler.py` | Background Process | Runs continuously to automatically delete documents older than 7 days. |
| `utils.py` | Utilities | Contains file encryption (Fernet) and the robust Hybrid PDF Parser (PyMuPDF + pytesseract). |
- Python 3.10+
- Tesseract OCR Engine: Must be installed separately on your operating system to enable the handwritten document feature.
- Poppler (for Windows/Linux): Required for `PyMuPDF` image rendering if Tesseract is used.
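
Before installing, it can help to confirm that the prerequisites above are visible to Python. The following is a small standard-library sketch, not part of the project: `check_prerequisites` is a hypothetical helper, and looking for `pdftoppm` (one of Poppler's command-line tools) is used here as a proxy for a Poppler installation.

```python
import shutil
import sys


def check_prerequisites(
    min_python: tuple[int, int] = (3, 10),
    binaries: tuple[str, ...] = ("tesseract", "pdftoppm"),  # pdftoppm: assumed Poppler proxy
) -> dict[str, bool]:
    """Report whether the Python version and external binaries are available."""
    report = {"python": sys.version_info[:2] >= min_python}
    for name in binaries:
        # shutil.which returns None when the executable is not on PATH.
        report[name] = shutil.which(name) is not None
    return report


if __name__ == "__main__":
    for item, ok in check_prerequisites().items():
        print(f"{item}: {'found' if ok else 'MISSING'}")
```

A missing entry usually means the tool is not installed or its directory is not on `PATH`.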
- Clone the repository:

      git clone [YOUR-REPO-URL]
      cd [YOUR-REPO-NAME]

- Create and activate a virtual environment:

      python -m venv venv
      .\venv\Scripts\activate      # Windows
      # source venv/bin/activate   # macOS/Linux

- Install dependencies:

      pip install -r requirements.txt

- Configure Environment Variables (`.env` file): Create a file named `.env` in the root directory and add your API key and a secret key:

      GROQ_API_KEY=your_groq_api_key_here
      FERNET_KEY=your_fernet_key_here
      # Optional: CHROMA_PERSIST_DIR=./chroma_db
You must run the frontend and the data lifecycle scheduler in separate terminals.
Run this command in your first terminal:
streamlit run app.py