A semantic search and question-answering system for research papers using Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs).
This system enables users to:
- Upload research papers (PDF)
- Ask natural language questions
- Receive concise, context-grounded answers with traceable sources
- 📄 PDF parsing with text cleaning
- 🔍 Section segmentation & semantic chunking
- 🧬 Embedding generation with `all-MiniLM-L6-v2`
- 📦 Vector indexing with Pinecone
- 🤖 Contextual answer generation using Gemini 1.5 Pro
- 🌐 Web UI for upload and Q&A
- ✅ Transparent answers with source document snippets
The system follows a Retrieval-Augmented Generation (RAG) pipeline with two main phases: Uploading and Querying.
- Document Extraction — PDFs are parsed with PyPDFLoader; text and metadata (title, page count) are extracted from each page.
- Text Cleaning — Raw text is preprocessed: special characters, headers, footers, page numbers, and stop words are removed via regex and NLTK.
- Section Segmentation — Cleaned text is split into logical sections (Abstract, Introduction, Methodology, Results, Conclusion, References) by detecting section headers with regex.
- Chunking — Each section is split into overlapping chunks (1000 characters, 40-character overlap) using `RecursiveCharacterTextSplitter` to preserve context across boundaries.
- Embedding Generation — Each chunk is converted to a 384-dimensional vector using `all-MiniLM-L6-v2`.
- Indexing — Embeddings and metadata (section, title, source) are stored in a Pinecone vector index for similarity search.
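The segmentation and chunking steps above can be sketched in plain Python. The section names come from the pipeline description, but the regex and both helper functions below are simplified illustrations, not the project's actual code (which uses LangChain's `RecursiveCharacterTextSplitter`):

```python
import re

# Detect common research-paper section headers at the start of a line.
# The pattern is a simplified assumption for illustration.
SECTION_PATTERN = re.compile(
    r"^(abstract|introduction|methodology|results|conclusion|references)\b",
    re.IGNORECASE | re.MULTILINE,
)

def segment_sections(text):
    """Split cleaned text into sections at detected headers."""
    matches = list(SECTION_PATTERN.finditer(text))
    sections = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[m.group(1).lower()] = text[m.start():end].strip()
    return sections

def chunk_text(text, chunk_size=1000, overlap=40):
    """Overlapping character chunks, mimicking the 1000/40 splitter settings."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```

The 40-character overlap means the tail of each chunk is repeated at the head of the next, so a sentence cut at a boundary still appears whole in at least one chunk.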
- Query Input — The user submits a natural language question via the web UI.
- Query Embedding — The query is embedded with the same model as document chunks so both live in the same semantic space.
- Similarity Search — Cosine similarity in Pinecone retrieves the top 5 most relevant chunks.
- Contextual Answer Generation — Retrieved chunks are passed to Gemini 1.5 Pro as context; the LLM synthesizes a concise answer using only this context.
- Response Delivery — The answer and supporting chunks are returned so users can trace and verify sources.
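The similarity-search step can be illustrated with a toy in-memory index. In the real system Pinecone performs this search over 384-dimensional MiniLM embeddings; the `cosine` and `top_k` helpers below are illustrative stand-ins:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k=5):
    """index: list of (chunk_id, vector) pairs; return ids ranked by similarity."""
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk_id for chunk_id, _ in scored[:k]]
```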
| Component | Technology |
|---|---|
| Vector DB | Pinecone (serverless, AWS, cosine similarity) |
| Embeddings | sentence-transformers/all-MiniLM-L6-v2 (384-dim) |
| LLM | Google Gemini 1.5 Pro (temperature=0.3 for factual answers) |
| Framework | LangChain (document loading, splitting, chains) |
| Web Server | Flask |
- `upload_file.py` — Loads PDFs, cleans text, segments by section, chunks, generates embeddings, and indexes into Pinecone.
- `query_engine.py` — Embeds the query, retrieves the top-5 chunks from Pinecone, and runs the RAG chain with Gemini.
- `server.py` — Flask app exposing `/upload` and `/query` endpoints and serving the web UI.
The retriever uses `search_type="similarity"` with `k=5`. Retrieved chunks are fed to a prompt instructing the LLM to use only the provided context, answer in at most three sentences, and say when the answer is unknown. Both the generated answer and the source chunks are returned for transparency.
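The grounded-prompt construction can be sketched as follows. The template wording paraphrases the instructions described above; `PROMPT_TEMPLATE` and `build_prompt` are illustrative names, not the project's exact code:

```python
# Paraphrased grounding instructions, not the repository's exact template.
PROMPT_TEMPLATE = (
    "Answer the question using only the context below, in at most three "
    "sentences. If the context does not contain the answer, say you don't "
    "know.\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"
)

def build_prompt(question, chunks):
    """Join the top retrieved chunks into one context block for the LLM."""
    context = "\n\n".join(chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)
```

Restricting the model to the supplied context is what makes the returned source chunks meaningful: every claim in the answer should be traceable to one of them.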
Run the app with the helper script:

```bash
chmod +x run_app.sh
./run_app.sh
```

Or set up manually:

- Clone the repository

```bash
git clone https://github.com/ajay-del-bot/research_paper_RAG_chain.git
cd research_paper_RAG_chain
```

- Create and activate a virtual environment

```bash
# For Linux/macOS
python3 -m venv venv
source venv/bin/activate
```

```bash
# For Windows
python -m venv venv
venv\Scripts\activate
```

- Install dependencies

```bash
pip install -r requirements.txt
```

- Set up environment variables in `.env`

```
PINECONE_API_KEY=YOUR_PINECONE_API_KEY
GOOGLE_API_KEY=YOUR_GOOGLE_API_KEY
INDEX_NAME='test-db'
```

- Start the server

```bash
python3 src/server.py
```
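At startup these variables can be read from the environment. The variable names match the `.env` file, but the `load_config` helper below is an illustrative sketch, not part of the repository:

```python
import os

def load_config():
    """Read the settings the pipeline needs; report any that are missing."""
    config = {
        "pinecone_api_key": os.getenv("PINECONE_API_KEY"),
        "google_api_key": os.getenv("GOOGLE_API_KEY"),
        "index_name": os.getenv("INDEX_NAME", "test-db"),
    }
    missing = [name for name, value in config.items() if value is None]
    return config, missing
```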
- Support for tables, figures, equations
- Better layout handling for multi-column PDFs
- User authentication & session history
- Integration with multiple LLMs
