A semantic search and question-answering system for research papers using Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs).
This system enables users to:
- Upload research papers (PDF)
- Ask natural language questions
- Receive concise, context-grounded answers with traceable sources
- 📄 PDF parsing with text cleaning
- 🔍 Section segmentation & semantic chunking
- 🧬 Embedding generation with `all-MiniLM-L6-v2`
- 📦 Vector indexing with Pinecone
- 🤖 Contextual answer generation using Gemini 1.5 Pro
- 🌐 Web UI for upload and Q&A
- ✅ Transparent answers with source document snippets
The system follows a Retrieval-Augmented Generation (RAG) pipeline with two main phases: Uploading and Querying.
- Document Extraction — PDFs are parsed with PyPDFLoader; text and metadata (title, page count) are extracted from each page.
- Text Cleaning — Raw text is preprocessed: special characters, headers, footers, page numbers, and stop words are removed via regex and NLTK.
- Section Segmentation — Cleaned text is split into logical sections (Abstract, Introduction, Methodology, Results, Conclusion, References) by detecting section headers with regex.
- Chunking — Each section is split into overlapping chunks (1000 characters, 40-character overlap) using `RecursiveCharacterTextSplitter` to preserve context across boundaries.
- Embedding Generation — Each chunk is converted to a 384-dimensional vector using `all-MiniLM-L6-v2`.
- Indexing — Embeddings and metadata (section, title, source) are stored in a Pinecone vector index for similarity search.
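The segmentation and chunking steps above can be sketched in plain Python. The section names come from the pipeline description, but the regex and both helper functions below are simplified illustrations, not the project's actual code (which uses LangChain's `RecursiveCharacterTextSplitter`):

```python
import re

# Detect common research-paper section headers at the start of a line.
# The pattern is a simplified assumption for illustration.
SECTION_PATTERN = re.compile(
    r"^(abstract|introduction|methodology|results|conclusion|references)\b",
    re.IGNORECASE | re.MULTILINE,
)

def segment_sections(text):
    """Split cleaned text into sections at detected headers."""
    matches = list(SECTION_PATTERN.finditer(text))
    sections = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[m.group(1).lower()] = text[m.start():end].strip()
    return sections

def chunk_text(text, chunk_size=1000, overlap=40):
    """Overlapping character chunks, mimicking the 1000/40 splitter settings."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```

The 40-character overlap means the tail of each chunk is repeated at the head of the next, so a sentence cut at a boundary still appears whole in at least one chunk.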
- Query Input — The user submits a natural language question via the web UI.
- Query Embedding — The query is embedded with the same model as document chunks so both live in the same semantic space.
- Similarity Search — Cosine similarity in Pinecone retrieves the top 5 most relevant chunks.
- Contextual Answer Generation — Retrieved chunks are passed to Gemini 1.5 Pro as context; the LLM synthesizes a concise answer using only this context.
- Response Delivery — The answer and supporting chunks are returned so users can trace and verify sources.
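The similarity-search step can be illustrated with a toy in-memory index. In the real system Pinecone performs this search over 384-dimensional MiniLM embeddings; the `cosine` and `top_k` helpers below are illustrative stand-ins:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k=5):
    """index: list of (chunk_id, vector) pairs; return ids ranked by similarity."""
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk_id for chunk_id, _ in scored[:k]]
```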
| Component | Technology |
|---|---|
| Vector DB | Pinecone (serverless, AWS, cosine similarity) |
| Embeddings | sentence-transformers/all-MiniLM-L6-v2 (384-dim) |
| LLM | Google Gemini 1.5 Pro (temperature=0.3 for factual answers) |
| Framework | LangChain (document loading, splitting, chains) |
| Web Server | Flask |
- `upload_file.py` — Loads PDFs, cleans text, segments by section, chunks, generates embeddings, and indexes into Pinecone.
- `query_engine.py` — Embeds the query, retrieves the top-5 chunks from Pinecone, and runs the RAG chain with Gemini.
- `server.py` — Flask app exposing `/upload` and `/query` endpoints and serving the web UI.
The retriever uses `search_type="similarity"` with `k=5`. Retrieved chunks are fed to a prompt instructing the LLM to use only the provided context, answer in at most three sentences, and say when the answer is unknown. Both the generated answer and the source chunks are returned for transparency.
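The grounded-prompt construction can be sketched as follows. The template wording paraphrases the instructions described above; `PROMPT_TEMPLATE` and `build_prompt` are illustrative names, not the project's exact code:

```python
# Paraphrased grounding instructions, not the repository's exact template.
PROMPT_TEMPLATE = (
    "Answer the question using only the context below, in at most three "
    "sentences. If the context does not contain the answer, say you don't "
    "know.\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"
)

def build_prompt(question, chunks):
    """Join the top retrieved chunks into one context block for the LLM."""
    context = "\n\n".join(chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)
```

Restricting the model to the supplied context is what makes the returned source chunks meaningful: every claim in the answer should be traceable to one of them.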
Run the app with the helper script:

```bash
chmod +x run_app.sh
./run_app.sh
```

Or set up manually:

- Clone the repository

```bash
git clone https://github.com/ajay-del-bot/research_paper_RAG_chain.git
cd research_paper_RAG_chain
```

- Create and activate a virtual environment

```bash
# For Linux/macOS
python3 -m venv venv
source venv/bin/activate
```

```bash
# For Windows
python -m venv venv
venv\Scripts\activate
```

- Install dependencies

```bash
pip install -r requirements.txt
```

- Set up environment variables in `.env`

```
PINECONE_API_KEY=YOUR_PINECONE_API_KEY
GOOGLE_API_KEY=YOUR_GOOGLE_API_KEY
INDEX_NAME='test-db'
```

- Start the server

```bash
python3 src/server.py
```
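At startup these variables can be read from the environment. The variable names match the `.env` file, but the `load_config` helper below is an illustrative sketch, not part of the repository:

```python
import os

def load_config():
    """Read the settings the pipeline needs; report any that are missing."""
    config = {
        "pinecone_api_key": os.getenv("PINECONE_API_KEY"),
        "google_api_key": os.getenv("GOOGLE_API_KEY"),
        "index_name": os.getenv("INDEX_NAME", "test-db"),
    }
    missing = [name for name, value in config.items() if value is None]
    return config, missing
```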
- Support for tables, figures, equations
- Better layout handling for multi-column PDFs
- User authentication & session history
- Integration with multiple LLMs
