# AI Research Assistant

A Retrieval-Augmented Generation (RAG) system for academic question answering and document summarization over a custom corpus of text and PDF files. The project integrates Pinecone for vector search, LangChain for orchestration, and OpenAI's GPT-3.5-Turbo (with a local Flan-T5 fallback) to generate grounded responses. A Streamlit interface provides an interactive user experience.
## Features

- **Document Ingestion**: Supports plain text and PDF files; PDFs are automatically converted to text.
- **Vector Search**: Uses Pinecone to index document embeddings for semantic retrieval.
- **Question Answering**: Retrieves relevant passages from the corpus and generates answers grounded in the source material.
- **Document Summarization**: Summarizes any document (text or PDF) into a specified number of sentences.
- **Model Fallback**: When `OPENAI_API_KEY` is set, uses GPT-3.5-Turbo; otherwise falls back to a local `google/flan-t5-base` pipeline (see the sketch after this list).
- **Interactive UI**: Streamlit app for uploading files, rebuilding the index, asking questions, and summarizing documents.
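A minimal sketch of how such a fallback can be wired, assuming recent LangChain import paths (`langchain_openai`, `langchain_community`); the project's actual logic lives in `backend/rag_pipeline/rag_engine.py` and may differ:

```python
# Sketch only: illustrates the fallback described above, not the repo's exact code.
import os

def build_llm():
    """Return a hosted OpenAI model when a key is configured, else a local pipeline."""
    if os.getenv("OPENAI_API_KEY"):
        # Hosted path (import path assumed; it varies across LangChain versions).
        from langchain_openai import ChatOpenAI
        return ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
    # Local fallback: a Flan-T5 text2text pipeline wrapped for LangChain.
    from transformers import pipeline
    from langchain_community.llms import HuggingFacePipeline
    local = pipeline("text2text-generation", model="google/flan-t5-base",
                     max_new_tokens=256)
    return HuggingFacePipeline(pipeline=local)
```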
## Project Structure

```
.
├── backend
│   ├── rag_pipeline
│   │   └── rag_engine.py
│   ├── retriever
│   │   ├── pinecone_setup.py
│   │   └── document_retriever.py
│   └── utils
│       ├── document_loader.py
│       └── pdf_loader.py
├── data
│   └── processed_docs
├── frontend
│   └── streamlit_app.py
├── run_query.py
├── requirements.txt
├── .env
└── README.md
```
## Prerequisites

- Python 3.8 or later
- A Pinecone account and API key
- (Optional) An OpenAI account and API key for GPT-3.5-Turbo
- Recommended hardware: a CPU is sufficient; a GPU will accelerate local model inference
## Installation

1. **Clone the repository**

   ```bash
   git clone https://github.com/gauravch-code/AI-Research-Assistant.git
   cd AI-Research-Assistant
   ```

2. **Create and activate a virtual environment**

   ```bash
   python -m venv venv
   source venv/bin/activate    # macOS/Linux
   venv\Scripts\activate       # Windows
   ```

3. **Install dependencies**

   ```bash
   pip install -r requirements.txt
   ```

4. **Configure environment variables**

   Create a file named `.env` in the project root with the following entries:

   ```
   PINECONE_API_KEY=your-pinecone-key
   PINECONE_ENVIRONMENT=your-pinecone-environment
   OPENAI_API_KEY=sk-...   # Optional, required for GPT-3.5-Turbo
   ```
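A quick way to confirm the configuration loads, assuming `python-dotenv` is available (a common companion to a `.env` file; how the project itself reads these values may differ):

```python
# Sketch: sanity-check the .env configuration before running the app.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

for var in ("PINECONE_API_KEY", "PINECONE_ENVIRONMENT"):
    if not os.getenv(var):
        raise SystemExit(f"Missing required environment variable: {var}")

# The OpenAI key is optional; without it the local Flan-T5 fallback is used.
print("OpenAI key present:", bool(os.getenv("OPENAI_API_KEY")))
```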
## Usage

### Command Line

```bash
python run_query.py
```

- Loads and indexes all files in `data/processed_docs/`.
- Prompts for your question.
- Returns a grounded answer based on the indexed documents.
### Streamlit App

```bash
streamlit run frontend/streamlit_app.py
```

1. Upload `.txt` or `.pdf` files via the sidebar.
2. Click **Rebuild Index** to embed and index all documents.
3. Choose between **Ask a Question** and **Summarize a Document**:
   - **Ask a Question**: enter a query about the full corpus.
   - **Summarize a Document**: select a file and specify the number of sentences.
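For orientation, the general shape of such a front end might look like the skeleton below. This is illustrative only; the commented-out backend calls are hypothetical stand-ins for the functions in `frontend/streamlit_app.py`:

```python
# Sketch: a minimal Streamlit front end with the workflow described above.
import streamlit as st

st.title("AI Research Assistant")

uploaded = st.sidebar.file_uploader(
    "Upload documents", type=["txt", "pdf"], accept_multiple_files=True
)
if st.sidebar.button("Rebuild Index"):
    st.sidebar.write("Embedding and indexing documents...")
    # rebuild_index(uploaded)  # hypothetical backend call

mode = st.radio("Mode", ["Ask a Question", "Summarize a Document"])
if mode == "Ask a Question":
    query = st.text_input("Your question")
    if query:
        st.write("Answer appears here.")  # answer = rag_engine.answer(query)
else:
    doc = st.selectbox("Document", ["example.pdf"])  # placeholder file list
    n = st.number_input("Number of sentences", min_value=1, value=3)
    st.write("Summary appears here.")  # summary = rag_engine.summarize(doc, n)
```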
## How It Works

1. **Embedding**: Each document is converted to embeddings using a HuggingFace embedding model (`all-MiniLM-L6-v2`).
2. **Indexing**: Embeddings are stored in Pinecone for efficient similarity search.
3. **Retrieval**: Given a query, the top-k most relevant passages are fetched from Pinecone.
4. **Chunking**: Lengthy passages are split into manageable chunks to respect model context limits.
5. **Generation**:
   - **Q&A**: the model (GPT-3.5-Turbo or Flan-T5) consumes the retrieved chunks and the question to produce an answer.
   - **Summarization**: the model is prompted to condense the retrieved context into the specified number of sentences.
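A condensed sketch of this flow using the `langchain-huggingface` and `langchain-pinecone` packages. Import paths, the index name, and the sample file are assumptions, and this sketch splits documents before indexing rather than chunking retrieved passages; the project's own modules under `backend/` may be organized differently:

```python
# Sketch of the pipeline above. Assumes PINECONE_API_KEY is set in the
# environment and that a Pinecone index named "research-assistant" exists.
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_pinecone import PineconeVectorStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Steps 1-2: split a document into chunks, embed them, and index in Pinecone.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_text(open("data/processed_docs/example.txt").read())
store = PineconeVectorStore.from_texts(chunks, embeddings,
                                       index_name="research-assistant")

# Step 3: retrieve the top-k chunks most similar to the query.
query = "What methods does the paper propose?"
docs = store.similarity_search(query, k=4)

# Step 5: assemble a grounded prompt from the retrieved chunks.
context = "\n\n".join(d.page_content for d in docs)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
# answer = llm.invoke(prompt)   # llm built as in the fallback sketch above
```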
## Future Enhancements

- **Model configuration**: change `model_name` in `rag_engine.py` to switch to GPT-4 or another local model.
- **Retrieval enhancements**: implement hybrid or reranked retrieval strategies (a reranking sketch follows this list).
- **Fine-tuning**: integrate PEFT or LoRA for domain-specific model adaptation.
- **UI improvements**: add feedback collection, usage analytics, or custom styling to the Streamlit app.
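As one illustration of the reranking idea (not part of the current codebase), a cross-encoder from `sentence-transformers` can re-score retrieved chunks before generation:

```python
# Sketch: rerank retrieved chunks with a cross-encoder (illustrative only).
from sentence_transformers import CrossEncoder

def rerank(query, docs, top_n=4):
    """Re-score LangChain Documents (with .page_content) and keep the best."""
    scorer = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = scorer.predict([(query, d.page_content) for d in docs])
    ranked = sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_n]]
```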
## Contributing

1. Fork the repository and create a new branch.
2. Implement your changes, including tests and documentation updates.
3. Submit a pull request for review.
## Contact

- **Author**: Gaurav Chintakunta
- **Email**: gchin6@uic.edu
- **GitHub**: [gauravch-code](https://github.com/gauravch-code)

For any questions or feedback, please open an issue on GitHub or reach out via email.