# AI Research Assistant

A Retrieval-Augmented Generation (RAG) system for academic question answering and document summarization over a custom corpus of text and PDF files. The project integrates Pinecone for vector search, LangChain for orchestration, and OpenAI's GPT-3.5-Turbo (with a local Flan-T5 fallback) to generate grounded responses. A Streamlit interface provides an interactive user experience.
## Features

- **Document Ingestion**: Supports plain text and PDF files; PDFs are automatically converted to text.
- **Vector Search**: Uses Pinecone to index document embeddings for semantic retrieval.
- **Question Answering**: Retrieves relevant passages from the corpus and generates answers grounded in the source material.
- **Document Summarization**: Summarizes any document (text or PDF) into a specified number of sentences.
- **Model Fallback**: When `OPENAI_API_KEY` is set, uses GPT-3.5-Turbo; otherwise falls back to a local `google/flan-t5-base` pipeline (see the sketch after this list).
- **Interactive UI**: Streamlit app for uploading files, rebuilding the index, asking questions, and summarizing documents.
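A minimal sketch of how such a fallback can be wired, assuming recent LangChain import paths (`langchain_openai`, `langchain_community`); the project's actual logic lives in `backend/rag_pipeline/rag_engine.py` and may differ:

```python
# Sketch only: illustrates the fallback described above, not the repo's exact code.
import os

def build_llm():
    """Return a hosted OpenAI model when a key is configured, else a local pipeline."""
    if os.getenv("OPENAI_API_KEY"):
        # Hosted path (import path assumed; it varies across LangChain versions).
        from langchain_openai import ChatOpenAI
        return ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
    # Local fallback: a Flan-T5 text2text pipeline wrapped for LangChain.
    from transformers import pipeline
    from langchain_community.llms import HuggingFacePipeline
    local = pipeline("text2text-generation", model="google/flan-t5-base",
                     max_new_tokens=256)
    return HuggingFacePipeline(pipeline=local)
```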
## Project Structure

```
.
├── backend
│   ├── rag_pipeline
│   │   └── rag_engine.py
│   ├── retriever
│   │   ├── pinecone_setup.py
│   │   └── document_retriever.py
│   └── utils
│       ├── document_loader.py
│       └── pdf_loader.py
├── data
│   └── processed_docs
├── frontend
│   └── streamlit_app.py
├── run_query.py
├── requirements.txt
├── .env
└── README.md
```
## Prerequisites

- Python 3.8 or later
- A Pinecone account and API key
- (Optional) An OpenAI account and API key for GPT-3.5-Turbo
- Recommended hardware: a CPU is sufficient; a GPU will accelerate local model inference
## Installation

1. **Clone the repository**

   ```bash
   git clone https://github.com/gauravch-code/AI-Research-Assistant.git
   cd AI-Research-Assistant
   ```

2. **Create and activate a virtual environment**

   ```bash
   python -m venv venv
   source venv/bin/activate    # macOS/Linux
   venv\Scripts\activate       # Windows
   ```

3. **Install dependencies**

   ```bash
   pip install -r requirements.txt
   ```

4. **Configure environment variables**

   Create a file named `.env` in the project root with the following entries:

   ```
   PINECONE_API_KEY=your-pinecone-key
   PINECONE_ENVIRONMENT=your-pinecone-environment
   OPENAI_API_KEY=sk-...   # Optional, required for GPT-3.5-Turbo
   ```
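A quick way to confirm the configuration loads, assuming `python-dotenv` is available (a common companion to a `.env` file; how the project itself reads these values may differ):

```python
# Sketch: sanity-check the .env configuration before running the app.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

for var in ("PINECONE_API_KEY", "PINECONE_ENVIRONMENT"):
    if not os.getenv(var):
        raise SystemExit(f"Missing required environment variable: {var}")

# The OpenAI key is optional; without it the local Flan-T5 fallback is used.
print("OpenAI key present:", bool(os.getenv("OPENAI_API_KEY")))
```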
## Usage

### Command Line

```bash
python run_query.py
```

- Loads and indexes all files in `data/processed_docs/`.
- Prompts for your question.
- Returns a grounded answer based on the indexed documents.
### Streamlit App

```bash
streamlit run frontend/streamlit_app.py
```

1. Upload `.txt` or `.pdf` files via the sidebar.
2. Click **Rebuild Index** to embed and index all documents.
3. Choose between **Ask a Question** and **Summarize a Document**:
   - **Ask a Question**: enter a query about the full corpus.
   - **Summarize a Document**: select a file and specify the number of sentences.
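For orientation, the general shape of such a front end might look like the skeleton below. This is illustrative only; the commented-out backend calls are hypothetical stand-ins for the functions in `frontend/streamlit_app.py`:

```python
# Sketch: a minimal Streamlit front end with the workflow described above.
import streamlit as st

st.title("AI Research Assistant")

uploaded = st.sidebar.file_uploader(
    "Upload documents", type=["txt", "pdf"], accept_multiple_files=True
)
if st.sidebar.button("Rebuild Index"):
    st.sidebar.write("Embedding and indexing documents...")
    # rebuild_index(uploaded)  # hypothetical backend call

mode = st.radio("Mode", ["Ask a Question", "Summarize a Document"])
if mode == "Ask a Question":
    query = st.text_input("Your question")
    if query:
        st.write("Answer appears here.")  # answer = rag_engine.answer(query)
else:
    doc = st.selectbox("Document", ["example.pdf"])  # placeholder file list
    n = st.number_input("Number of sentences", min_value=1, value=3)
    st.write("Summary appears here.")  # summary = rag_engine.summarize(doc, n)
```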
## How It Works

1. **Embedding**: Each document is converted to embeddings using a HuggingFace embedding model (`all-MiniLM-L6-v2`).
2. **Indexing**: Embeddings are stored in Pinecone for efficient similarity search.
3. **Retrieval**: Given a query, the top-k most relevant passages are fetched from Pinecone.
4. **Chunking**: Lengthy passages are split into manageable chunks to respect model context limits.
5. **Generation**:
   - **Q&A**: the model (GPT-3.5-Turbo or Flan-T5) consumes the retrieved chunks and the question to produce an answer.
   - **Summarization**: the model is prompted to condense the retrieved context into the specified number of sentences.
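A condensed sketch of this flow using the `langchain-huggingface` and `langchain-pinecone` packages. Import paths, the index name, and the sample file are assumptions, and this sketch splits documents before indexing rather than chunking retrieved passages; the project's own modules under `backend/` may be organized differently:

```python
# Sketch of the pipeline above. Assumes PINECONE_API_KEY is set in the
# environment and that a Pinecone index named "research-assistant" exists.
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_pinecone import PineconeVectorStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Steps 1-2: split a document into chunks, embed them, and index in Pinecone.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_text(open("data/processed_docs/example.txt").read())
store = PineconeVectorStore.from_texts(chunks, embeddings,
                                       index_name="research-assistant")

# Step 3: retrieve the top-k chunks most similar to the query.
query = "What methods does the paper propose?"
docs = store.similarity_search(query, k=4)

# Step 5: assemble a grounded prompt from the retrieved chunks.
context = "\n\n".join(d.page_content for d in docs)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
# answer = llm.invoke(prompt)   # llm built as in the fallback sketch above
```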
## Future Enhancements

- **Model configuration**: change `model_name` in `rag_engine.py` to switch to GPT-4 or another local model.
- **Retrieval enhancements**: implement hybrid or reranked retrieval strategies (a reranking sketch follows this list).
- **Fine-tuning**: integrate PEFT or LoRA for domain-specific model adaptation.
- **UI improvements**: add feedback collection, usage analytics, or custom styling to the Streamlit app.
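As one illustration of the reranking idea (not part of the current codebase), a cross-encoder from `sentence-transformers` can re-score retrieved chunks before generation:

```python
# Sketch: rerank retrieved chunks with a cross-encoder (illustrative only).
from sentence_transformers import CrossEncoder

def rerank(query, docs, top_n=4):
    """Re-score LangChain Documents (with .page_content) and keep the best."""
    scorer = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = scorer.predict([(query, d.page_content) for d in docs])
    ranked = sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_n]]
```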
## Contributing

1. Fork the repository and create a new branch.
2. Implement your changes, including tests and documentation updates.
3. Submit a pull request for review.
## Contact

- **Author**: Gaurav Chintakunta
- **Email**: gchin6@uic.edu
- **GitHub**: [gauravch-code](https://github.com/gauravch-code)

For any questions or feedback, please open an issue on GitHub or reach out via email.