An AI-powered document question-answering application that lets users ask natural-language questions about long PDF documents and receive accurate, source-grounded answers.
This project demonstrates a production-minded implementation of retrieval-augmented generation (RAG) using modern LLM tooling.
90-second walkthrough and setup video:
Many real-world workflows depend on large, complex documents that are difficult to search or reason about manually. This project explores how large language models, embeddings, and vector search can be combined to make long-form documents immediately useful while minimizing hallucinations.
- Ask natural-language questions over arbitrary PDF documents
- Embeddings-based retrieval for precise context selection
- Context-constrained LLM responses to improve accuracy
- Simple, fast UI designed for iteration and experimentation
- Python
- Streamlit
- OpenAI API
- ChromaDB
- LangChain
Step 1: Clone the repository
git clone https://github.com/Aeh961/Contract-Answerer.git
cd Contract-Answerer
Step 2: Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate
Step 3: Install dependencies
pip install -r requirements.txt
Step 4: Add your documents
mkdir -p data/contracts
Place any PDF files you want to query into the data/contracts folder.
Step 5: Set your API key
export OPENAI_API_KEY="your_key_here"
Step 6: Run the application
streamlit run src/app.py
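
If the app fails to start, the usual causes are a missing API key or an empty documents folder. The following is a small optional preflight sketch, not part of the repository: the `preflight` function and its parameters are hypothetical names chosen for illustration, and it only assumes the `OPENAI_API_KEY` variable and the data/contracts folder described in the steps above.

```python
import os
from pathlib import Path


def preflight(env=None, docs_dir="data/contracts"):
    """Return a list of human-readable setup problems (empty if none found)."""
    # Fall back to the real process environment when no mapping is supplied.
    env = os.environ if env is None else env
    problems = []

    # Step 5: the API key must be exported in the current shell.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")

    # Step 4: the documents folder must exist and hold at least one PDF.
    folder = Path(docs_dir)
    if not folder.is_dir():
        problems.append(f"{docs_dir} does not exist")
    elif not any(p.suffix.lower() == ".pdf" for p in folder.iterdir()):
        problems.append(f"{docs_dir} contains no PDF files")

    return problems


if __name__ == "__main__":
    issues = preflight()
    for issue in issues:
        print("Setup problem:", issue)
    if not issues:
        print("Setup looks complete.")
```

Run it from the repository root, in the same shell where the key was exported, so that the relative folder path and the environment variable both resolve.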
- PDF documents are loaded and split into semantic chunks
- Each chunk is embedded and stored in a vector database
- User questions retrieve the most relevant chunks
- The language model generates answers using only retrieved context
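
The four steps above can be illustrated with a dependency-free sketch. This is not the code in src/load.py or src/app.py: every function name here is hypothetical, the bag-of-words `embed` function is a toy stand-in for a real embedding model, the in-memory list stands in for the vector database, and fixed-size word windows stand in for semantic chunking. It only shows the shape of the flow: chunk, embed, retrieve, then build a prompt restricted to the retrieved context.

```python
import math
import re
from collections import Counter


def split_into_chunks(text, chunk_size=40, overlap=10):
    """Split text into overlapping word windows (a simple stand-in for semantic chunking)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def embed(text):
    """Toy embedding: lowercase token counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, index, k=2):
    """Return the k chunks whose vectors are most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question, chunks):
    """Assemble a prompt that instructs the model to answer from the retrieved context only."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer using only the context below. "
        "If the answer is not in the context, say you do not know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Example run on a made-up three-sentence document.
document = (
    "The lease term is twelve months beginning on the first of January. "
    "Rent is due on the fifth day of each month. "
    "Either party may terminate the agreement with sixty days written notice."
)
chunks = split_into_chunks(document, chunk_size=12, overlap=0)
index = [(c, embed(c)) for c in chunks]  # in-memory stand-in for the vector store
top = retrieve("When is rent due?", index, k=1)
prompt = build_prompt("When is rent due?", top)
print(prompt)
```

The prompt string is what would be sent to the language model. Keeping the instruction and the numbered context together is one common way to discourage answers that go beyond the retrieved text, and the bracketed numbers leave room for citing sources later.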
src/
  app.py            Streamlit application
  load.py           Document ingestion and embedding
data/
  contracts/        User-provided PDFs
README.md
requirements.txt
- Inline citations with highlighted source text
- Streaming responses for improved user experience
- Document metadata filtering
- Cost and latency optimizations
Built by Abdallah Elhamawi as part of a broader exploration into practical and reliable AI systems.