This project implements a lightweight Retrieval-Augmented Generation (RAG) system exposed as a REST API using FastAPI.
The system allows users to:
- Upload a PDF document
- Index its content into a vector database (Chroma)
- Ask questions about the document
- Receive grounded answers with supporting source snippets
The architecture combines:
- Dense vector retrieval (SentenceTransformers + Chroma)
- Context-constrained LLM generation (FLAN-T5)
- Deterministic lexical verification for "mention/contain/include" queries
The API is fully containerized with Docker and designed to be reproducible and easy to deploy.
- Python
- FastAPI – REST API
- SentenceTransformers – embedding generation
- Chroma – vector database
- Transformers (FLAN-T5) – answer generation
- Docker – containerization
The system follows a standard Retrieval-Augmented Generation (RAG) pipeline:
```
User query
    │
    ▼
FastAPI /ask
    │
    ▼
Retriever (Chroma)
    │
    ▼
Top-k document chunks
    │
    ▼
Context construction
    │
    ▼
LLM (FLAN-T5)
    │
    ▼
Answer + sources
```
- A PDF file is uploaded via `/ingest`
- The document is split into chunks
- Each chunk is embedded using SentenceTransformers (`all-MiniLM-L6-v2`)
- Embeddings are stored in a Chroma vector database
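The chunking step can be sketched as a fixed-size character window with overlap. This is a minimal stand-in, not the project's actual implementation; the `chunk_size` and `overlap` defaults here are illustrative:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows.

    chunk_size and overlap are illustrative defaults, not the
    project's actual configuration.
    """
    chunks = []
    start = 0
    step = chunk_size - overlap  # advance by window minus overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks
```

Overlap preserves sentences that would otherwise be cut at a chunk boundary, at the cost of some duplicated text in the index.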
When a query is sent to `/ask`:
- The query is embedded using the same embedding model
- Top-k most similar chunks are retrieved from Chroma
- Retrieved chunks are treated as the grounding context
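Conceptually, the retrieval step ranks stored chunk embeddings by cosine similarity to the query embedding. Chroma performs this search internally; the self-contained stand-in below only illustrates the idea:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float],
          chunk_vecs: list[list[float]],
          k: int = 3) -> list[int]:
    """Return indices of the k chunks most similar to the query."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]
```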
Two answer modes exist:
If the query contains keywords such as:
- "mention"
- "contain"
- "include"
- "appear"
- "occur"
The system performs deterministic lexical matching on retrieved chunks. This avoids unnecessary LLM hallucination for existence-based queries.
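A sketch of how such a deterministic check might look. The function names and keyword heuristic here are illustrative, not the actual `rag/pipeline.py` API:

```python
# Keywords that signal an existence-based ("does it mention X?") query.
MENTION_KEYWORDS = ("mention", "contain", "include", "appear", "occur")

def is_mention_query(query: str) -> bool:
    """Heuristic: does the query ask about the existence of a term?"""
    q = query.lower()
    return any(kw in q for kw in MENTION_KEYWORDS)

def lexical_check(term: str, chunks: list[str]) -> bool:
    """Case-insensitive substring match over the retrieved chunks."""
    t = term.lower()
    return any(t in chunk.lower() for chunk in chunks)
```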
For all other queries:
- Retrieved chunks are concatenated (up to `max_context_chars`)
- A constrained prompt is built: "Answer using ONLY the context below"
- FLAN-T5 generates the final answer
- Supporting snippets are returned for traceability
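The context construction and prompt can be sketched as a simple character budget. The exact prompt wording and `max_context_chars` value in `rag/pipeline.py` may differ:

```python
def build_prompt(query: str, chunks: list[str],
                 max_context_chars: int = 2000) -> str:
    """Concatenate chunks up to a character budget, then wrap them
    in a context-constrained prompt for the LLM."""
    parts: list[str] = []
    used = 0
    for chunk in chunks:
        if used + len(chunk) > max_context_chars:
            break  # stay within the context budget
        parts.append(chunk)
        used += len(chunk)
    context = "\n\n".join(parts)
    return (
        "Answer using ONLY the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
```

Constraining the prompt to the retrieved context is what keeps FLAN-T5's answers grounded rather than free-form.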
- No black-box RAG chains
- Explicit retrieval → context construction → generation
- Transparent source attribution
- Dockerized and reproducible
```
rag-document-qa-api
│
├── rag/
│   ├── pipeline.py        # RAG pipeline (retrieval + generation logic)
│   └── ingestion.py       # Document loading, chunking, and indexing
│
├── tests/
│   ├── test_ingestion.py
│   └── test_pipeline.py
│
├── .dockerignore
├── .env.example
├── .gitignore
├── Dockerfile             # Container configuration
├── LICENSE
├── README.md
├── app.py                 # FastAPI application and API endpoints
├── requirements.txt       # Python dependencies
└── test.pdf               # Example PDF file for testing
```
- **FastAPI API layer** (`app.py`): handles the document ingestion and question-answering endpoints.
- **RAG pipeline** (`rag/pipeline.py`): implements retrieval, context construction, and LLM-based answer generation.
- **Document ingestion** (`rag/ingestion.py`): splits PDF documents into chunks, generates embeddings, and stores them in Chroma.
- **Vector database** (`chroma_index/`): persistent storage for document embeddings used during retrieval.
Uploads and indexes a PDF document.
Request (multipart/form-data):
- `file`: PDF document
Behavior:
- Saves the file to `/uploads`
- Builds a new Chroma index
- Reloads the RAG pipeline
Response:
```json
{
  "status": "ok",
  "filename": "example.pdf"
}
```

Answers a question based on the indexed document.
Request (application/json):
```json
{
  "query": "What is self-attention?"
}
```

Response:
```json
{
  "answer": "An attention mechanism relating different positions...",
  "sources": [
    {
      "snippet": "...",
      "metadata": { ... }
    }
  ]
}
```

- `answer`: the generated or deterministic response
- `sources`: the top supporting chunks, for transparency and traceability
```shell
git clone https://github.com/przemekwarnel/rag-document-qa-api.git
cd rag-document-qa-api
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
uvicorn app:app --reload
```

Then open http://127.0.0.1:8000/docs
```shell
docker build -t rag-bot .
docker run -p 8000:8000 rag-bot
```

Then open http://127.0.0.1:8000/docs
```shell
curl -X POST "http://127.0.0.1:8000/ingest" \
  -F "file=@test.pdf"

curl -X POST "http://127.0.0.1:8000/ask" \
  -H "Content-Type: application/json" \
  -d '{"query":"What is self-attention?"}'
```

- Retrieval is limited to top-k dense similarity search.
- "No" answers in mention-based queries mean:
"not found in top-k retrieved chunks" and do not guarantee absence in the full document.
- Context size is limited by `max_context_chars`.
- The system indexes one document at a time (a new ingest replaces the previous index).
- **Hybrid retrieval (dense + lexical search)**: combining semantic embeddings with a lexical method such as BM25 would improve recall for rare or domain-specific terms.
- **Cross-encoder reranking**: retrieved chunks could be reranked with a cross-encoder model to improve precision before the context is passed to the LLM.
- **Multi-document indexing**: instead of replacing the index on each ingest, the system could support multiple documents with metadata-based filtering.
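As a rough illustration of the hybrid-retrieval idea, scores from a dense retriever and a lexical retriever could be fused with a weighted sum. The `alpha` weight and function names are purely illustrative; a real implementation would also need to normalize the two score scales:

```python
def hybrid_score(dense: float, lexical: float, alpha: float = 0.5) -> float:
    """Linear fusion of (already normalized) dense and lexical scores."""
    return alpha * dense + (1 - alpha) * lexical

def rank_hybrid(dense_scores: list[float],
                lexical_scores: list[float],
                alpha: float = 0.5) -> list[int]:
    """Rank chunk indices by fused score, best first."""
    fused = [hybrid_score(d, l, alpha)
             for d, l in zip(dense_scores, lexical_scores)]
    return sorted(range(len(fused)), key=fused.__getitem__, reverse=True)
```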
The project intentionally favors:
- Explicit, understandable RAG flow
- Minimal abstraction layers
- Transparency over complexity