A modern, containerized, end-to-end Retrieval-Augmented Generation (RAG) system for document Q&A.
Building a robust RAG system involves more than just a script. It requires:
- Reliable Ingestion: Handling file uploads and chunking them intelligently.
- High-Quality Retrieval: Using state-of-the-art embedding models (`bge-m3`) and vector databases (Qdrant).
- Precision: Re-ranking results (`bge-reranker`) to ensure the LLM gets the best context, reducing hallucinations.
- Scalability: Decoupling the heavy ML inference from the lightweight application logic.
This project demonstrates a production-ready architecture for such a system.
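The retrieve-then-rerank flow at the heart of the pipeline looks roughly like this. A minimal sketch, assuming a populated Qdrant collection named `documents` whose payloads store chunk text under a `text` key; the collection name, payload key, and exact model IDs are illustrative assumptions, not the project's confirmed configuration:

```python
# Sketch of the retrieve-then-rerank flow (illustrative names throughout).
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer, CrossEncoder

embedder = SentenceTransformer("BAAI/bge-m3")        # dense embedding model
reranker = CrossEncoder("BAAI/bge-reranker-base")    # cross-encoder reranker
client = QdrantClient(url="http://localhost:6333")

query = "How does the ingestion pipeline chunk documents?"

# 1. Embed the query and pull candidate chunks from the vector store.
candidates = client.search(
    collection_name="documents",                     # assumed collection name
    query_vector=embedder.encode(query).tolist(),
    limit=20,
)

# 2. Re-rank candidates with the cross-encoder and keep the best few
#    as context for the LLM.
pairs = [(query, hit.payload["text"]) for hit in candidates]  # assumed payload key
scores = reranker.predict(pairs)
ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
top_context = [hit.payload["text"] for _, hit in ranked[:5]]
```

The cheap vector search casts a wide net (20 candidates), and the more expensive cross-encoder narrows it to the handful of chunks the LLM actually sees.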
- `backend/`: FastAPI application for orchestration. Managed with `uv`.
- `ml-api/`: Dedicated microservice for embeddings and reranking. Managed with `uv`.
- `frontend/`: React/Vite/Tailwind UI.
- `models_cache/`: Shared volume for storing downloaded ML models.
- `qdrant_data/`: Persistent storage for the vector database.
- `uploads/`: Storage for uploaded documents.
- Quickstart Guide: Learn how to set up and run the system (Docker & Local).
- Architecture: Deep dive into the system design, data flow, and stack choices.
- Modern Stack: Python 3.10+, React 18, FastAPI, Docker.
- Efficient Dependency Management: Uses `uv` for lightning-fast, reproducible Python environments.
- GPU Acceleration: `ml-api` is optimized for CUDA but degrades gracefully to CPU.
- Interactive UI: Clean, responsive chat interface.
- Unified ML API Endpoint: Embedding and reranking use the same service (`:8001`).
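The graceful CUDA-to-CPU fallback usually comes down to a single device check at startup. A minimal sketch of how the `ml-api` service might do this, assuming it uses PyTorch and FastAPI; the `/embed` route and request schema are hypothetical, not the service's actual API:

```python
# Hypothetical ml-api startup: pick CUDA when available, else fall back to CPU.
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"  # graceful degradation
model = SentenceTransformer("BAAI/bge-m3", device=device)

app = FastAPI()

class EmbedRequest(BaseModel):
    texts: list[str]

@app.post("/embed")  # illustrative route; the real service may differ
def embed(req: EmbedRequest) -> dict:
    vectors = model.encode(req.texts)
    return {"embeddings": vectors.tolist(), "device": device}
```

Because the device decision lives entirely inside `ml-api`, the backend calls the same `:8001` endpoint regardless of whether inference runs on GPU or CPU.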
Use the root Makefile for consistent local/CI commands:
```bash
make bootstrap
make lint
make test
make docker-build
make docker-up
make docker-smoke
make down
make clean
```
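To verify a running stack by hand (roughly the kind of check a smoke target performs), a minimal sketch, assuming the backend listens on `:8000`, the ML API on `:8001`, and that both expose a `/health` route; these ports and paths are assumptions, not the project's confirmed endpoints:

```python
# Hand-rolled smoke check (hypothetical endpoints; adjust to the real routes).
import sys
import requests

SERVICES = {
    "backend": "http://localhost:8000/health",   # assumed port and path
    "ml-api": "http://localhost:8001/health",    # assumed path
    "qdrant": "http://localhost:6333/healthz",   # Qdrant's built-in health route
}

failed = False
for name, url in SERVICES.items():
    try:
        ok = requests.get(url, timeout=5).status_code == 200
    except requests.RequestException:
        ok = False
    print(f"{name}: {'ok' if ok else 'FAILED'}")
    failed = failed or not ok

sys.exit(1 if failed else 0)
```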