A local, production-ready Retrieval-Augmented Generation (RAG) system for building a private knowledge base from documents stored in S3-compatible storage.
The system allows you to:
- upload documents (PDF, DOCX, TXT)
- automatically vectorize them via Airflow
- store embeddings in ChromaDB
- ask natural-language questions via a Streamlit UI
- generate grounded answers using GigaChat API
⸻
Main components:
- MinIO – S3-compatible object storage (documents)
- Airflow – ingestion & vectorization pipeline
- ChromaDB – persistent vector database
- PostgreSQL – service metadata + Airflow metadata
- Streamlit – user interface (upload, search, RAG)
- GigaChat API – LLM for answer generation (you can replace it with any other LLM)
User → Streamlit
↓
Vector Search (Chroma)
↓
Relevant chunks
↓
GigaChat (LLM)
↓
Answer
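The flow above can be sketched in miniature. Everything below is a hypothetical stand-in: the character-count embedding replaces the sentence-transformers model, the in-memory search replaces ChromaDB, and the prompt string is returned where a real deployment would call GigaChat.

```python
import math

def embed(text: str) -> list[float]:
    # Toy embedding: letter-frequency vector (stand-in for a real model).
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb) if na and nb else 1.0

def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    # Stand-in for the Chroma vector search: rank chunks by distance.
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine_distance(q, embed(c)))[:k]

def answer(question: str, chunks: list[str]) -> str:
    # A real system would send this prompt to the LLM (GigaChat).
    context = "\n".join(retrieve(question, chunks))
    return f"Answer using only this context:\n{context}\n\nQ: {question}"

chunks = ["MinIO stores the uploaded documents.",
          "Airflow vectorizes new files on a schedule."]
print(answer("Where are documents stored?", chunks))
```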
⸻
| Service | URL |
|---|---|
| Streamlit UI | http://localhost:8501 |
| Airflow UI | http://localhost:8080 |
| MinIO S3 API | http://localhost:9000 |
| MinIO Console | http://localhost:9001 |
| PostgreSQL (service DB) | localhost:15432 |
| PostgreSQL (Airflow DB) | localhost:15433 |
⸻
MinIO
- Access key: minioadmin
- Secret key: minioadmin
Service PostgreSQL
- User: jovyan
- Password: jovyan
- Database: service_s3
- Port: 15432
Airflow PostgreSQL
- User: airflow
- Password: airflow
- Database: airflow
- Port: 15433
Airflow UI
- Authentication: disabled (local only)
- All users are admins
⸻
Create a .env file in the project root:
# GigaChat
GIGACHAT_AUTH_KEY=BASE64_AUTHORIZATION_KEY_FROM_SBER
⸻
For the first launch, always run with a build:
docker compose down -v
docker compose up --build
Why:
- builds Airflow & Streamlit images
- installs Python dependencies
- initializes PostgreSQL
- creates persistent Chroma storage
⸻
If Dockerfiles were not changed:
docker compose up -d
⸻
DAG name
document_vectorisation
Responsibilities
- scan S3 bucket
- detect new or updated files (via ETag)
- extract text
- chunk documents
- generate embeddings
- store vectors in ChromaDB
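The chunking step in the list above can be sketched as fixed-size windows with overlap, so sentences straddling a boundary stay retrievable. This is a hypothetical sketch; `chunk_text` and its sizes are illustrative, and the real DAG's parameters may differ.

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    # Slide a fixed-size window over the text; consecutive chunks share
    # `overlap` characters so no sentence is lost at a boundary.
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, max(len(text), 1), step):
        piece = text[start:start + size]
        if piece:
            chunks.append(piece)
    return chunks

# Each chunk would then be embedded and upserted into ChromaDB keyed by
# file name + chunk index, with the S3 ETag recorded so unchanged files
# are skipped on the next DAG run.
```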
Chroma persistence
/opt/chroma
Mounted as a Docker volume and shared between Airflow and Streamlit.
⸻
Ask Questions
- natural-language questions
- vector search + LLM answer
- source attribution (file + chunk)
- relevance score filtering

Upload Documents
- PDF / DOCX / TXT
- stored in MinIO
- automatically picked up by Airflow

My Documents
- list uploaded files
- file size and last modified date
- vectorization status (based on embeddings presence)
⸻
- Embedding model: sentence-transformers/all-MiniLM-L6-v2
- Distance metric: cosine
- Lower distance = higher relevance
- Results are filtered by a configurable max distance threshold
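The threshold step can be sketched as a simple post-filter on the distances returned by the vector search. This is a hypothetical sketch: `filter_by_distance`, the default threshold, and the result shape are illustrative, not the app's actual code.

```python
def filter_by_distance(docs: list[str], distances: list[float],
                       max_distance: float = 0.6) -> list[tuple[str, float]]:
    # Keep only results whose cosine distance is at or below the threshold;
    # lower distance = higher relevance.
    return [(doc, dist) for doc, dist in zip(docs, distances)
            if dist <= max_distance]

hits = filter_by_distance(
    ["chunk A", "chunk B", "chunk C"],
    [0.21, 0.58, 0.97],
)
print(hits)  # the 0.97 result falls above the threshold and is dropped
```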
⸻
- OAuth token cached in memory (30 minutes)
- Used only for answer generation
- Retrieval is fully local and deterministic
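The token exchange behind the first bullet can be sketched with the standard library. Treat the details as assumptions to verify against Sber's GigaChat documentation (the Basic scheme and `RqUID` header follow its published OAuth flow); `gigachat_auth_headers` is a hypothetical helper.

```python
import os
import uuid

def gigachat_auth_headers() -> dict[str, str]:
    # The base64 authorization key comes from the .env file.
    key = os.environ["GIGACHAT_AUTH_KEY"]
    return {
        "Authorization": f"Basic {key}",
        "RqUID": str(uuid.uuid4()),  # unique request id expected by the API
        "Content-Type": "application/x-www-form-urlencoded",
    }

# POSTing these headers (with a scope in the form body) to the OAuth
# endpoint yields a short-lived access token, which the app caches in
# memory for ~30 minutes.
```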
- TLS certificate verification is disabled (verify=False) for GigaChat API requests; this is acceptable only for local development.
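At the TLS layer, verify=False amounts to skipping both certificate validation and hostname matching, as this standard-library sketch shows; `insecure_ssl_context` is a hypothetical helper, not part of the app.

```python
import ssl

def insecure_ssl_context() -> ssl.SSLContext:
    # Equivalent of verify=False: never do this outside local development.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False       # skip hostname matching
    ctx.verify_mode = ssl.CERT_NONE  # skip certificate validation
    return ctx
```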
⸻
All important data is persisted via Docker volumes:
- postgres_data
- airflow_db_data
- minio_data
- chroma_data
Removing volumes will reset the system.
⸻
MIT (local research and development use)
