A minimal, fully local Retrieval-Augmented Generation (RAG) pipeline built without frameworks. Implements document ingestion, chunking, dense embedding, vector retrieval, and LLM generation as discrete, inspectable steps. No LangChain, no cloud APIs, no hidden abstractions.
RAG grounds a language model's responses in a specific document corpus by retrieving the most semantically relevant passages at query time and conditioning generation on them.
The pipeline runs in two stages:
Ingestion (run once, or when documents change):

1. documents (`.txt` / `.pdf` / `.docx`)
2. extract text
3. split into overlapping character chunks
4. embed each chunk with a sentence-transformer
5. store chunks and embeddings in ChromaDB
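As a minimal sketch of the chunking step (step 3 above), using the `chunk_size`, `chunk_overlap`, and `min_chunk_len` defaults listed later in this README — the function name is illustrative, not necessarily the exact one in `ingest.py`:

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 100,
               min_chunk_len: int = 50) -> list[str]:
    """Split text into fixed-size character chunks that overlap by `chunk_overlap` characters."""
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size].strip()
        if len(chunk) >= min_chunk_len:  # drop trailing fragments shorter than the minimum
            chunks.append(chunk)
    return chunks
```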
Query (run any number of times):

1. your question
2. embed with the same model
3. find the top-k most similar chunks in ChromaDB
4. build a prompt: system instruction + retrieved context + question
5. stream the answer from a local Ollama model
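A minimal sketch of the query stage, assuming the defaults shown later in this README (`all-MiniLM-L6-v2`, collection `ethics`, `top_k = 3`, model `gemma3:1b`, Ollama on its default local port); the prompt wording and variable names are illustrative, not the exact ones in `query.py`:

```python
import json

import chromadb
import requests
from sentence_transformers import SentenceTransformer

# Embed the question with the same model used at ingest time
embedder = SentenceTransformer("all-MiniLM-L6-v2")
question = "What is the relationship between privacy and autonomy?"
query_embedding = embedder.encode(question).tolist()

# Retrieve the top-k most similar chunks from the persisted collection
client = chromadb.PersistentClient(path="chroma_db")
collection = client.get_collection("ethics")
results = collection.query(query_embeddings=[query_embedding], n_results=3)
for doc, dist in zip(results["documents"][0], results["distances"][0]):
    print(f"[distance {dist:.3f}] {doc[:80]}...")  # keep retrieval visible

# Build the prompt: system instruction + retrieved context + question
context = "\n\n".join(results["documents"][0])
prompt = (
    "Answer the question using only the context below.\n\n"  # illustrative system instruction
    f"Context:\n{context}\n\nQuestion: {question}"
)

# Stream the answer from a local Ollama model
response = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "gemma3:1b", "prompt": prompt, "stream": True},
    stream=True,
    timeout=120,
)
for line in response.iter_lines():
    if line:
        print(json.loads(line).get("response", ""), end="", flush=True)
```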
The CLI prints retrieved chunks and relevance scores before the answer, so the retrieval step is always visible and auditable.
| Component | Library | Notes |
|---|---|---|
| Text extraction | pypdf, python-docx | Handles `.pdf`, `.docx`, `.txt` |
| Embedding | sentence-transformers | `all-MiniLM-L6-v2`, runs fully local |
| Vector store | chromadb | Persisted to `chroma_db/` |
| LLM | Ollama | Runs locally; model is configurable |
```bash
# macOS and Linux
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model (pick one)
ollama pull gemma3      # recommended
ollama pull llama3.2    # alternative

# Start the server (keep this terminal open)
ollama serve
```

For Windows, download the installer from ollama.com.
```bash
python -m venv .venv

# macOS and Linux
source .venv/bin/activate

# Windows
.venv\Scripts\activate

pip install -r requirements.txt
```

Place `.txt`, `.pdf`, or `.docx` files into the `docs/` folder. The pipeline reads everything in that directory.
```bash
python ingest.py
```

This extracts, chunks, embeds, and stores all documents into ChromaDB. It only needs to be re-run when documents are added or changed.
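A minimal sketch of the embed-and-store step inside `ingest.py`, assuming the defaults from the configuration tables below; the id scheme and variable names are illustrative:

```python
import chromadb
from sentence_transformers import SentenceTransformer

# `chunks` is the list of overlapping character chunks produced earlier
chunks = ["...chunk 1...", "...chunk 2..."]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(chunks).tolist()

# Persist chunks and their embeddings to the local ChromaDB store
client = chromadb.PersistentClient(path="chroma_db")
collection = client.get_or_create_collection("ethics")
collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],  # illustrative id scheme
    documents=chunks,
    embeddings=embeddings,
)
```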
```bash
# Interactive mode
python query.py

# Pass a question directly
python query.py -q "What is the relationship between privacy and autonomy?"
```

All parameters are defined as constants at the top of each script.
`ingest.py`

| Variable | Default | Description |
|---|---|---|
| `embed_model` | `all-MiniLM-L6-v2` | Sentence-transformer model used for embedding |
| `chunk_size` | `500` | Maximum characters per chunk |
| `chunk_overlap` | `100` | Characters shared between consecutive chunks |
| `min_chunk_len` | `50` | Minimum chunk length to keep |
| `collection_name` | `ethics` | ChromaDB collection name |
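As a rough sketch, the constants block at the top of `ingest.py` might look like this (names and defaults taken from the table above; the exact formatting in the script may differ):

```python
# Tunable constants at the top of ingest.py (defaults from the table above)
embed_model = "all-MiniLM-L6-v2"  # sentence-transformer model used for embedding
chunk_size = 500                  # maximum characters per chunk
chunk_overlap = 100               # characters shared between consecutive chunks
min_chunk_len = 50                # minimum chunk length to keep
collection_name = "ethics"        # ChromaDB collection name
```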
`query.py`

| Variable | Default | Description |
|---|---|---|
| `ollama_model` | `gemma3:1b` | Ollama model used for generation |
| `top_k` | `3` | Number of chunks retrieved per query |
| `timeout` | `120` | Seconds before the Ollama request times out |
To switch models, update `ollama_model` in `query.py`:

```python
ollama_model = "llama3.2"
```

```
rag-from-scratch/
  ingest.py           extraction, chunking, embedding, storage
  query.py            retrieval, prompt construction, generation
  config.py
  requirements.txt
  docs/               place source documents here (gitignored)
  chroma_db/          auto-created on first ingest (gitignored)
  logs/               ingest.log and query.log (gitignored)
```
- **Character-level chunking over sentence splitting.** Fixed-size overlapping chunks are simpler, deterministic, and sufficient for dense academic prose, where sentence boundaries carry less structural weight than in conversational text.
- **`all-MiniLM-L6-v2` for embedding.** A strong general-purpose retrieval model at 80 MB. Fast enough to run on CPU without a GPU, and accurate enough for domain-specific retrieval.
- **ChromaDB over FAISS.** Persistence without serialization overhead. For a corpus of this size, approximate nearest-neighbour search is not necessary.
- Chunk boundaries are character-based, not semantic. A passage can split mid-sentence.
- No re-ranking step. Retrieved chunks are ordered by raw cosine similarity.
- Single collection per run. Querying across multiple document sets requires manual collection management.
- Ollama must be running locally before `query.py` is called (a quick health check is sketched below).
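A minimal sketch of such a check, assuming Ollama's default local port (11434); `query.py` itself may handle this differently:

```python
import requests

def ollama_is_running(base_url: str = "http://localhost:11434") -> bool:
    """Return True if a local Ollama server responds on its default port."""
    try:
        return requests.get(f"{base_url}/api/tags", timeout=2).ok
    except requests.ConnectionError:
        return False

if not ollama_is_running():
    raise SystemExit("Ollama is not running. Start it with `ollama serve` first.")
```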
Built as a learning project to understand how Retrieval-Augmented Generation works from the ground up.