A minimal, fully local Retrieval-Augmented Generation (RAG) pipeline built without frameworks. Implements document ingestion, chunking, dense embedding, vector retrieval, and LLM generation as discrete, inspectable steps. No LangChain, no cloud APIs, no hidden abstractions.
RAG grounds a language model's responses in a specific document corpus by retrieving the most semantically relevant passages at query time and conditioning generation on them.
The pipeline runs in two stages:
Ingestion (run once, or when documents change):

1. documents (`.txt` / `.pdf` / `.docx`)
2. extract text
3. split into overlapping character chunks
4. embed each chunk with a sentence-transformer
5. store chunks and embeddings in ChromaDB
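As a minimal sketch of the chunking step (step 3 above), using the `chunk_size`, `chunk_overlap`, and `min_chunk_len` defaults listed later in this README — the function name is illustrative, not necessarily the exact one in `ingest.py`:

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 100,
               min_chunk_len: int = 50) -> list[str]:
    """Split text into fixed-size character chunks that overlap by `chunk_overlap` characters."""
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size].strip()
        if len(chunk) >= min_chunk_len:  # drop trailing fragments shorter than the minimum
            chunks.append(chunk)
    return chunks
```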
Query (run any number of times):

1. your question
2. embed with the same model
3. find the top-k most similar chunks in ChromaDB
4. build a prompt: system instruction + retrieved context + question
5. stream the answer from a local Ollama model
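A minimal sketch of the query stage, assuming the defaults shown later in this README (`all-MiniLM-L6-v2`, collection `ethics`, `top_k = 3`, model `gemma3:1b`, Ollama on its default local port); the prompt wording and variable names are illustrative, not the exact ones in `query.py`:

```python
import json

import chromadb
import requests
from sentence_transformers import SentenceTransformer

# Embed the question with the same model used at ingest time
embedder = SentenceTransformer("all-MiniLM-L6-v2")
question = "What is the relationship between privacy and autonomy?"
query_embedding = embedder.encode(question).tolist()

# Retrieve the top-k most similar chunks from the persisted collection
client = chromadb.PersistentClient(path="chroma_db")
collection = client.get_collection("ethics")
results = collection.query(query_embeddings=[query_embedding], n_results=3)
for doc, dist in zip(results["documents"][0], results["distances"][0]):
    print(f"[distance {dist:.3f}] {doc[:80]}...")  # keep retrieval visible

# Build the prompt: system instruction + retrieved context + question
context = "\n\n".join(results["documents"][0])
prompt = (
    "Answer the question using only the context below.\n\n"  # illustrative system instruction
    f"Context:\n{context}\n\nQuestion: {question}"
)

# Stream the answer from a local Ollama model
response = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "gemma3:1b", "prompt": prompt, "stream": True},
    stream=True,
    timeout=120,
)
for line in response.iter_lines():
    if line:
        print(json.loads(line).get("response", ""), end="", flush=True)
```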
The CLI prints retrieved chunks and relevance scores before the answer, so the retrieval step is always visible and auditable.
| Component | Library | Notes |
|---|---|---|
| Text extraction | pypdf, python-docx | Handles `.pdf`, `.docx`, `.txt` |
| Embedding | sentence-transformers | `all-MiniLM-L6-v2`, runs fully local |
| Vector store | chromadb | Persisted to `chroma_db/` |
| LLM | Ollama | Runs locally; model is configurable |
```bash
# macOS and Linux
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model (pick one)
ollama pull gemma3      # recommended
ollama pull llama3.2    # alternative

# Start the server (keep this terminal open)
ollama serve
```

For Windows, download the installer from ollama.com.
```bash
python -m venv .venv

# macOS and Linux
source .venv/bin/activate

# Windows
.venv\Scripts\activate

pip install -r requirements.txt
```

Place `.txt`, `.pdf`, or `.docx` files into the `docs/` folder. The pipeline reads everything in that directory.
```bash
python ingest.py
```

This extracts, chunks, embeds, and stores all documents into ChromaDB. It only needs to be re-run when documents are added or changed.
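A minimal sketch of the embed-and-store step inside `ingest.py`, assuming the defaults from the configuration tables below; the id scheme and variable names are illustrative:

```python
import chromadb
from sentence_transformers import SentenceTransformer

# `chunks` is the list of overlapping character chunks produced earlier
chunks = ["...chunk 1...", "...chunk 2..."]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(chunks).tolist()

# Persist chunks and their embeddings to the local ChromaDB store
client = chromadb.PersistentClient(path="chroma_db")
collection = client.get_or_create_collection("ethics")
collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],  # illustrative id scheme
    documents=chunks,
    embeddings=embeddings,
)
```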
```bash
# Interactive mode
python query.py

# Pass a question directly
python query.py -q "What is the relationship between privacy and autonomy?"
```

All parameters are defined as constants at the top of each script.
`ingest.py`

| Variable | Default | Description |
|---|---|---|
| `embed_model` | `all-MiniLM-L6-v2` | Sentence-transformer model used for embedding |
| `chunk_size` | `500` | Maximum characters per chunk |
| `chunk_overlap` | `100` | Characters shared between consecutive chunks |
| `min_chunk_len` | `50` | Minimum chunk length to keep |
| `collection_name` | `ethics` | ChromaDB collection name |
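As a rough sketch, the constants block at the top of `ingest.py` might look like this (names and defaults taken from the table above; the exact formatting in the script may differ):

```python
# Tunable constants at the top of ingest.py (defaults from the table above)
embed_model = "all-MiniLM-L6-v2"  # sentence-transformer model used for embedding
chunk_size = 500                  # maximum characters per chunk
chunk_overlap = 100               # characters shared between consecutive chunks
min_chunk_len = 50                # minimum chunk length to keep
collection_name = "ethics"        # ChromaDB collection name
```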
`query.py`

| Variable | Default | Description |
|---|---|---|
| `ollama_model` | `gemma3:1b` | Ollama model used for generation |
| `top_k` | `3` | Number of chunks retrieved per query |
| `timeout` | `120` | Seconds before the Ollama request times out |
To switch models, update `ollama_model` in `query.py`:

```python
ollama_model = "llama3.2"
```

```
rag-from-scratch/
  ingest.py           extraction, chunking, embedding, storage
  query.py            retrieval, prompt construction, generation
  config.py
  requirements.txt
  docs/               place source documents here (gitignored)
  chroma_db/          auto-created on first ingest (gitignored)
  logs/               ingest.log and query.log (gitignored)
```
- **Character-level chunking over sentence splitting.** Fixed-size overlapping chunks are simpler, deterministic, and sufficient for dense academic prose, where sentence boundaries carry less structural weight than in conversational text.
- **`all-MiniLM-L6-v2` for embedding.** A strong general-purpose retrieval model at 80 MB. Fast enough to run on CPU without a GPU, and accurate enough for domain-specific retrieval.
- **ChromaDB over FAISS.** Persistence without serialization overhead. For a corpus of this size, approximate nearest-neighbour search is not necessary.
- Chunk boundaries are character-based, not semantic. A passage can split mid-sentence.
- No re-ranking step. Retrieved chunks are ordered by raw cosine similarity.
- Single collection per run. Querying across multiple document sets requires manual collection management.
- Ollama must be running locally before `query.py` is called (a quick health check is sketched below).
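A minimal sketch of such a check, assuming Ollama's default local port (11434); `query.py` itself may handle this differently:

```python
import requests

def ollama_is_running(base_url: str = "http://localhost:11434") -> bool:
    """Return True if a local Ollama server responds on its default port."""
    try:
        return requests.get(f"{base_url}/api/tags", timeout=2).ok
    except requests.ConnectionError:
        return False

if not ollama_is_running():
    raise SystemExit("Ollama is not running. Start it with `ollama serve` first.")
```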
Built as a learning project to understand how Retrieval-Augmented Generation works from the ground up.