⚙️ rag-from-scratch

A minimal, fully local Retrieval-Augmented Generation (RAG) pipeline built without frameworks. Implements document ingestion, chunking, dense embedding, vector retrieval, and LLM generation as discrete, inspectable steps. No LangChain, no cloud APIs, no hidden abstractions.


How it works

RAG grounds a language model's responses in a specific document corpus by retrieving the most semantically relevant passages at query time and conditioning generation on them.

The pipeline runs in two stages:

Ingestion (run once, or when documents change):

documents (.txt / .pdf / .docx)
    extract text
    split into overlapping character chunks
    embed each chunk with a sentence-transformer
    store chunks and embeddings in ChromaDB
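
In code, the embedding and storage part of ingestion looks roughly like the sketch below. The identifiers are illustrative, not copied from ingest.py; only the model name, collection name, and persistence path mirror the defaults listed under Configuration.

from sentence_transformers import SentenceTransformer
import chromadb

# Load the embedding model and open (or create) the persistent collection.
model = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.PersistentClient(path="chroma_db")
collection = client.get_or_create_collection("ethics")

chunks = ["...first chunk...", "...second chunk..."]   # produced by the chunking step
embeddings = model.encode(chunks).tolist()             # one dense vector per chunk

collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=embeddings,
)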

Query (run any number of times):

your question
    embed with the same model
    find the top-k most similar chunks in ChromaDB
    build a prompt: system instruction + retrieved context + question
    stream the answer from a local Ollama model

The CLI prints retrieved chunks and relevance scores before the answer, so the retrieval step is always visible and auditable.
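
The retrieval half of the query stage can be pictured roughly as follows, reusing the model and collection objects from the ingestion sketch above. The prompt wording is an illustrative assumption, not the exact template used by query.py.

question = "What is the relationship between privacy and autonomy?"
query_embedding = model.encode([question]).tolist()

# Ask ChromaDB for the 3 nearest chunks; distances are returned alongside
# the documents (smaller distance means more similar).
results = collection.query(query_embeddings=query_embedding, n_results=3)

for doc, dist in zip(results["documents"][0], results["distances"][0]):
    print(f"[distance {dist:.3f}] {doc[:80]}...")

context = "\n\n".join(results["documents"][0])
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {question}"
)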


Stack

Component         Library                  Notes
Text extraction   pypdf, python-docx       Handles .pdf, .docx, .txt
Embedding         sentence-transformers    all-MiniLM-L6-v2, runs fully local
Vector store      chromadb                 Persisted to chroma_db/
LLM               Ollama                   Runs locally; model is configurable
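
As a rough illustration of the extraction layer, the helper below dispatches on file extension using pypdf and python-docx. The function name and structure are assumptions rather than the repository's actual code.

import os
from docx import Document
from pypdf import PdfReader

def extract_text(path: str) -> str:
    # Dispatch on extension; plain .txt files are read directly.
    ext = os.path.splitext(path)[1].lower()
    if ext == ".pdf":
        reader = PdfReader(path)
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    if ext == ".docx":
        return "\n".join(p.text for p in Document(path).paragraphs)
    with open(path, encoding="utf-8") as f:
        return f.read()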

Setup

1. Install Ollama and pull a model
# macOS and Linux
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model (pick one)
ollama pull gemma3        # recommended
ollama pull llama3.2      # alternative

# Start the server (keep this terminal open)
ollama serve

For Windows, download the installer from ollama.com.

2. Create a Python environment
python -m venv .venv

# macOS and Linux
source .venv/bin/activate

# Windows
.venv\Scripts\activate

pip install -r requirements.txt

3. Add your documents

Place .txt, .pdf, or .docx files into the docs/ folder. The pipeline reads everything in that directory.

4. Ingest
python ingest.py

This extracts, chunks, embeds, and stores all documents in ChromaDB. It only needs to be re-run when documents are added or changed.

5. Query
# Interactive mode
python query.py

# Pass a question directly
python query.py -q "What is the relationship between privacy and autonomy?"
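
The entry point for these two modes presumably looks something like the sketch below; answer() is a hypothetical shorthand for the retrieve-then-generate steps described above.

import argparse

parser = argparse.ArgumentParser(description="Query the local RAG pipeline")
parser.add_argument("-q", "--question", help="ask a single question and exit")
args = parser.parse_args()

if args.question:
    answer(args.question)          # hypothetical helper: retrieve chunks, then generate
else:
    # Interactive mode: keep prompting until interrupted with Ctrl-C.
    while True:
        answer(input("Question: "))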

Configuration

All parameters are defined as constants at the top of each script.

ingest.py

Variable          Default            Description
embed_model       all-MiniLM-L6-v2   Sentence-transformer model used for embedding
chunk_size        500                Maximum characters per chunk
chunk_overlap     100                Characters shared between consecutive chunks
min_chunk_len     50                 Minimum chunk length to keep
collection_name   ethics             ChromaDB collection name
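
A sketch of what fixed-size overlapping chunking with these defaults could look like; the function name is illustrative.

def chunk_text(text, chunk_size=500, chunk_overlap=100, min_chunk_len=50):
    # Slide a fixed-size window over the text, stepping by chunk_size - chunk_overlap
    # so that consecutive chunks share chunk_overlap characters.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if len(chunk) >= min_chunk_len:   # drop tiny trailing fragments
            chunks.append(chunk)
    return chunks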

query.py

Variable       Default     Description
ollama_model   gemma3:1b   Ollama model used for generation
top_k          3           Number of chunks retrieved per query
timeout        120         Seconds before the Ollama request times out

To switch models, update ollama_model in query.py:

ollama_model = "llama3.2"
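
For reference, streaming a completion from a local Ollama server can be done against its REST API roughly as below. The prompt variable comes from the retrieval step; whether query.py calls the HTTP endpoint directly or goes through a client library is not specified here, so treat this as a general sketch rather than the repository's code.

import json
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "gemma3:1b", "prompt": prompt, "stream": True},
    stream=True,
    timeout=120,
)
resp.raise_for_status()

# Ollama streams newline-delimited JSON objects, each carrying a piece of the answer.
for line in resp.iter_lines():
    if line:
        print(json.loads(line).get("response", ""), end="", flush=True)
print()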

Repository structure

rag-from-scratch/
    ingest.py           extraction, chunking, embedding, storage
    query.py            retrieval, prompt construction, generation
    config.py 
    requirements.txt
    docs/               place source documents here (gitignored)
    chroma_db/          auto-created on first ingest (gitignored)
    logs/               ingest.log and query.log (gitignored)

Design decisions

  1. Character-level chunking over sentence splitting. Fixed-size overlapping chunks are simpler, deterministic, and sufficient for dense academic prose where sentence boundaries carry less structural weight than in conversational text.

  2. all-MiniLM-L6-v2 for embedding. A strong general-purpose retrieval model at 80MB: fast enough to run on CPU alone, and accurate enough for domain-specific retrieval.

  3. ChromaDB over FAISS. ChromaDB persists to disk out of the box, with no manual index serialization, and for a corpus of this size approximate nearest-neighbour search is unnecessary.


Limitations

  • Chunk boundaries are character-based, not semantic. A passage can split mid-sentence.
  • No re-ranking step. Retrieved chunks are ordered by raw cosine similarity.
  • Single collection per run. Querying across multiple document sets requires manual collection management.
  • Ollama must be running locally before query.py is called.

Built as a learning project to understand how Retrieval-Augmented Generation works from the ground up.
