# RAG Document Assistant

A Retrieval-Augmented Generation (RAG) Document Assistant that enables users to upload documents, index them in a vector database, and query them in natural language to receive AI-generated answers with source citations.
- Problem Statement
- Project Overview
- System Architecture
- Technical Approach
- How Endee Vector Database is Used
- Setup and Installation
- Running the Application
- Deployment on Streamlit Cloud
- Configuration Reference
- Project Structure
- Troubleshooting
## Problem Statement

Traditional document search relies on keyword matching, which fails to capture the semantic meaning of queries. Users often have large collections of documents (PDFs, text files, Markdown) and need to quickly find relevant information without manually reading through each document.
Challenges addressed by this project:
- **Semantic Understanding**: Keyword search cannot recognize that "machine learning" and "ML algorithms" are related concepts. Users need a system that understands meaning, not just matching words.
- **Multi-Document Search**: When information is spread across multiple documents, users need a unified search that can find and synthesize relevant passages from all sources.
- **Explainable Answers**: Users need to verify the accuracy of AI-generated answers by seeing the exact source passages used to generate them.
- **Scalable Vector Storage**: Efficiently storing and querying high-dimensional embedding vectors requires a specialized vector database that can handle similarity search at scale.
## Project Overview

The RAG Document Assistant solves these challenges by implementing a complete RAG pipeline:
- **Document Ingestion**: Accepts PDF, TXT, and Markdown files. Extracts text content and splits it into semantically meaningful chunks.
- **Embedding Generation**: Converts text chunks into 384-dimensional vectors using the `all-MiniLM-L6-v2` sentence transformer model.
- **Vector Storage**: Stores embeddings in Endee Vector Database for efficient similarity search.
- **Semantic Retrieval**: When a user asks a question, the system finds the most semantically similar chunks using cosine similarity.
- **Answer Generation**: Uses the Google Gemini API (or a local Mistral model) to generate a coherent answer based on the retrieved context.
- **Source Attribution**: Displays the retrieved chunks with relevance scores so users can verify the answer.
Key Features:
- Multi-format document support (PDF, TXT, Markdown)
- Semantic search using vector similarity
- AI-powered answer generation with source citations
- Configurable LLM backend (Gemini API or local Mistral)
- Docker-ready deployment
- Pickle fallback storage for Streamlit Cloud
## System Architecture

```
                 User Interface (Streamlit)
                            |
    +-----------------------+-----------------------+
    |                       |                       |
Document Upload        Query Input          Answer Display
    |                       |                       ^
    v                       v                       |
+---------------+   +---------------+     +---------------+
|   Ingestion   |   |   Retrieval   |     |  Generation   |
|    Module     |   |    Module     |     |    Module     |
+---------------+   +---------------+     +---------------+
    |                       |                       |
    |   +-----------+       |                       |
    +-->| Embedding |<------+                       |
        |  Module   |                               |
        +-----------+                               |
              |                                     |
              v                                     |
    +---------------------+                         |
    |   Endee Vector DB   |                         |
    | (or Pickle Fallback)|                         |
    +---------------------+                         |
                                                    |
                                          +-----------------+
                                          |   Gemini API    |
                                          |  (or Local LLM) |
                                          +-----------------+
```
Data Flow:

- **Ingestion Flow**: User uploads document → Text extraction → Chunking → Embedding generation → Vector storage in Endee
- **Query Flow**: User enters question → Query embedding → Vector similarity search → Top-K retrieval → Prompt construction → LLM generation → Answer display with sources
## Technical Approach

### Document Processing Pipeline

Documents are processed through a multi-stage pipeline:
- **Text Extraction**:
  - PDF files: extracted using PyPDF2, preserving page numbers
  - Markdown files: converted to HTML, then stripped to plain text
  - Text files: read directly with UTF-8 encoding
- **Chunking Strategy**:
  - Fixed-size chunks of 500 tokens with a 50-token overlap
  - Overlap ensures context is not lost at chunk boundaries
  - Each chunk retains metadata (source document, page number, chunk index)
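The chunking strategy can be sketched in a few lines. This is an illustrative sketch, not the project's actual `ingestion.py`: here "tokens" are simply list elements (e.g. whitespace-separated words), and `chunk_text` is a hypothetical helper.

```python
def chunk_text(tokens, chunk_size=500, overlap=50):
    """Split a token list into fixed-size chunks with overlap.

    Sketch only: the real ingestion module may tokenize differently.
    Consecutive chunks share `overlap` tokens so context is not lost
    at chunk boundaries.
    """
    step = chunk_size - overlap
    chunks = []
    for start in range(0, max(len(tokens) - overlap, 1), step):
        chunks.append(tokens[start:start + chunk_size])
    return chunks

tokens = [f"t{i}" for i in range(1200)]  # 1200 pseudo-tokens
chunks = chunk_text(tokens)
# 1200 tokens -> chunks of 500/500/300; adjacent chunks share 50 tokens
```

With these parameters, the last 50 tokens of each chunk reappear as the first 50 tokens of the next, so a sentence straddling a boundary is always fully contained in at least one chunk.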
### Embedding Model

The system uses the `all-MiniLM-L6-v2` sentence transformer model:
- Dimension: 384-dimensional vectors
- Speed: Fast inference suitable for real-time applications
- Quality: Optimized for semantic similarity tasks
- Memory: ~80MB model size, suitable for CPU inference
### Similarity Search

Similarity search uses cosine similarity:

```
similarity = (A · B) / (||A|| × ||B||)
```
Where A is the query embedding and B is a stored document embedding. Higher similarity scores indicate more relevant content.
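Computed with NumPy (as the pickle fallback does), the formula looks like this; the function name here is illustrative:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between a query embedding A and a stored embedding B."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = [1.0, 0.0, 0.0]
doc = [1.0, 1.0, 0.0]
score = cosine_similarity(query, doc)
print(round(score, 4))  # → 0.7071
```

A score of 1.0 means the vectors point in the same direction (maximally similar); scores near 0 indicate unrelated content.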
### Prompt Construction

The prompt sent to the LLM follows this structure:
```
You are a helpful assistant. Answer the question based ONLY on the provided context.

CONTEXT:
[Source 1: document_name.pdf, Page 3]
<chunk text>

[Source 2: document_name.pdf, Page 7]
<chunk text>

QUESTION: <user question>

INSTRUCTIONS:
- Answer based only on the provided context
- If the answer is not in the context, say so
- Cite your sources
```
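A sketch of how such a prompt might be assembled from retrieved chunks. `build_prompt` is a hypothetical helper, not the project's actual code; the metadata keys (`text`, `document_name`, `page_number`) match the chunk metadata described elsewhere in this README.

```python
def build_prompt(question, chunks):
    """Assemble the RAG prompt from retrieved chunks (illustrative sketch)."""
    sources = "\n\n".join(
        f"[Source {i}: {c['document_name']}, Page {c['page_number']}]\n{c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "You are a helpful assistant. Answer the question based ONLY on "
        "the provided context.\n\n"
        f"CONTEXT:\n{sources}\n\n"
        f"QUESTION: {question}\n\n"
        "INSTRUCTIONS:\n"
        "- Answer based only on the provided context\n"
        "- If the answer is not in the context, say so\n"
        "- Cite your sources"
    )

prompt = build_prompt(
    "What was Q3 revenue?",
    [{"text": "Q3 revenue was $5M.", "document_name": "report.pdf", "page_number": 3}],
)
```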
### LLM Backends

Two LLM backends are supported:

- **Google Gemini API (recommended)**:
  - Model: `gemini-2.0-flash`
  - Fast, cost-effective, high-quality responses
  - Requires an API key from Google AI Studio
- **Local Mistral Model (optional)**:
  - Model: `mistralai/Mistral-7B-Instruct-v0.2`
  - Runs locally, no API costs
  - Requires ~14GB RAM
## How Endee Vector Database is Used

Endee is a high-performance vector database designed for similarity search. This project uses Endee for index creation, chunk storage, and semantic retrieval.

### Index Creation

An index named `rag_documents` is created to store document embeddings:
```python
client.create_index(
    name="rag_documents",
    dimension=384,            # Matches embedding model output
    space_type="cosine",      # Cosine similarity metric
    precision=Precision.FLOAT32,
)
```

### Storing Document Chunks

When documents are indexed, each chunk is stored with its embedding and metadata:
```python
index.upsert([
    {
        "id": "chunk_uuid",
        "vector": [0.1, 0.2, ...],  # 384-dim embedding
        "meta": {
            "text": "Original chunk text...",
            "document_name": "report.pdf",
            "page_number": 5,
            "chunk_index": 12,
        },
    }
])
```

### Semantic Search

When a user asks a question, Endee performs efficient similarity search:
```python
results = index.query(
    vector=query_embedding,  # User question embedding
    top_k=4,                 # Return the 4 most similar chunks
)
```

Results are returned sorted by similarity score, with full metadata for source attribution.

### Pickle Fallback

For environments without Endee (such as Streamlit Cloud), the system automatically falls back to a pickle-based vector store that implements the same interface, using NumPy for cosine similarity calculations.
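A minimal version of such a fallback might look like the following. This is a sketch under the assumption that it mirrors the Endee `upsert`/`query` interface shown above; it is not the project's actual `retrieval.py`.

```python
import numpy as np

class FallbackVectorStore:
    """In-memory stand-in for the Endee index (illustrative sketch).

    Records are plain dicts, so the whole store can be pickled to disk.
    """

    def __init__(self):
        self.records = []  # list of {"id", "vector", "meta"} dicts

    def upsert(self, items):
        self.records.extend(items)

    def query(self, vector, top_k=4):
        """Brute-force cosine-similarity search over all stored records."""
        q = np.asarray(vector, dtype=float)
        scored = []
        for rec in self.records:
            v = np.asarray(rec["vector"], dtype=float)
            score = float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
            scored.append({"id": rec["id"], "score": score, "meta": rec["meta"]})
        scored.sort(key=lambda r: r["score"], reverse=True)
        return scored[:top_k]

store = FallbackVectorStore()
store.upsert([
    {"id": "a", "vector": [1.0, 0.0], "meta": {"text": "about cats"}},
    {"id": "b", "vector": [0.0, 1.0], "meta": {"text": "about dogs"}},
])
hits = store.query([0.9, 0.1], top_k=1)
# The closest record is "a" (highest cosine similarity to the query)
```

Brute-force search is O(n) per query, which is fine for small demo corpora but is exactly the scaling problem a dedicated vector database solves.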
## Setup and Installation

### Prerequisites

- Python 3.10 or higher
- Docker (for running Endee locally)
- Gemini API key (from https://aistudio.google.com/app/apikey)
Clone the repository:

```bash
git clone https://github.com/yourusername/EndeeProject.git
cd EndeeProject
```

Run Endee in Docker:

```bash
docker run -d -p 8080:8080 -v endee-data:/data --name endee-server endeeio/endee-server:latest
```

Verify Endee is running:

```bash
curl http://localhost:8080/health
```

Expected response: `{"status":"ok"}`
Create and activate a virtual environment.

Windows:

```bash
python -m venv venv
venv\Scripts\activate
```

Linux/Mac:

```bash
python -m venv venv
source venv/bin/activate
```

Install dependencies:

```bash
pip install -r requirements.txt
```

Copy the example environment file.

Windows:

```bash
copy .env.example .env
```

Linux/Mac:

```bash
cp .env.example .env
```

Edit `.env` with your configuration:
```ini
# Endee Configuration
ENDEE_HOST=localhost
ENDEE_PORT=8080

# Storage Backend (set to true for Streamlit Cloud)
USE_PICKLE_STORAGE=false

# LLM Configuration
USE_LOCAL_LLM=false
GEMINI_API_KEY=your_gemini_api_key_here
GEMINI_MODEL=gemini-2.0-flash

# Application Settings
MAX_FILE_SIZE_MB=50
TOP_K_RETRIEVAL=4
```

## Running the Application

Start the Streamlit application:
```bash
streamlit run app/main.py
```

The application will be available at http://localhost:8501.
For a complete containerized deployment:

```bash
docker-compose up --build
```

This starts both Endee and the RAG application.

```bash
# Start in background
docker-compose up -d

# View application logs
docker-compose logs -f rag-app

# Stop all services
docker-compose down

# Stop and remove all data
docker-compose down -v
```

## Deployment on Streamlit Cloud

Streamlit Cloud cannot run Docker containers, so the project provides a pickle-based fallback storage that works without Endee.
Ensure your code is pushed to GitHub:

```bash
git add .
git commit -m "Prepare for Streamlit Cloud"
git push origin main
```

Create `.streamlit/secrets.toml` in your repository (and add it to `.gitignore`):

```toml
USE_PICKLE_STORAGE = "true"
GEMINI_API_KEY = "your-gemini-api-key"
GEMINI_MODEL = "gemini-2.0-flash"
USE_LOCAL_LLM = "false"
```

Then deploy:

- Go to https://share.streamlit.io
- Click "New app"
- Connect your GitHub repository
- Set the main file path: `app/main.py`
- Add secrets in "Advanced settings" (paste from `secrets.toml`)
- Click "Deploy"
Notes:

- Pickle storage is held in memory and resets when the app redeploys
- For persistent storage, deploy Endee on a cloud VM and configure ENDEE_HOST
- Local Mistral model is not available on Streamlit Cloud due to memory limits
## Configuration Reference

| Variable | Default | Description |
|---|---|---|
| `ENDEE_HOST` | `localhost` | Endee server hostname |
| `ENDEE_PORT` | `8080` | Endee server port |
| `USE_PICKLE_STORAGE` | `false` | Use pickle file storage instead of Endee |
| `USE_LOCAL_LLM` | `false` | Use local Mistral model instead of Gemini |
| `GEMINI_API_KEY` | - | Google Gemini API key (required if not using local LLM) |
| `GEMINI_MODEL` | `gemini-2.0-flash` | Gemini model identifier |
| `MAX_FILE_SIZE_MB` | `50` | Maximum uploaded file size in megabytes |
| `TOP_K_RETRIEVAL` | `4` | Number of chunks to retrieve per query |
## Project Structure

```
EndeeProject/
├── app/
│   ├── main.py            # Streamlit user interface
│   ├── ingestion.py       # Document processing and chunking
│   ├── embedding.py       # Sentence transformer embedding generation
│   ├── retrieval.py       # Vector storage and search (Endee/Pickle)
│   ├── generation.py      # LLM integration (Gemini/Mistral)
│   └── utils.py           # Utilities, logging, error handling
├── config/
│   └── settings.py        # Configuration management using pydantic
├── data/                  # Uploaded documents and pickle store
├── models/                # Cached sentence transformer models
├── logs/                  # Application logs
├── .env.example           # Environment variable template
├── requirements.txt       # Python dependencies
├── Dockerfile             # Application container definition
├── docker-compose.yml     # Multi-container orchestration
└── README.md              # This documentation
```
## Troubleshooting

**Error:** `Failed to connect to Endee at localhost:8080`

Solution: Verify Endee is running:

```bash
docker ps | grep endee
```

If it is not running:

```bash
docker run -d -p 8080:8080 -v endee-data:/data --name endee-server endeeio/endee-server:latest
```

**Error:** `You exceeded your current quota`

Solution:

- Wait 30-60 seconds and retry
- Use "Sources Only" mode for testing without LLM calls
- Check your quota at https://aistudio.google.com

**Error:** `models/gemini-1.5-flash is not found`

Solution: Update `GEMINI_MODEL` in `.env` to `gemini-2.0-flash`.

**Problem:** Queries return no results

Cause: No documents have been indexed.

Solution: Upload and index documents using the sidebar before querying.

**Error:** Application crashes when using local Mistral

Solution: Local Mistral requires ~14GB RAM. Use the Gemini API instead by setting `USE_LOCAL_LLM=false`.
## License

This project is licensed under the MIT License.
## Acknowledgments

- Endee (https://endee.io) - high-performance vector database
- Streamlit (https://streamlit.io) - Python web application framework
- Google Gemini (https://ai.google.dev) - large language model API
- Sentence Transformers (https://www.sbert.net) - embedding models