# RAG Document Assistant

A Retrieval-Augmented Generation (RAG) Document Assistant that enables users to upload documents, index them in a vector database, and query them in natural language to receive AI-generated answers with source citations.
- Problem Statement
- Project Overview
- System Architecture
- Technical Approach
- How Endee Vector Database is Used
- Setup and Installation
- Running the Application
- Deployment on Streamlit Cloud
- Configuration Reference
- Project Structure
- Troubleshooting
## Problem Statement

Traditional document search relies on keyword matching, which fails to capture the semantic meaning of queries. Users often have large collections of documents (PDFs, text files, Markdown) and need to quickly find relevant information without manually reading through each document.
Challenges addressed by this project:
- **Semantic Understanding**: Keyword search cannot recognize that "machine learning" and "ML algorithms" are related concepts. Users need a system that understands meaning, not just matching words.
- **Multi-Document Search**: When information is spread across multiple documents, users need a unified search that can find and synthesize relevant passages from all sources.
- **Explainable Answers**: Users need to verify the accuracy of AI-generated answers by seeing the exact source passages used to generate them.
- **Scalable Vector Storage**: Efficiently storing and querying high-dimensional embedding vectors requires a specialized vector database that can handle similarity search at scale.
## Project Overview

The RAG Document Assistant solves these challenges by implementing a complete RAG pipeline:
- **Document Ingestion**: Accepts PDF, TXT, and Markdown files. Extracts text content and splits it into semantically meaningful chunks.
- **Embedding Generation**: Converts text chunks into 384-dimensional vectors using the `all-MiniLM-L6-v2` sentence transformer model.
- **Vector Storage**: Stores embeddings in Endee Vector Database for efficient similarity search.
- **Semantic Retrieval**: When a user asks a question, the system finds the most semantically similar chunks using cosine similarity.
- **Answer Generation**: Uses the Google Gemini API (or a local Mistral model) to generate a coherent answer based on the retrieved context.
- **Source Attribution**: Displays the retrieved chunks with relevance scores so users can verify the answer.
Key Features:
- Multi-format document support (PDF, TXT, Markdown)
- Semantic search using vector similarity
- AI-powered answer generation with source citations
- Configurable LLM backend (Gemini API or local Mistral)
- Docker-ready deployment
- Pickle fallback storage for Streamlit Cloud
## System Architecture

```
                 User Interface (Streamlit)
                            |
    +-----------------------+-----------------------+
    |                       |                       |
Document Upload        Query Input          Answer Display
    |                       |                       ^
    v                       v                       |
+---------------+   +---------------+     +---------------+
|   Ingestion   |   |   Retrieval   |     |  Generation   |
|    Module     |   |    Module     |     |    Module     |
+---------------+   +---------------+     +---------------+
    |                       |                       |
    |   +-----------+       |                       |
    +-->| Embedding |<------+                       |
        |  Module   |                               |
        +-----------+                               |
              |                                     |
              v                                     |
    +---------------------+                         |
    |   Endee Vector DB   |                         |
    | (or Pickle Fallback)|                         |
    +---------------------+                         |
                                                    |
                                          +-----------------+
                                          |   Gemini API    |
                                          |  (or Local LLM) |
                                          +-----------------+
```
Data Flow:

- **Ingestion Flow**: User uploads document → Text extraction → Chunking → Embedding generation → Vector storage in Endee
- **Query Flow**: User enters question → Query embedding → Vector similarity search → Top-K retrieval → Prompt construction → LLM generation → Answer display with sources
## Technical Approach

### Document Processing Pipeline

Documents are processed through a multi-stage pipeline:
- **Text Extraction**:
  - PDF files: extracted using PyPDF2, preserving page numbers
  - Markdown files: converted to HTML, then stripped to plain text
  - Text files: read directly with UTF-8 encoding
- **Chunking Strategy**:
  - Fixed-size chunks of 500 tokens with a 50-token overlap
  - Overlap ensures context is not lost at chunk boundaries
  - Each chunk retains metadata (source document, page number, chunk index)
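The chunking strategy can be sketched in a few lines. This is an illustrative sketch, not the project's actual `ingestion.py`: here "tokens" are simply list elements (e.g. whitespace-separated words), and `chunk_text` is a hypothetical helper.

```python
def chunk_text(tokens, chunk_size=500, overlap=50):
    """Split a token list into fixed-size chunks with overlap.

    Sketch only: the real ingestion module may tokenize differently.
    Consecutive chunks share `overlap` tokens so context is not lost
    at chunk boundaries.
    """
    step = chunk_size - overlap
    chunks = []
    for start in range(0, max(len(tokens) - overlap, 1), step):
        chunks.append(tokens[start:start + chunk_size])
    return chunks

tokens = [f"t{i}" for i in range(1200)]  # 1200 pseudo-tokens
chunks = chunk_text(tokens)
# 1200 tokens -> chunks of 500/500/300; adjacent chunks share 50 tokens
```

With these parameters, the last 50 tokens of each chunk reappear as the first 50 tokens of the next, so a sentence straddling a boundary is always fully contained in at least one chunk.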
### Embedding Model

The system uses the `all-MiniLM-L6-v2` sentence transformer model:
- Dimension: 384-dimensional vectors
- Speed: Fast inference suitable for real-time applications
- Quality: Optimized for semantic similarity tasks
- Memory: ~80MB model size, suitable for CPU inference
### Similarity Search

Similarity search uses cosine similarity:

```
similarity = (A · B) / (||A|| × ||B||)
```
Where A is the query embedding and B is a stored document embedding. Higher similarity scores indicate more relevant content.
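Computed with NumPy (as the pickle fallback does), the formula looks like this; the function name here is illustrative:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between a query embedding A and a stored embedding B."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = [1.0, 0.0, 0.0]
doc = [1.0, 1.0, 0.0]
score = cosine_similarity(query, doc)
print(round(score, 4))  # → 0.7071
```

A score of 1.0 means the vectors point in the same direction (maximally similar); scores near 0 indicate unrelated content.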
### Prompt Construction

The prompt sent to the LLM follows this structure:
```
You are a helpful assistant. Answer the question based ONLY on the provided context.

CONTEXT:
[Source 1: document_name.pdf, Page 3]
<chunk text>

[Source 2: document_name.pdf, Page 7]
<chunk text>

QUESTION: <user question>

INSTRUCTIONS:
- Answer based only on the provided context
- If the answer is not in the context, say so
- Cite your sources
```
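A sketch of how such a prompt might be assembled from retrieved chunks. `build_prompt` is a hypothetical helper, not the project's actual code; the metadata keys (`text`, `document_name`, `page_number`) match the chunk metadata described elsewhere in this README.

```python
def build_prompt(question, chunks):
    """Assemble the RAG prompt from retrieved chunks (illustrative sketch)."""
    sources = "\n\n".join(
        f"[Source {i}: {c['document_name']}, Page {c['page_number']}]\n{c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "You are a helpful assistant. Answer the question based ONLY on "
        "the provided context.\n\n"
        f"CONTEXT:\n{sources}\n\n"
        f"QUESTION: {question}\n\n"
        "INSTRUCTIONS:\n"
        "- Answer based only on the provided context\n"
        "- If the answer is not in the context, say so\n"
        "- Cite your sources"
    )

prompt = build_prompt(
    "What was Q3 revenue?",
    [{"text": "Q3 revenue was $5M.", "document_name": "report.pdf", "page_number": 3}],
)
```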
### LLM Backends

Two LLM backends are supported:

- **Google Gemini API (recommended)**:
  - Model: `gemini-2.0-flash`
  - Fast, cost-effective, high-quality responses
  - Requires an API key from Google AI Studio
- **Local Mistral Model (optional)**:
  - Model: `mistralai/Mistral-7B-Instruct-v0.2`
  - Runs locally, no API costs
  - Requires ~14GB RAM
## How Endee Vector Database is Used

Endee is a high-performance vector database designed for similarity search. This project uses Endee for index creation, chunk storage, and semantic retrieval.

### Index Creation

An index named `rag_documents` is created to store document embeddings:
```python
client.create_index(
    name="rag_documents",
    dimension=384,            # Matches embedding model output
    space_type="cosine",      # Cosine similarity metric
    precision=Precision.FLOAT32,
)
```

### Storing Document Chunks

When documents are indexed, each chunk is stored with its embedding and metadata:
```python
index.upsert([
    {
        "id": "chunk_uuid",
        "vector": [0.1, 0.2, ...],  # 384-dim embedding
        "meta": {
            "text": "Original chunk text...",
            "document_name": "report.pdf",
            "page_number": 5,
            "chunk_index": 12,
        },
    }
])
```

### Semantic Search

When a user asks a question, Endee performs efficient similarity search:
```python
results = index.query(
    vector=query_embedding,  # User question embedding
    top_k=4,                 # Return the 4 most similar chunks
)
```

Results are returned sorted by similarity score, with full metadata for source attribution.

### Pickle Fallback

For environments without Endee (such as Streamlit Cloud), the system automatically falls back to a pickle-based vector store that implements the same interface, using NumPy for cosine similarity calculations.
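A minimal version of such a fallback might look like the following. This is a sketch under the assumption that it mirrors the Endee `upsert`/`query` interface shown above; it is not the project's actual `retrieval.py`.

```python
import numpy as np

class FallbackVectorStore:
    """In-memory stand-in for the Endee index (illustrative sketch).

    Records are plain dicts, so the whole store can be pickled to disk.
    """

    def __init__(self):
        self.records = []  # list of {"id", "vector", "meta"} dicts

    def upsert(self, items):
        self.records.extend(items)

    def query(self, vector, top_k=4):
        """Brute-force cosine-similarity search over all stored records."""
        q = np.asarray(vector, dtype=float)
        scored = []
        for rec in self.records:
            v = np.asarray(rec["vector"], dtype=float)
            score = float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
            scored.append({"id": rec["id"], "score": score, "meta": rec["meta"]})
        scored.sort(key=lambda r: r["score"], reverse=True)
        return scored[:top_k]

store = FallbackVectorStore()
store.upsert([
    {"id": "a", "vector": [1.0, 0.0], "meta": {"text": "about cats"}},
    {"id": "b", "vector": [0.0, 1.0], "meta": {"text": "about dogs"}},
])
hits = store.query([0.9, 0.1], top_k=1)
# The closest record is "a" (highest cosine similarity to the query)
```

Brute-force search is O(n) per query, which is fine for small demo corpora but is exactly the scaling problem a dedicated vector database solves.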
## Setup and Installation

### Prerequisites

- Python 3.10 or higher
- Docker (for running Endee locally)
- Gemini API key (from https://aistudio.google.com/app/apikey)
Clone the repository:

```bash
git clone https://github.com/yourusername/EndeeProject.git
cd EndeeProject
```

Run Endee in Docker:

```bash
docker run -d -p 8080:8080 -v endee-data:/data --name endee-server endeeio/endee-server:latest
```

Verify Endee is running:

```bash
curl http://localhost:8080/health
```

Expected response: `{"status":"ok"}`
Create and activate a virtual environment.

Windows:

```bash
python -m venv venv
venv\Scripts\activate
```

Linux/Mac:

```bash
python -m venv venv
source venv/bin/activate
```

Install dependencies:

```bash
pip install -r requirements.txt
```

Copy the example environment file.

Windows:

```bash
copy .env.example .env
```

Linux/Mac:

```bash
cp .env.example .env
```

Edit `.env` with your configuration:
```ini
# Endee Configuration
ENDEE_HOST=localhost
ENDEE_PORT=8080

# Storage Backend (set to true for Streamlit Cloud)
USE_PICKLE_STORAGE=false

# LLM Configuration
USE_LOCAL_LLM=false
GEMINI_API_KEY=your_gemini_api_key_here
GEMINI_MODEL=gemini-2.0-flash

# Application Settings
MAX_FILE_SIZE_MB=50
TOP_K_RETRIEVAL=4
```

## Running the Application

Start the Streamlit application:
```bash
streamlit run app/main.py
```

The application will be available at http://localhost:8501.
For a complete containerized deployment:

```bash
docker-compose up --build
```

This starts both Endee and the RAG application.

```bash
# Start in background
docker-compose up -d

# View application logs
docker-compose logs -f rag-app

# Stop all services
docker-compose down

# Stop and remove all data
docker-compose down -v
```

## Deployment on Streamlit Cloud

Streamlit Cloud cannot run Docker containers, so the project provides a pickle-based fallback storage that works without Endee.
Ensure your code is pushed to GitHub:

```bash
git add .
git commit -m "Prepare for Streamlit Cloud"
git push origin main
```

Create `.streamlit/secrets.toml` in your repository (and add it to `.gitignore`):

```toml
USE_PICKLE_STORAGE = "true"
GEMINI_API_KEY = "your-gemini-api-key"
GEMINI_MODEL = "gemini-2.0-flash"
USE_LOCAL_LLM = "false"
```

Then deploy:

- Go to https://share.streamlit.io
- Click "New app"
- Connect your GitHub repository
- Set the main file path: `app/main.py`
- Add secrets in "Advanced settings" (paste from `secrets.toml`)
- Click "Deploy"
Notes:

- Pickle storage is held in memory and resets when the app redeploys
- For persistent storage, deploy Endee on a cloud VM and configure ENDEE_HOST
- Local Mistral model is not available on Streamlit Cloud due to memory limits
## Configuration Reference

| Variable | Default | Description |
|---|---|---|
| `ENDEE_HOST` | `localhost` | Endee server hostname |
| `ENDEE_PORT` | `8080` | Endee server port |
| `USE_PICKLE_STORAGE` | `false` | Use pickle file storage instead of Endee |
| `USE_LOCAL_LLM` | `false` | Use local Mistral model instead of Gemini |
| `GEMINI_API_KEY` | - | Google Gemini API key (required if not using local LLM) |
| `GEMINI_MODEL` | `gemini-2.0-flash` | Gemini model identifier |
| `MAX_FILE_SIZE_MB` | `50` | Maximum uploaded file size in megabytes |
| `TOP_K_RETRIEVAL` | `4` | Number of chunks to retrieve per query |
## Project Structure

```
EndeeProject/
├── app/
│   ├── main.py            # Streamlit user interface
│   ├── ingestion.py       # Document processing and chunking
│   ├── embedding.py       # Sentence transformer embedding generation
│   ├── retrieval.py       # Vector storage and search (Endee/Pickle)
│   ├── generation.py      # LLM integration (Gemini/Mistral)
│   └── utils.py           # Utilities, logging, error handling
├── config/
│   └── settings.py        # Configuration management using pydantic
├── data/                  # Uploaded documents and pickle store
├── models/                # Cached sentence transformer models
├── logs/                  # Application logs
├── .env.example           # Environment variable template
├── requirements.txt       # Python dependencies
├── Dockerfile             # Application container definition
├── docker-compose.yml     # Multi-container orchestration
└── README.md              # This documentation
```
## Troubleshooting

**Error:** `Failed to connect to Endee at localhost:8080`

Solution: Verify Endee is running:

```bash
docker ps | grep endee
```

If it is not running:

```bash
docker run -d -p 8080:8080 -v endee-data:/data --name endee-server endeeio/endee-server:latest
```

**Error:** `You exceeded your current quota`

Solution:

- Wait 30-60 seconds and retry
- Use "Sources Only" mode for testing without LLM calls
- Check your quota at https://aistudio.google.com

**Error:** `models/gemini-1.5-flash is not found`

Solution: Update `GEMINI_MODEL` in `.env` to `gemini-2.0-flash`.

**Problem:** Queries return no results

Cause: No documents have been indexed.

Solution: Upload and index documents using the sidebar before querying.

**Error:** Application crashes when using local Mistral

Solution: Local Mistral requires ~14GB RAM. Use the Gemini API instead by setting `USE_LOCAL_LLM=false`.
## License

This project is licensed under the MIT License.
## Acknowledgments

- Endee (https://endee.io) - high-performance vector database
- Streamlit (https://streamlit.io) - Python web application framework
- Google Gemini (https://ai.google.dev) - large language model API
- Sentence Transformers (https://www.sbert.net) - embedding models