A Retrieval-Augmented Generation (RAG) chatbot that lets employees query company policy documents using natural language — filtered by department, region, policy type, and year.
Built with LangChain · ChromaDB · Gemini · Streamlit · BAAI/bge-small-en-v1.5
rag-policy-chatbot/
├── data/
│ ├── india_leave_policy_2024.pdf
│ ├── us_expense_policy_2024.pdf
│ └── metadata_manifest.json # Maps filenames → metadata
├── chroma_db/ # Auto-generated vector store (gitignored)
├── src/
│ ├── __init__.py
│ ├── config.py # All constants and env vars
│ ├── embeddings.py # Embedding model setup
│ ├── ingest.py # Indexing pipeline
│ ├── retriever.py # Filtered retrieval logic
│ └── chain.py # LangChain QA chain
├── tests/
│ ├── test_embedding.py
│ ├── test_retriever.py
│ └── test_end_to_end.py
├── app.py # Streamlit frontend
├── check.py # Quick ChromaDB health check
├── pytest.ini
├── requirement.txt
├── .gitignore
└── README.md
| Component | Tool |
|---|---|
| Orchestration | LangChain |
| Vector Database | ChromaDB |
| LLM | Gemini 2.5 Flash (Google) |
| Embedding Model | BAAI/bge-small-en-v1.5 |
| Frontend | Streamlit |
| Language | Python 3.10+ |
git clone https://github.com/your-username/rag-policy-chatbot.git
cd rag-policy-chatbotpython -m venv venv
# macOS / Linux
source venv/bin/activate
# Windows
venv\Scripts\activatepip install -r requirement.txtCreate a .env file in the project root:
GOOGLE_API_KEY=your_google_gemini_api_key_hereGet your free Gemini API key at aistudio.google.com
This reads the PDFs, chunks them, embeds them, and stores everything in ChromaDB. Run this once before starting the app (and again whenever you add new documents).
python -m src.ingestExpected output:
Loading: data/india_leave_policy_2024.pdf
Loading: data/us_expense_policy_2024.pdf
Total chunks created: 87
Index built! 87 vectors stored.
streamlit run app.pyOpen http://localhost:8501 in your browser.
Question: What is the leave policy for employees in India?
Filters selected:
- Department:
HR - Region:
India - Policy Type:
Leave - Year:
2024
Answer: The chatbot retrieves only India HR Leave policy chunks and generates a grounded answer with source citations.
-
Place the PDF or DOCX file in the
data/folder. -
Add an entry to
data/metadata_manifest.json:
{
"filename": "uk_data_privacy_2024.pdf",
"department": "Legal",
"region": "UK",
"policy_type": "Data Privacy",
"effective_year": 2024
}- Re-run the ingestion pipeline:
python -m src.ingestSupported file formats: .pdf, .txt, .md, .docx
If you want to quickly check that ChromaDB was populated correctly:
python check.pyThis prints the total number of stored chunks and a sample document with its metadata.
pytest tests/| Test File | What it Tests |
|---|---|
test_embedding.py |
Embedding shape (384-dim) and semantic similarity |
test_retriever.py |
Filter construction logic (single, multiple, none) |
test_end_to_end.py |
Full pipeline: retrieval + LLM + source metadata validation |
Note:
test_end_to_end.pyrequires the ChromaDB index to be built (python -m src.ingest) and a validGOOGLE_API_KEYin.envbefore running.
PDF / DOCX files
↓
Load with LangChain DocumentLoaders
↓
Split into chunks (500 tokens, 50 overlap)
↓
Attach metadata (department, region, policy_type, year)
↓
Embed with BAAI/bge-small-en-v1.5
↓
Store in ChromaDB (persisted to disk)
User question + sidebar filters
↓
Build ChromaDB where-clause filter
↓
Semantic search (MMR) on filtered chunks
↓
Top-5 chunks passed as context to Gemini
↓
Grounded answer + source citations → Streamlit UI
All settings are in src/config.py:
| Variable | Default | Description |
|---|---|---|
CHUNK_SIZE |
500 |
Tokens per document chunk |
OVERLAP_SIZE |
50 |
Overlap between consecutive chunks |
TOP_K |
5 |
Number of chunks retrieved per query |
EMBED_MODEL |
BAAI/bge-small-en-v1.5 |
HuggingFace embedding model |
LLM_MODEL |
gemini-2.5-flash |
Gemini model name |
LLM_TEMPERATURE |
0.1 |
Lower = more factual responses |
COLLECTION_NAME |
policy_documents |
ChromaDB collection name |
- Push your project to a public GitHub repository (ChromaDB index included, or add an init step).
- Go to share.streamlit.io and sign in with GitHub.
- Click New app → select your repo → set main file to
app.py. - Under Advanced settings → Secrets, add:
GOOGLE_API_KEY = "your_key_here" - Click Deploy.
404 NOT_FOUND error from Gemini
→ The model name is wrong. Check LLM_MODEL in config.py. Valid values: gemini-2.5-flash, gemini-2.5-pro. Run python -c "import google.generativeai as g; g.configure(api_key='YOUR_KEY'); [print(m.name) for m in g.list_models()]" to list all available models for your key.
AssertionError: Should return source documents
→ The ChromaDB index is empty or not built yet. Run python -m src.ingest first, then re-run tests.
Metadata filter returns no results
→ Metadata keys must match exactly (case-sensitive). Run python check.py and inspect the printed metadata to confirm the exact key names stored in ChromaDB.
HuggingFace model download is slow
→ The embedding model (~130MB) downloads once on first run and is cached in ~/.cache/huggingface/. Subsequent runs are instant.
MIT License — feel free to use, modify, and distribute.