A production-grade Retrieval-Augmented Generation (RAG) system with query classification, Redis caching, a REST API, hallucination guardrails, citation-grounded answers, and real-time metrics — containerized with Docker and deployed on AWS EC2.
Chatbot UI: chatbot.rajkumarai.dev
REST API: chatbot.rajkumarai.dev/api/query (POST)
Swagger Docs: chatbot.rajkumarai.dev/api/docs
Portfolio: rajkumarai.dev
Deployed on AWS EC2 (t3.micro) behind nginx with Let's Encrypt SSL.
| Feature | Details |
|---|---|
| Query Classification | Rule-based classifier routes each query to an optimized retrieval config (FACTUAL / COMPLEX / AMBIGUOUS / KEYWORD) |
| Redis Caching | Normalized query caching with SHA-256 keys, 1-hour TTL — skips caching for fallback/low-confidence responses |
| REST API | FastAPI service (POST /api/query) with Swagger UI — independently queryable without the Streamlit UI |
| Citation-Grounded Answers | Every response includes `[SOURCE N: filename \| Page]` citations, so answers are traceable to the documents they came from |
| Hallucination Guardrails | Similarity threshold + min context length check before the LLM is called |
| MMR Retrieval | Maximum Marginal Relevance reduces redundant chunks |
| Chunk Deduplication | Deduplicates by (source, page) — prevents the same page appearing multiple times in context |
| Cost Controls | Rate limiting (20 req/min per IP), input length cap (500 chars), token budget (400 tokens) — all fire before OpenAI is called |
| Library Mode | Query pre-indexed PDFs with multi-document filtering |
| Upload Mode | Upload any PDF and query it instantly (temp-dir storage, isolated per session) |
| Real-Time Metrics | Session latency, token usage, cache hit rate, success rate — tracked in the UI |
| Structured Logging | Every query logged as JSON to logs/rag_queries.log |
┌──────────────────────────────────────────────────────────────────┐
│ USER INTERFACE │
│ app/streamlit_ui.py ←→ api/rag_api.py │
│ [Library Mode | Upload Mode | Multi-Doc Select | Metrics] │
└────────────────────────────┬─────────────────────────────────────┘
│
┌──────────────────▼──────────────────┐
│ api/server.py (FastAPI) │
│ Rate limit (20/min per IP) │
│ Input length guard (500 chars) │
│ Token budget guard (400 tokens) │
│ POST /query → run_query() │
│ GET /health │
└──────────────────┬──────────────────┘
│
┌──────────────────▼──────────────────┐
│ cache/redis_cache.py │
│ SHA-256 key on normalized query │
│ HIT → return cached response │
│ MISS → continue pipeline │
└──────────────────┬──────────────────┘
│
┌──────────────────▼──────────────────┐
│ retrieval/query_classifier.py │
│ FACTUAL / COMPLEX / AMBIGUOUS / │
│ KEYWORD → sets top_k, fetch_k │
└──────────────────┬──────────────────┘
│
┌──────────────────▼──────────────────┐
│ retrieval/retriever.py │
│ MMR retrieval with per-query config │
│ check_retrieval_confidence() │
└──────────────────┬──────────────────┘
│
┌──────────────────▼──────────────────┐
│ retrieval/reranker.py │
│ Score filter (threshold=0.15) │
│ Dedup by (source, page) │
└──────────────────┬──────────────────┘
│
┌──────────────────▼──────────────────┐
│ vectorstore/chroma_manager.py │
│ ChromaDB — persisted or temp-dir │
└──────────────────┬──────────────────┘
│
┌──────────────────▼──────────────────┐
│ GPT-3.5-turbo (OpenAI) │
│ Strict prompt + citation format │
└──────────────────┬──────────────────┘
│
┌──────────────────▼──────────────────┐
│ monitoring/logger.py │
│ log_query_event() → JSONL log │
│ monitoring/metrics.py → UI panel │
└─────────────────────────────────────┘
PDF Files
↓
ingestion/pdf_loader.py ← PyPDFLoader + source metadata patching
↓
ingestion/chunking.py ← RecursiveCharacterTextSplitter (1000/200)
↓
ingestion/embedding_pipeline.py ← OpenAI text-embedding-3-small (1536-dim)
↓
vectorstore/chroma_manager.py ← ChromaDB PersistentClient
↓
retrieval/query_classifier.py ← Classifies query type, sets retrieval config
↓
retrieval/retriever.py ← MMR retrieval with classified config
↓
retrieval/reranker.py ← Score filter + (source, page) deduplication
↓ [GUARDRAIL: fallback if low confidence]
api/rag_api.py ← format_docs_with_metadata() + strict prompt
↓
GPT-3.5-turbo ← Grounded answer with [SOURCE N: file | Page]
↓
cache/redis_cache.py ← Cache result for 1 hour
↓
monitoring/logger.py ← JSON log entry to logs/rag_queries.log
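The (source, page) deduplication step in the flow above can be sketched as follows. This is a minimal illustration with a hypothetical `Chunk` type, not the actual `reranker.py` code:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    source: str   # filename the chunk came from
    page: int     # page number within that file
    text: str
    score: float  # retrieval similarity score

def dedup_by_source_page(chunks):
    """Keep only the highest-scoring chunk per (source, page) pair."""
    seen = set()
    unique = []
    for c in sorted(chunks, key=lambda c: c.score, reverse=True):
        key = (c.source, c.page)
        if key not in seen:
            seen.add(key)
            unique.append(c)
    return unique

chunks = [
    Chunk("paper.pdf", 3, "attention is ...", 0.82),
    Chunk("paper.pdf", 3, "attention is ...", 0.79),  # same page: dropped
    Chunk("paper.pdf", 5, "multi-head ...", 0.61),
]
print([(c.source, c.page) for c in dedup_by_source_page(chunks)])
```

The effect is that a single page can never occupy more than one slot of the `top_k` context budget.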
rag-chatbot/
├── app/
│ └── streamlit_ui.py # Streamlit UI — library, upload, metrics, welcome guide
├── api/
│ ├── rag_api.py # Core RAG logic: run_query(), build_rag_chain()
│ ├── server.py # FastAPI server — rate limiting, input guards, POST /query
│ └── rate_limiter.py # slowapi Limiter instance (Redis-backed)
├── cache/
│ └── redis_cache.py # Redis caching with query normalization + stats
├── ingestion/
│ ├── pdf_loader.py # PDF loading + metadata normalization
│ ├── chunking.py # RecursiveCharacterTextSplitter
│ └── embedding_pipeline.py # End-to-end ingest pipeline
├── retrieval/
│ ├── query_classifier.py # Rule-based classifier → RetrievalConfig
│ ├── retriever.py # MMR retriever + hallucination guardrails
│ └── reranker.py # Score filtering + (source, page) deduplication
├── vectorstore/
│ └── chroma_manager.py # ChromaDB CRUD + document listing
├── monitoring/
│ ├── logger.py # Structured JSON logger
│ └── metrics.py # MetricsTracker (latency, tokens, cache hit rate)
├── evaluation/
│ └── run_rag_eval.py # Evaluation script
├── prompts/
│ └── rag_prompt.txt # Strict citation prompt template
├── config/
│ └── settings.py # All config from .env
├── data/sampledocs/ # Library PDFs
├── vectorstore_data/ # ChromaDB persisted store (volume-mounted)
├── logs/ # rag_queries.log written here
├── ingest.py # Top-level ingestion script
├── docker-compose.yml # 3 services: redis, app (Streamlit), api (FastAPI)
├── Dockerfile
├── .env.example
└── requirements.txt
git clone https://github.com/Rajkumar2002-Rk/rag-chatbot.git
cd rag-chatbot
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
# Add your OPENAI_API_KEY to .env
# Add PDFs to data/sampledocs/
python ingest.py
# Run Streamlit UI
streamlit run app/streamlit_ui.py
# Run FastAPI (separate terminal)
uvicorn api.server:app --host 0.0.0.0 --port 8000

cp .env.example .env
# Add your OPENAI_API_KEY to .env
docker-compose up -d

This starts three services:
- redis — Redis 7 (caching)
- app — Streamlit UI on port 8501
- api — FastAPI on port 8000
Open http://localhost:8501 for the UI or hit http://localhost:8000/query for the API.
The `vectorstore_data/` and `logs/` directories are volume-mounted so data persists across restarts.
curl -X POST https://chatbot.rajkumarai.dev/api/query \
-H "Content-Type: application/json" \
-d '{"query": "What are the attention mechanisms described in the paper?"}'

Request body:
{
"query": "string",
"filter_docs": ["doc1.pdf", "doc2.pdf"] // optional
}

Response:
{
"answer": "The paper describes...",
"sources": ["Attention Is All You Need.pdf"],
"query_type": "COMPLEX",
"cache_hit": false,
"response_time_ms": 1240,
"num_chunks": 7
}

Swagger UI: chatbot.rajkumarai.dev/api/docs
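The same endpoint can be called from Python with only the standard library. This is a minimal client sketch; the URL and field names follow the request/response examples above, and `build_payload` / `ask` are illustrative helper names, not part of the project's code:

```python
import json
import urllib.request

API_URL = "https://chatbot.rajkumarai.dev/api/query"

def build_payload(query, filter_docs=None):
    """Assemble the request body; filter_docs is optional."""
    body = {"query": query}
    if filter_docs:
        body["filter_docs"] = filter_docs
    return body

def ask(query, filter_docs=None):
    """POST the query and return the parsed JSON response."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(query, filter_docs)).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Usage (performs a live network call):
# response = ask("What are the attention mechanisms described in the paper?")
# print(response["answer"], response["sources"])
```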
Each query is classified before retrieval, with a retrieval config tuned per type:
| Type | top_k | fetch_k | Notes |
|---|---|---|---|
| `FACTUAL` | 3 | 10 | Short, specific lookups |
| `COMPLEX` | 7 | 25 | Multi-part questions |
| `AMBIGUOUS` | 5 | 15 | Query expansion applied |
| `KEYWORD` | 5 | 15 | Keyword/entity lookups |
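A rule-based router of this kind fits in a few lines. The sketch below is illustrative only, not the actual `query_classifier.py` logic; the specific rules and thresholds are invented for the example:

```python
def classify_query(query: str) -> str:
    """Toy rule-based router: FACTUAL / COMPLEX / AMBIGUOUS / KEYWORD."""
    q = query.lower().strip()
    words = q.split()
    # Multi-part or comparative questions -> COMPLEX
    if " and " in q or "compare" in q or len(words) > 15:
        return "COMPLEX"
    # Bare noun phrases with no question word -> KEYWORD
    question_words = {"what", "who", "when", "where", "why", "how", "which"}
    if not (question_words & set(words)) and "?" not in q:
        return "KEYWORD"
    # Very short questions without a clear subject -> AMBIGUOUS
    if len(words) <= 3:
        return "AMBIGUOUS"
    return "FACTUAL"

print(classify_query("Who wrote the paper?"))      # FACTUAL
print(classify_query("transformer architecture"))  # KEYWORD
```

The classifier's output then selects the `top_k` / `fetch_k` pair from the table above before retrieval runs.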
- Key: SHA-256 hash of the normalized query + `filter_docs`
- Normalization: lowercase, collapse whitespace, strip punctuation, strip filler prefixes ("what is", "tell me about", etc.)
- TTL: 1 hour (configurable via `CACHE_TTL`)
- Skips caching: fallback responses, low-confidence results, errors
- Stats tracked: hits, misses, sets, skipped (surfaced in the UI metrics panel)
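The key construction above can be sketched as follows. This is a minimal illustration with an assumed normalization order and invented filler-prefix list; the real `redis_cache.py` may differ in detail:

```python
import hashlib
import string

FILLER_PREFIXES = ("what is ", "tell me about ", "explain ")

def normalize(query: str) -> str:
    """Lowercase, collapse whitespace, strip filler prefixes and punctuation."""
    q = " ".join(query.lower().split())
    for prefix in FILLER_PREFIXES:
        if q.startswith(prefix):
            q = q[len(prefix):]
            break
    return q.translate(str.maketrans("", "", string.punctuation)).strip()

def cache_key(query: str, filter_docs=None) -> str:
    """SHA-256 over the normalized query plus any document filter."""
    material = normalize(query) + "|" + ",".join(sorted(filter_docs or []))
    return hashlib.sha256(material.encode()).hexdigest()

# Equivalent phrasings map to the same key:
k1 = cache_key("What is MMR retrieval?")
k2 = cache_key("  what is MMR   retrieval ")
print(k1 == k2)  # True
```

Sorting `filter_docs` before hashing keeps the key stable regardless of the order the caller lists the documents in.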
Three conditions trigger a fallback response instead of an LLM call:
- No documents retrieved from ChromaDB
- Best similarity score < `SIMILARITY_THRESHOLD` (default: 0.15)
- Total context length < `MIN_CONTEXT_LENGTH` (default: 50 chars)
When triggered, the user sees a helpful message guiding them toward better queries.
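In code, the three conditions amount to one short check before the LLM call. The sketch below uses the documented defaults; `should_fallback` and the message text are illustrative, not the project's actual identifiers:

```python
SIMILARITY_THRESHOLD = 0.15   # min score to trust a chunk
MIN_CONTEXT_LENGTH = 50       # min chars of context before calling the LLM

FALLBACK_MESSAGE = (
    "I couldn't find relevant material for that query. "
    "Try a more specific question about the indexed documents."
)

def should_fallback(docs, scores) -> bool:
    """Return True if the LLM call should be skipped for this query."""
    if not docs:                               # nothing retrieved at all
        return True
    if max(scores) < SIMILARITY_THRESHOLD:     # best match is too weak
        return True
    total_context = sum(len(d) for d in docs)
    if total_context < MIN_CONTEXT_LENGTH:     # not enough grounding text
        return True
    return False

print(should_fallback([], []))  # True: no documents retrieved
```

Because the check runs before the prompt is ever built, a fallback costs no OpenAI tokens.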
All settings in config/settings.py, overridable via .env:
| Variable | Default | Description |
|---|---|---|
| `OPENAI_API_KEY` | — | Required |
| `REDIS_URL` | `redis://redis:6379/0` | Redis connection URL |
| `CACHE_TTL` | `3600` | Cache TTL in seconds |
| `CHUNK_SIZE` | `1000` | Characters per chunk |
| `CHUNK_OVERLAP` | `200` | Overlap between chunks |
| `RETRIEVAL_K` | `5` | Chunks returned to the LLM |
| `RETRIEVAL_FETCH_K` | `20` | MMR candidates before reranking |
| `RETRIEVAL_LAMBDA` | `0.7` | Relevance vs. diversity balance |
| `SIMILARITY_THRESHOLD` | `0.15` | Min score to trust a chunk |
| `MIN_CONTEXT_LENGTH` | `50` | Min chars of context before the LLM call |
| `RATE_LIMIT` | `20/minute` | Max requests per IP per window |
| `MAX_QUERY_LENGTH` | `500` | Max query length in characters |
| `MAX_INPUT_TOKENS` | `400` | Max query tokens (tiktoken) before the LLM |
| `VECTORSTORE_DIR` | `vectorstore_data/` | ChromaDB persist path |
Three guards fire on every /query request before OpenAI is ever called:
| Guard | Limit | Response on breach |
|---|---|---|
| Rate limiting | 20 requests/minute per IP (slowapi + Redis) | HTTP 429 |
| Input length | 500 characters max | HTTP 400 |
| Token budget | 400 tokens max (tiktoken `cl100k_base`) | HTTP 400 |
All limits are configurable via .env — see RATE_LIMIT, MAX_QUERY_LENGTH, MAX_INPUT_TOKENS.
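A minimal version of these guards can be sketched with the standard library alone. Note the hedges: the real server uses slowapi backed by Redis for rate limiting and tiktoken's `cl100k_base` for token counting, while this sketch uses an in-process sliding window and a naive whitespace token estimate:

```python
import time
from collections import defaultdict, deque

MAX_QUERY_LENGTH = 500   # characters
MAX_INPUT_TOKENS = 400   # real server counts with tiktoken cl100k_base
RATE_LIMIT = 20          # requests per IP per 60-second window

_requests = defaultdict(deque)  # ip -> timestamps of recent requests

def check_request(ip: str, query: str, now=None):
    """Return (ok, status) mimicking the HTTP outcomes: 429, 400, or 200."""
    now = time.time() if now is None else now
    window = _requests[ip]
    while window and now - window[0] > 60:   # drop entries older than 60s
        window.popleft()
    if len(window) >= RATE_LIMIT:
        return False, 429                    # rate limit exceeded
    window.append(now)
    if len(query) > MAX_QUERY_LENGTH:
        return False, 400                    # input too long
    if len(query.split()) > MAX_INPUT_TOKENS:  # crude token estimate
        return False, 400                    # token budget exceeded
    return True, 200

print(check_request("1.2.3.4", "What is MMR retrieval?"))  # (True, 200)
```

All three checks run before any OpenAI call, so a rejected request costs nothing beyond the HTTP round trip.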
Every query writes a JSON line to logs/rag_queries.log:
{
"timestamp": "2026-03-28T10:22:11Z",
"query": "What programming languages does Raj know?",
"query_type": "FACTUAL",
"cache_hit": false,
"response_time_ms": 1180,
"retrieval_time_ms": 192,
"num_chunks": 3,
"source_documents": ["Raj_Resume.pdf"],
"token_usage_estimate": 284
}

Pre-built image: hub.docker.com/r/raja1566/rag-chatbot
docker pull raja1566/rag-chatbot:latest

MIT License