# 📒 Munshi — Private AI Assistant for Indian CA Firms
*Munshi (मुंशी)* — the traditional Hindi term for a record-keeper or clerk. This Munshi is digital, private, and never leaves your office.
## What Is This?
Munshi is a **fully offline, on-premise AI assistant** built for Indian Chartered Accountant firms. It runs entirely on a server inside the firm's office — no client data ever leaves the premises. No GPT, no Claude, no internet calls.
Staff access Munshi through a clean web browser interface. Munshi reads client documents (GST returns, ITRs, audit reports, scanned notices), answers questions about them with proper citations, and is built to scale into automated GST notice reply drafting and other CA-specific workflows.
## Why Build This?
Indian CA firms handle highly confidential client data:
- GSTRs, ITRs, audit reports, bank statements
- Tax demand notices, scrutiny notices
- Personal financial records of clients
ICAI rules and client confidentiality agreements prevent firms from putting any of this into ChatGPT or other cloud AI services. But CAs spend hours every day searching through PDFs, drafting routine documents, and reconciling data.
Munshi solves this — a private AI inside the firm, accessible to all staff, that never sends data outside.
## Features (Current Prototype)
- ✅ **Local LLM inference** — Qwen 2.5-3B running on consumer GPU (RTX 3050 4GB tested)
- ✅ **Document Q&A** — Ask questions in natural language, get cited answers
- ✅ **OCR for scanned PDFs** — Tesseract + Poppler pipeline auto-detects scanned vs typed
- ✅ **Multi-client support** — Documents organized by client folder with isolation
- ✅ **Live document upload** — Drag-and-drop new PDFs through the UI; auto-indexed
- ✅ **Source citations** — Every answer shows source files with relevance scores
- ✅ **Branded web UI** — Clean Streamlit interface, professional appearance
- ✅ **Zero internet calls** — Verified offline operation
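The scanned-vs-typed auto-detection boils down to a simple heuristic: if a PDF page yields almost no extractable text, treat it as a scan and route it through Tesseract. A minimal sketch of that decision (function name and threshold are illustrative, not the repo's actual code):

```python
# Heuristic for deciding whether a PDF page needs OCR.
# A page whose embedded text layer is (nearly) empty is almost
# certainly a scan and should be rendered and sent to Tesseract.

OCR_MIN_CHARS = 25  # below this, assume the text layer is absent or junk

def needs_ocr(extracted_text: str, min_chars: int = OCR_MIN_CHARS) -> bool:
    """Return True if a page's extracted text is too sparse to trust."""
    # Strip all whitespace so pages containing only layout artifacts
    # still count as empty.
    meaningful = "".join(extracted_text.split())
    return len(meaningful) < min_chars
```

In the real pipeline the input text would come from a PDF text extractor (e.g. `pypdf`'s `page.extract_text()`); pages flagged here are rendered to images via Poppler and OCR'd with Tesseract instead.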
## Architecture
```
┌─────────────────────────────────────────┐
│ Browser (any office desktop or laptop)  │
│ http://munshi.local                     │
└─────────────────┬───────────────────────┘
                  │
┌─────────────────▼───────────────────────┐
│ Streamlit Web UI (munshi_ui.py)         │
│  - Chat interface                       │
│  - Document upload                      │
│  - Source citations                     │
└─────────────────┬───────────────────────┘
                  │
┌─────────────────▼───────────────────────┐
│ RAG Pipeline (LlamaIndex)               │
│  - Embeddings (BGE-small-en-v1.5)       │
│  - OCR detection + Tesseract            │
└─────────┬───────────────────┬───────────┘
          │                   │
┌─────────▼─────────┐ ┌───────▼──────────┐
│ Qdrant Vector DB  │ │ llama.cpp        │
│ (Docker, port     │ │ serving Qwen     │
│  6333)            │ │ (port 8000)      │
└───────────────────┘ └──────────────────┘
```
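Conceptually, the app talks to the llama.cpp server over its OpenAI-compatible HTTP API, and the RAG pipeline's job is to pack retrieved chunks into the prompt. A hedged stdlib sketch of that hand-off (the prompt wording and helper names are illustrative; the actual wiring in the repo goes through LlamaIndex, and the port matches the diagram above):

```python
import json
import urllib.request

# llama.cpp's server exposes an OpenAI-compatible chat endpoint.
LLM_URL = "http://localhost:8000/v1/chat/completions"

def build_chat_request(question: str, chunks: list[str]) -> dict:
    """Assemble an OpenAI-style chat payload with retrieved context."""
    context = "\n\n".join(f"[Source {i + 1}]\n{c}" for i, c in enumerate(chunks))
    return {
        "messages": [
            {"role": "system",
             "content": "Answer only from the provided sources and cite them."},
            {"role": "user",
             "content": f"Sources:\n{context}\n\nQuestion: {question}"},
        ],
        "temperature": 0.2,
        "max_tokens": 512,
    }

def ask(question: str, chunks: list[str]) -> str:
    """POST the payload to the local server (requires it to be running)."""
    req = urllib.request.Request(
        LLM_URL,
        data=json.dumps(build_chat_request(question, chunks)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Because the payload is plain OpenAI chat format, the same request works against any OpenAI-compatible backend, which keeps the UI decoupled from the specific model server.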
## Tech Stack
| Component | Choice | Why |
|---|---|---|
| LLM Engine | llama.cpp (native binary) | Fastest CUDA inference on consumer GPUs |
| Model | Qwen 2.5-3B-Instruct Q4_K_M | Best quality at 4GB VRAM |
| Embeddings | BAAI/bge-small-en-v1.5 | Local, fast, accurate for English |
| Vector DB | Qdrant (Docker) | Production-ready, simple ops |
| OCR | Tesseract 5.5 + Poppler | Industry standard, free |
| RAG Framework | LlamaIndex | Best document handling for our use case |
| UI | Streamlit | Fast iteration, good defaults |
## Setup
### Prerequisites
- Windows 10/11 or Linux (tested on Windows 11)
- NVIDIA GPU with CUDA support (RTX 3050 4GB minimum, RTX 4090 24GB recommended for production)
- Python 3.11
- Docker Desktop
- Tesseract OCR 5.5+
- Git
### Step 1 — Clone The Repo
```shell
git clone https://github.com/poojithdevan4D/Munshi.git
cd Munshi
```
### Step 2 — Set Up Python Environment
```powershell
python -m venv venv
.\venv\Scripts\activate
pip install -r requirements.txt
```
### Step 3 — Download The Model
Download [Qwen 2.5-3B-Instruct Q4_K_M GGUF](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct-GGUF/blob/main/qwen2.5-3b-instruct-q4_k_m.gguf) and place it in `models/`.
### Step 4 — Build llama.cpp With CUDA
Build llama.cpp with CUDA support following the [llama.cpp build instructions](https://github.com/ggerganov/llama.cpp), then place the resulting server binary in `llama-server/`.
### Step 5 — Start Qdrant
```shell
docker run -d --name munshi-qdrant -p 6333:6333 -p 6334:6334 qdrant/qdrant
```
### Step 6 — Install Tesseract + Poppler
- Tesseract: https://github.com/UB-Mannheim/tesseract/wiki
- Poppler: https://github.com/oschwartz10612/poppler-windows/releases
Update the paths in `app/munshi_ui.py` if either is installed elsewhere.
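On Windows, neither tool is on `PATH` by default, so the Python wrappers have to be pointed at the installed binaries. A config sketch with the standard installer locations (these exact paths are assumptions; check them against `app/munshi_ui.py`):

```python
import pytesseract

# Default Windows install path; adjust if Tesseract lives elsewhere.
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Poppler's bin folder is passed to pdf2image per call, e.g.:
#   convert_from_path("scan.pdf", dpi=300, poppler_path=r"C:\poppler\Library\bin")
```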
### Step 7 — Start Munshi
Two terminals:
**Terminal 1 — Start the LLM server:**
```powershell
cd llama-server
.\start_server.ps1
```
**Terminal 2 — Ingest documents and start UI:**
```powershell
cd code
python ingest_all_clients.py
cd ..\app
streamlit run munshi_ui.py
```
Open http://localhost:8501 in your browser.
## Project Structure
```
Munshi/
├── app/
│   ├── munshi_ui.py              # Streamlit web interface
│   └── .streamlit/
│       └── config.toml           # Theme + server config
├── code/
│   ├── generate_full_dataset.py  # Synthetic CA dataset generator
│   ├── ingest_all_clients.py     # PDF ingestion pipeline
│   ├── query_full_firm.py        # CLI query tool
│   ├── rag_first_query.py        # Quick RAG test
│   └── test_*.py                 # Various sanity tests
├── data/
│   └── sharma_associates/        # Synthetic dataset (fictional firm)
│       ├── acme_trading/
│       ├── krishna_restaurant/
│       ├── mehta_clinic/
│       ├── patel_textiles/
│       └── techflow_solutions/
├── start_server.ps1              # Launch llama-server with CUDA
├── requirements.txt              # Python dependencies
├── .gitignore
├── LICENSE
└── README.md
```
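Multi-client isolation follows directly from this layout: each client folder under `data/sharma_associates/` gets its own vector collection, so retrieval for one client can never surface another client's documents. A stdlib-only sketch of the folder walk (the collection naming scheme is an assumption, not necessarily what `ingest_all_clients.py` does):

```python
from pathlib import Path

def collection_name(client_dir: str) -> str:
    """Map a client folder name to a per-client Qdrant collection name."""
    return "munshi_" + client_dir.lower().replace(" ", "_")

def discover_clients(data_root: str) -> dict[str, list[Path]]:
    """Find each client folder under the data root and list its PDFs."""
    clients = {}
    for client in sorted(Path(data_root).iterdir()):
        if client.is_dir():
            clients[collection_name(client.name)] = sorted(client.rglob("*.pdf"))
    return clients

# Each client's PDFs would then be chunked, embedded with BGE-small,
# and upserted into that client's own collection.
```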
## Performance
Tested on Acer Nitro 5 — RTX 3050 4GB, i5-12500H, 16GB RAM:
- **Cold load:** ~30 seconds (model + embeddings)
- **Inference:** 49 tok/s (full GPU offload, flash attention)
- **Warm queries:** ~70 tok/s
- **VRAM usage:** 2.3 GB / 3.3 GB available
- **Cross-client RAG queries:** 4-15 seconds end-to-end
For production deployment with Qwen 14B on RTX 4090 24GB, expect ~50 tok/s and dramatically better answer quality.
## Roadmap
### Phase 1 — Local Prototype ✅ COMPLETE
- [x] Local LLM serving with CUDA
- [x] RAG pipeline with Qdrant
- [x] OCR for scanned documents
- [x] Multi-client document organization
- [x] Web UI with citations
- [x] Live document upload via UI
### Phase 2 — Production Architecture (Next)
- [ ] Hardware spec for CA firm server
- [ ] Network architecture (LAN + VPN for WFH)
- [ ] Multi-user authentication
- [ ] Role-based access control
- [ ] Audit logging
### Phase 3 — High-Value Workflows
- [ ] GST notice reply drafter (DRC-01, ASMT-10)
- [ ] GSTR-2B vs Purchase Register reconciliation
- [ ] Form 26AS vs TDS book reconciliation
- [ ] Client communication drafter
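At its core, the planned GSTR-2B vs purchase register reconciliation is a keyed match between two invoice lists. An illustrative stdlib sketch of that matching logic (field names and the rupee tolerance are assumptions about how the workflow might look, not implemented code):

```python
def _key(row: dict) -> tuple:
    """Match invoices on (supplier GSTIN, invoice number)."""
    return (row["gstin"], row["invoice_no"])

def reconcile(gstr2b: list[dict], purchase_register: list[dict],
              tolerance: float = 1.0) -> dict[str, list]:
    """Bucket invoices into matched / mismatched / missing on either side."""
    in_2b = {_key(r): r for r in gstr2b}
    in_pr = {_key(r): r for r in purchase_register}

    result = {"matched": [], "tax_mismatch": [],
              "missing_in_2b": [], "missing_in_pr": []}
    for k, pr_row in in_pr.items():
        if k not in in_2b:
            result["missing_in_2b"].append(pr_row)        # ITC at risk
        elif abs(in_2b[k]["tax"] - pr_row["tax"]) > tolerance:
            result["tax_mismatch"].append((in_2b[k], pr_row))
        else:
            result["matched"].append(pr_row)
    for k, row_2b in in_2b.items():
        if k not in in_pr:
            result["missing_in_pr"].append(row_2b)        # unbooked purchase
    return result
```

The real workflow would feed parsed GSTR-2B JSON and the firm's purchase register into this kind of matcher, then hand the mismatch buckets to the LLM for a drafted explanation.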
### Phase 4 — Deployment Kit
- [ ] One-click installer
- [ ] Hardware test suite
- [ ] Backup automation
- [ ] Update mechanism
## License
MIT — see [LICENSE](LICENSE) file.
## Author
**Poojith Devan**
MCA (Generative AI), SRM University
MSc (AI & Data Science), O.P. Jindal Global University
GitHub: [@poojithdevan4D](https://github.com/poojithdevan4D)
---
*Built with care for Indian CA firms who deserve modern AI without compromising client confidentiality.*