Intelligent document understanding — parse, store, and query your documents.
DocWise combines adaptive document parsing, hybrid semantic search, and an agentic Q&A pipeline into a single self-contained platform. Upload PDF, DOCX, or Markdown files and ask natural-language questions — DocWise returns grounded answers with source citations. It also ships as an MCP server for direct integration with Claude Desktop.
- Adaptive parsing — extracts headings, tables, code blocks, lists, images, and text from PDF (PyMuPDF + pdfplumber), DOCX (python-docx), and Markdown (mistune). Each element is type-tagged.
- Smart chunking — content-type-aware strategy: tables and code are preserved whole; long text is recursively split; headings are merged with their following paragraph; lists are split by item.
- Hybrid search — combines ChromaDB cosine-similarity (semantic) with BM25 keyword search, fused via Reciprocal Rank Fusion (RRF; sketched after this list). No external services needed; runs fully local.
- Agentic Q&A pipeline — a Query Understanding agent reformulates and classifies the question, hybrid retrieval fetches the top-K chunks, and an Answer Synthesis agent composes a grounded answer with inline `[N]` citations.
- MCP server — exposes `list_documents`, `search_documents`, `ask_question`, and `ingest_document` as tools over HTTP/SSE (port 8001). Works with Claude Desktop and any MCP-compatible client.
- Streamlit web UI — three-tab interface: Upload, Ask, and Explorer (chunk browser with type-aware charts and search filters).
- Local-first embeddings — uses FastEmbed (`BAAI/bge-small-en-v1.5`, ONNX) by default — no OpenAI key required. Switch to `text-embedding-3-small` via a single env var.
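To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists of chunk IDs (illustrative only; `k = 60` is the conventional RRF constant, not necessarily the value DocWise uses):

```python
from collections import defaultdict

def rrf_fuse(semantic_ids: list[str], bm25_ids: list[str], k: int = 60) -> list[str]:
    """Fuse two ranked lists of chunk IDs with Reciprocal Rank Fusion.

    Each chunk scores 1 / (k + rank) per list it appears in; summing the
    scores rewards chunks that rank highly in either retriever.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in (semantic_ids, bm25_ids):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# rrf_fuse(["c3", "c1", "c7"], ["c1", "c9"]) ranks "c1" first:
# it appears near the top of both lists.
```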
Prerequisites: Docker + Docker Compose
```bash
git clone https://github.com/your-username/docwise.git
cd docwise

# 1. Add your LLM API key
cp backend/.env.example backend/.env
# edit backend/.env and set OPENAI_API_KEY or ANTHROPIC_API_KEY

# 2. Build and start all services
cd docker
docker-compose up --build
```

First build takes a few minutes — it installs dependencies and downloads the FastEmbed ONNX model inside the container.
| Service | URL |
|---|---|
| Streamlit UI | http://localhost:8501 |
| FastAPI + Swagger | http://localhost:8000/docs |
| MCP SSE endpoint | http://localhost:8001/sse |
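Once the stack is up, a quick smoke test from Python (assumes the `requests` package is installed; both endpoints appear in the API reference below):

```python
import requests

# Health check: returns status plus the current chunk count.
print(requests.get("http://localhost:8000/health").json())

# List ingested documents (an empty list on a fresh install).
print(requests.get("http://localhost:8000/documents").json())
```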
Useful commands:
```bash
# Run in background
docker-compose up --build -d

# View logs
docker-compose logs -f

# View logs for one service
docker-compose logs -f backend
docker-compose logs -f frontend

# Stop services
docker-compose down

# Stop and remove volumes (wipes ChromaDB data)
docker-compose down -v
```

Prerequisites: Python 3.12+
```bash
# Backend
cd backend
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env   # already in backend/ — add your API key
python main.py         # starts API on :8000 + MCP on :8001

# Frontend (new terminal) — reuse backend venv, no separate pip install
cd frontend
source ../backend/.venv/bin/activate
streamlit run app.py --server.port 8501
```

Or use the start scripts:
```bash
# Terminal 1
./backend/start.sh

# Terminal 2
./frontend/start.sh
```

Note: Always run backend scripts from the `backend/` directory. Use `python -m docwise_mcp.server` (not `python docwise_mcp/server.py`) to start the MCP server standalone.
```
Streamlit UI (:8501)
      │ HTTP
      ▼
FastAPI Backend (:8000)
      │
      ├─ POST /ingest/file
      │     ParserRegistry → AdaptiveChunker → FastEmbed → ChromaDB
      │
      ├─ POST /query
      │     QueryUnderstanding → HybridSearch → AnswerSynthesis
      │     (LLM reformulation)   (BM25+RRF)    (citations)
      │
      ├─ GET|DELETE /documents
      │
      └─ MCP Server (:8001 SSE)
            ingest_document / search_documents / ask_question / list_documents
```
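The `/query` branch of this diagram corresponds to `agents/pipeline.py`, which chains the three stages. A minimal sketch of that orchestration, assuming the agents are plain callables (function and parameter names here are illustrative, not the project's actual API):

```python
from typing import Callable

def answer_question(
    question: str,
    understand: Callable[[str], str],             # LLM reformulation + classification
    retrieve: Callable[[str, int], list[str]],    # hybrid search: ChromaDB + BM25 + RRF
    synthesize: Callable[[str, list[str]], str],  # grounded answer with [N] citations
    top_k: int = 10,
) -> str:
    """Chain the three pipeline stages shown in the diagram above."""
    reformulated = understand(question)
    chunks = retrieve(reformulated, top_k)
    return synthesize(question, chunks)
```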
Add to `~/Library/Application Support/Claude/claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "docwise": {
      "url": "http://localhost:8001/sse"
    }
  }
}
```

Claude Desktop will then have access to all four DocWise tools.
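Beyond Claude Desktop, any MCP-compatible client can call the same tools programmatically. A minimal sketch using the official `mcp` Python SDK and its SSE client (API assumed from the SDK; adjust to your installed version):

```python
import asyncio

from mcp import ClientSession
from mcp.client.sse import sse_client

async def main() -> None:
    # Connect to the DocWise MCP server over SSE.
    async with sse_client("http://localhost:8001/sse") as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])
            # Expect: ingest_document, search_documents, ask_question, list_documents

asyncio.run(main())
```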
All settings are overridable via environment variables or `backend/.env`:

| Variable | Default | Notes |
|---|---|---|
| `LLM_PROVIDER` | `openai` | `openai` or `anthropic` |
| `LLM_MODEL` | `gpt-4o-mini` | any model name |
| `OPENAI_API_KEY` | — | required if using OpenAI |
| `ANTHROPIC_API_KEY` | — | required if using Anthropic |
| `EMBEDDING_PROVIDER` | `fastembed` | `fastembed` or `openai` |
| `EMBEDDING_MODEL` | `BAAI/bge-small-en-v1.5` | ONNX model name |
| `CHUNK_SIZE` | `512` | max chars per text chunk |
| `CHUNK_OVERLAP` | `50` | overlap between chunks |
| `TOP_K` | `10` | chunks retrieved per query |
| `CHROMA_PATH` | `./data/chroma` | vector store persistence path |
| `API_PORT` | `8000` | FastAPI port |
| `MCP_PORT` | `8001` | MCP SSE port |
```
docwise/
├── backend/
│   ├── main.py                      # Uvicorn entrypoint
│   ├── config.py                    # All settings
│   ├── logger.py                    # structlog setup
│   ├── api/
│   │   ├── app.py                   # FastAPI app factory
│   │   └── routes/
│   │       ├── ingest.py            # POST /ingest/file
│   │       ├── query.py             # POST /query, /search
│   │       └── documents.py         # GET/DELETE /documents
│   ├── parsing/
│   │   ├── base.py                  # ParsedElement, ParsedDocument
│   │   ├── registry.py              # ext → parser routing
│   │   ├── pdf_parser.py            # PyMuPDF + pdfplumber
│   │   ├── docx_parser.py           # python-docx
│   │   └── markdown_parser.py       # mistune
│   ├── chunking/
│   │   └── adaptive.py              # Content-type-aware chunking
│   ├── embeddings/
│   │   └── provider.py              # FastEmbed / OpenAI, singleton
│   ├── vectorstore/
│   │   └── chroma.py                # ChromaDB + BM25 + RRF
│   ├── agents/
│   │   ├── query_understanding.py   # LLM query reformulation
│   │   ├── answer_synthesis.py      # LLM grounded answer + citations
│   │   └── pipeline.py              # Orchestrates understanding + retrieval + synthesis
│   └── docwise_mcp/
│       └── server.py                # FastMCP HTTP/SSE server
├── frontend/
│   └── app.py                       # Streamlit 3-tab UI
├── docker/
│   ├── Dockerfile.backend
│   ├── Dockerfile.frontend
│   └── docker-compose.yml
├── docs/
│   ├── REQUIREMENTS.md
│   ├── ARCHITECTURE.md
│   └── PLAN.md
└── samples/
    └── sample.md
```
| Layer | Technology |
|---|---|
| API | FastAPI + Uvicorn |
| PDF parsing | PyMuPDF + pdfplumber |
| DOCX parsing | python-docx |
| Markdown parsing | mistune |
| Embeddings | FastEmbed (BAAI/bge-small-en-v1.5) |
| Vector store | ChromaDB (embedded) |
| Keyword search | rank-bm25 |
| LLM | OpenAI / Anthropic (configurable) |
| MCP | mcp (FastMCP, HTTP/SSE) |
| Web UI | Streamlit |
| Config | pydantic-settings |
| Logging | structlog |
| Containerisation | Docker + Compose |
| Format | Parser | Tables | Images | Code | Headings |
|---|---|---|---|---|---|
| PDF | PyMuPDF + pdfplumber | ✓ | ✓ | — | ✓ |
| DOCX | python-docx | ✓ | — | — | ✓ |
| Markdown | mistune | ✓ | — | ✓ | ✓ |
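For illustration, the extension-to-parser routing that `parsing/registry.py` performs can be sketched like this (the names below are hypothetical stand-ins; the real registry returns parser instances, not strings):

```python
from pathlib import Path

# Illustrative extension → parser map.
PARSERS = {
    ".pdf": "PDFParser",       # PyMuPDF + pdfplumber
    ".docx": "DocxParser",     # python-docx
    ".md": "MarkdownParser",   # mistune
}

def resolve_parser(filename: str) -> str:
    ext = Path(filename).suffix.lower()
    try:
        return PARSERS[ext]
    except KeyError:
        raise ValueError(f"Unsupported file type: {ext}")
```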
| Method | Path | Description |
|---|---|---|
| `POST` | `/ingest/file` | Upload and ingest a document |
| `POST` | `/query` | Ask a question (full agentic pipeline) |
| `POST` | `/query/search` | Raw hybrid search (no LLM) |
| `GET` | `/documents` | List all ingested documents |
| `GET` | `/documents/{doc_id}` | Document metadata + type counts |
| `GET` | `/documents/{doc_id}/chunks` | All chunks for a document |
| `DELETE` | `/documents/{doc_id}` | Delete document + its chunks |
| `GET` | `/health` | Health check + chunk count |
Full interactive docs at http://localhost:8000/docs.
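For example, running the full agentic pipeline from Python (the `question` field name is an assumption; confirm the exact request schema in the Swagger UI):

```python
import requests

# Hypothetical payload shape; verify field names at http://localhost:8000/docs.
resp = requests.post(
    "http://localhost:8000/query",
    json={"question": "What does the sample document say about chunking?"},
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # grounded answer with [N] citations plus source chunks
```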