
DocWise

Intelligent document understanding — parse, store, and query your documents.

DocWise combines adaptive document parsing, hybrid semantic search, and an agentic Q&A pipeline into a single self-contained platform. Upload PDF, DOCX, or Markdown files and ask natural-language questions — DocWise returns grounded answers with source citations. It also ships as an MCP server for direct integration with Claude Desktop.


Screenshots

Upload — ingest PDF, DOCX, or Markdown with parsing summary and element type distribution

Ask — natural-language Q&A with grounded answers and source citations

Explorer — browse chunks, element distribution charts, and content search


Features

  • Adaptive parsing — extracts headings, tables, code blocks, lists, images, and text from PDF (PyMuPDF + pdfplumber), DOCX (python-docx), and Markdown (mistune). Each element is type-tagged.
  • Smart chunking — content-type-aware strategy: tables and code are preserved whole; long text is recursively split; headings are merged with their following paragraph; lists are split by item.
  • Hybrid search — combines ChromaDB cosine-similarity (semantic) with BM25 keyword search, fused via Reciprocal Rank Fusion (RRF). No external services needed; runs fully local.
  • Agentic Q&A pipeline — a Query Understanding agent reformulates and classifies the question, hybrid retrieval fetches the top-K chunks, and an Answer Synthesis agent composes a grounded answer with inline [N] citations.
  • MCP server — exposes list_documents, search_documents, ask_question, and ingest_document as tools over HTTP/SSE (port 8001). Works with Claude Desktop and any MCP-compatible client.
  • Streamlit web UI — three-tab interface: Upload, Ask, and Explorer (chunk browser with type-aware charts and search filters).
  • Local-first embeddings — uses FastEmbed (BAAI/bge-small-en-v1.5, ONNX) by default — no OpenAI key required. Switch to text-embedding-3-small via a single env var.
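The Reciprocal Rank Fusion step in hybrid search can be sketched as follows. This is a minimal illustration of the RRF technique, not the actual code in vectorstore/chroma.py; the chunk IDs are made up for the example.

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs via Reciprocal Rank Fusion.

    Each chunk scores 1 / (k + rank) for every list it appears in;
    k = 60 is the constant from the original RRF paper.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A chunk ranked highly by both semantic and BM25 search wins:
semantic = ["c3", "c1", "c2"]
bm25     = ["c1", "c5", "c3"]
print(rrf_fuse([semantic, bm25]))  # ['c1', 'c3', 'c5', 'c2']
```

Because RRF only uses ranks, not raw scores, it fuses cosine similarities and BM25 scores without any normalisation step.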

Quick Start (Docker)

Prerequisites: Docker + Docker Compose

git clone https://github.com/your-username/docwise.git
cd docwise

# 1. Add your LLM API key
cp backend/.env.example backend/.env
# edit backend/.env and set OPENAI_API_KEY or ANTHROPIC_API_KEY

# 2. Build and start all services
cd docker
docker-compose up --build

First build takes a few minutes — it installs dependencies and downloads the FastEmbed ONNX model inside the container.

Service            URL
Streamlit UI       http://localhost:8501
FastAPI + Swagger  http://localhost:8000/docs
MCP SSE endpoint   http://localhost:8001/sse

Useful commands:

# Run in background
docker-compose up --build -d

# View logs
docker-compose logs -f

# View logs for one service
docker-compose logs -f backend
docker-compose logs -f frontend

# Stop services
docker-compose down

# Stop and remove volumes (wipes ChromaDB data)
docker-compose down -v

Quick Start (Local)

Prerequisites: Python 3.12+

# Backend
cd backend
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env   # already in backend/ — add your API key
python main.py         # starts API on :8000 + MCP on :8001

# Frontend (new terminal) — reuse backend venv, no separate pip install
cd frontend
source ../backend/.venv/bin/activate
streamlit run app.py --server.port 8501

Or use the start scripts:

# Terminal 1
./backend/start.sh

# Terminal 2
./frontend/start.sh

Note: Always run backend scripts from the backend/ directory. Use python -m docwise_mcp.server (not python docwise_mcp/server.py) to start the MCP server standalone.


Architecture

Streamlit UI (:8501)
      │  HTTP
      ▼
FastAPI Backend (:8000)
      │
      ├─ POST /ingest/file
      │     ParserRegistry → AdaptiveChunker → FastEmbed → ChromaDB
      │
      ├─ POST /query
      │     QueryUnderstanding → HybridSearch → AnswerSynthesis
      │         (LLM reformulation)   (BM25+RRF)    (citations)
      │
      ├─ GET|DELETE /documents
      │
      └─ MCP Server (:8001 SSE)
            ingest_document / search_documents / ask_question / list_documents

MCP — Claude Desktop Integration

Add to ~/Library/Application Support/Claude/claude_desktop_config.json:

{
  "mcpServers": {
    "docwise": {
      "url": "http://localhost:8001/sse"
    }
  }
}

Claude Desktop will then have access to all four DocWise tools.


Configuration

All settings are overridable via environment variables or backend/.env:

Variable            Default                 Notes
LLM_PROVIDER        openai                  openai or anthropic
LLM_MODEL           gpt-4o-mini             any model name
OPENAI_API_KEY      (none)                  required if using OpenAI
ANTHROPIC_API_KEY   (none)                  required if using Anthropic
EMBEDDING_PROVIDER  fastembed               fastembed or openai
EMBEDDING_MODEL     BAAI/bge-small-en-v1.5  ONNX model name
CHUNK_SIZE          512                     max chars per text chunk
CHUNK_OVERLAP       50                      overlap between chunks
TOP_K               10                      chunks retrieved per query
CHROMA_PATH         ./data/chroma           vector store persistence path
API_PORT            8000                    FastAPI port
MCP_PORT            8001                    MCP SSE port
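For example, a backend/.env that switches the LLM to Anthropic and widens retrieval might look like this (the model name and values are illustrative; any Anthropic model name works):

```ini
# backend/.env
LLM_PROVIDER=anthropic
LLM_MODEL=claude-3-5-haiku-latest
ANTHROPIC_API_KEY=sk-ant-...
TOP_K=15
```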

Project Structure

docwise/
├── backend/
│   ├── main.py                    # Uvicorn entrypoint
│   ├── config.py                  # All settings
│   ├── logger.py                  # structlog setup
│   ├── api/
│   │   ├── app.py                 # FastAPI app factory
│   │   └── routes/
│   │       ├── ingest.py          # POST /ingest/file
│   │       ├── query.py           # POST /query, /search
│   │       └── documents.py       # GET/DELETE /documents
│   ├── parsing/
│   │   ├── base.py                # ParsedElement, ParsedDocument
│   │   ├── registry.py            # ext → parser routing
│   │   ├── pdf_parser.py          # PyMuPDF + pdfplumber
│   │   ├── docx_parser.py         # python-docx
│   │   └── markdown_parser.py     # mistune
│   ├── chunking/
│   │   └── adaptive.py            # Content-type-aware chunking
│   ├── embeddings/
│   │   └── provider.py            # FastEmbed / OpenAI, singleton
│   ├── vectorstore/
│   │   └── chroma.py              # ChromaDB + BM25 + RRF
│   ├── agents/
│   │   ├── query_understanding.py # LLM query reformulation
│   │   ├── answer_synthesis.py    # LLM grounded answer + citations
│   │   └── pipeline.py            # Orchestrates understanding + retrieval + synthesis
│   └── docwise_mcp/
│       └── server.py              # FastMCP HTTP/SSE server
├── frontend/
│   └── app.py                     # Streamlit 3-tab UI
├── docker/
│   ├── Dockerfile.backend
│   ├── Dockerfile.frontend
│   └── docker-compose.yml
├── docs/
│   ├── REQUIREMENTS.md
│   ├── ARCHITECTURE.md
│   └── PLAN.md
└── samples/
    └── sample.md

Tech Stack

Layer             Technology
API               FastAPI + Uvicorn
PDF parsing       PyMuPDF + pdfplumber
DOCX parsing      python-docx
Markdown parsing  mistune
Embeddings        FastEmbed (BAAI/bge-small-en-v1.5)
Vector store      ChromaDB (embedded)
Keyword search    rank-bm25
LLM               OpenAI / Anthropic (configurable)
MCP               mcp (FastMCP, HTTP/SSE)
Web UI            Streamlit
Config            pydantic-settings
Logging           structlog
Containerisation  Docker + Compose

Supported Formats

Format    Parser
PDF       PyMuPDF + pdfplumber
DOCX      python-docx
Markdown  mistune

Each parser emits type-tagged elements (headings, tables, code blocks, lists, images, text); see Features for details.

REST API Reference

Method  Path                        Description
POST    /ingest/file                Upload and ingest a document
POST    /query                      Ask a question (full agentic pipeline)
POST    /query/search               Raw hybrid search (no LLM)
GET     /documents                  List all ingested documents
GET     /documents/{doc_id}         Document metadata + type counts
GET     /documents/{doc_id}/chunks  All chunks for a document
DELETE  /documents/{doc_id}         Delete document + its chunks
GET     /health                     Health check + chunk count

Full interactive docs at http://localhost:8000/docs.
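A minimal Python client for the query endpoint might look like the sketch below. The JSON field names ("question", "top_k") are assumptions based on the configuration table, not the confirmed schema; check the Swagger docs at /docs before relying on them.

```python
import json
from urllib.request import Request, urlopen

API = "http://localhost:8000"

def build_query(question: str, top_k: int = 10) -> dict:
    # Payload shape is an assumption; confirm the field names in the
    # interactive Swagger schema at /docs.
    return {"question": question, "top_k": top_k}

def ask(question: str) -> dict:
    # POST /query runs the full agentic pipeline and returns the
    # grounded answer with citations.
    req = Request(
        f"{API}/query",
        data=json.dumps(build_query(question)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req, timeout=60) as resp:
        return json.load(resp)
```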

About

RAG-in-a-Box — intelligent document understanding for PDF, DOCX, and Markdown. Adaptive parsing, hybrid search (semantic + BM25), and agentic Q&A with citations. Upload, ask questions, and explore stats via built-in Streamlit app, REST API, or Claude Desktop (MCP).
