
DocWise

Intelligent document understanding — parse, store, and query your documents.

DocWise combines adaptive document parsing, hybrid semantic search, and an agentic Q&A pipeline into a single self-contained platform. Upload PDF, DOCX, or Markdown files and ask natural-language questions — DocWise returns grounded answers with source citations. It also ships as an MCP server for direct integration with Claude Desktop.


Screenshots

Upload — ingest PDF, DOCX, or Markdown with parsing summary and element type distribution

Ask — natural-language Q&A with grounded answers and source citations

Explorer — browse chunks, element distribution charts, and content search


Features

  • Adaptive parsing — extracts headings, tables, code blocks, lists, images, and text from PDF (PyMuPDF + pdfplumber), DOCX (python-docx), and Markdown (mistune). Each element is type-tagged.
  • Smart chunking — content-type-aware strategy: tables and code are preserved whole; long text is recursively split; headings are merged with their following paragraph; lists are split by item.
  • Hybrid search — combines ChromaDB cosine-similarity (semantic) with BM25 keyword search, fused via Reciprocal Rank Fusion (RRF). No external services needed; runs fully local.
  • Agentic Q&A pipeline — a Query Understanding agent reformulates and classifies the question, hybrid retrieval fetches the top-K chunks, and an Answer Synthesis agent composes a grounded answer with inline [N] citations.
  • MCP server — exposes list_documents, search_documents, ask_question, and ingest_document as tools over HTTP/SSE (port 8001). Works with Claude Desktop and any MCP-compatible client.
  • Streamlit web UI — three-tab interface: Upload, Ask, and Explorer (chunk browser with type-aware charts and search filters).
  • Local-first embeddings — uses FastEmbed (BAAI/bge-small-en-v1.5, ONNX) by default — no OpenAI key required. Switch to text-embedding-3-small via a single env var.
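The Reciprocal Rank Fusion step in hybrid search can be sketched as follows. This is a minimal illustration of the RRF technique, not the actual code in vectorstore/chroma.py; the chunk IDs are made up for the example.

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs via Reciprocal Rank Fusion.

    Each chunk scores 1 / (k + rank) for every list it appears in;
    k = 60 is the constant from the original RRF paper.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A chunk ranked highly by both semantic and BM25 search wins:
semantic = ["c3", "c1", "c2"]
bm25     = ["c1", "c5", "c3"]
print(rrf_fuse([semantic, bm25]))  # ['c1', 'c3', 'c5', 'c2']
```

Because RRF only uses ranks, not raw scores, it fuses cosine similarities and BM25 scores without any normalisation step.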

Quick Start (Docker)

Prerequisites: Docker + Docker Compose

git clone https://github.com/your-username/docwise.git
cd docwise

# 1. Add your LLM API key
cp backend/.env.example backend/.env
# edit backend/.env and set OPENAI_API_KEY or ANTHROPIC_API_KEY

# 2. Build and start all services
cd docker
docker-compose up --build

First build takes a few minutes — it installs dependencies and downloads the FastEmbed ONNX model inside the container.

Service            URL
Streamlit UI       http://localhost:8501
FastAPI + Swagger  http://localhost:8000/docs
MCP SSE endpoint   http://localhost:8001/sse

Useful commands:

# Run in background
docker-compose up --build -d

# View logs
docker-compose logs -f

# View logs for one service
docker-compose logs -f backend
docker-compose logs -f frontend

# Stop services
docker-compose down

# Stop and remove volumes (wipes ChromaDB data)
docker-compose down -v

Quick Start (Local)

Prerequisites: Python 3.12+

# Backend
cd backend
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env   # already in backend/ — add your API key
python main.py         # starts API on :8000 + MCP on :8001

# Frontend (new terminal) — reuse backend venv, no separate pip install
cd frontend
source ../backend/.venv/bin/activate
streamlit run app.py --server.port 8501

Or use the start scripts:

# Terminal 1
./backend/start.sh

# Terminal 2
./frontend/start.sh

Note: Always run backend scripts from the backend/ directory. Use python -m docwise_mcp.server (not python docwise_mcp/server.py) to start the MCP server standalone.


Architecture

Streamlit UI (:8501)
      │  HTTP
      ▼
FastAPI Backend (:8000)
      │
      ├─ POST /ingest/file
      │     ParserRegistry → AdaptiveChunker → FastEmbed → ChromaDB
      │
      ├─ POST /query
      │     QueryUnderstanding → HybridSearch → AnswerSynthesis
      │         (LLM reformulation)   (BM25+RRF)    (citations)
      │
      ├─ GET|DELETE /documents
      │
      └─ MCP Server (:8001 SSE)
            ingest_document / search_documents / ask_question / list_documents

MCP — Claude Desktop Integration

Add to ~/Library/Application Support/Claude/claude_desktop_config.json:

{
  "mcpServers": {
    "docwise": {
      "url": "http://localhost:8001/sse"
    }
  }
}

Claude Desktop will then have access to all four DocWise tools.


Configuration

All settings are overridable via environment variables or backend/.env:

Variable            Default                 Notes
LLM_PROVIDER        openai                  openai or anthropic
LLM_MODEL           gpt-4o-mini             any model name
OPENAI_API_KEY      (none)                  required if using OpenAI
ANTHROPIC_API_KEY   (none)                  required if using Anthropic
EMBEDDING_PROVIDER  fastembed               fastembed or openai
EMBEDDING_MODEL     BAAI/bge-small-en-v1.5  ONNX model name
CHUNK_SIZE          512                     max chars per text chunk
CHUNK_OVERLAP       50                      overlap between chunks
TOP_K               10                      chunks retrieved per query
CHROMA_PATH         ./data/chroma           vector store persistence path
API_PORT            8000                    FastAPI port
MCP_PORT            8001                    MCP SSE port
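For example, a backend/.env that switches the LLM to Anthropic and widens retrieval might look like this (the model name and values are illustrative; any Anthropic model name works):

```ini
# backend/.env
LLM_PROVIDER=anthropic
LLM_MODEL=claude-3-5-haiku-latest
ANTHROPIC_API_KEY=sk-ant-...
TOP_K=15
```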

Project Structure

docwise/
├── backend/
│   ├── main.py                    # Uvicorn entrypoint
│   ├── config.py                  # All settings
│   ├── logger.py                  # structlog setup
│   ├── api/
│   │   ├── app.py                 # FastAPI app factory
│   │   └── routes/
│   │       ├── ingest.py          # POST /ingest/file
│   │       ├── query.py           # POST /query, /search
│   │       └── documents.py       # GET/DELETE /documents
│   ├── parsing/
│   │   ├── base.py                # ParsedElement, ParsedDocument
│   │   ├── registry.py            # ext → parser routing
│   │   ├── pdf_parser.py          # PyMuPDF + pdfplumber
│   │   ├── docx_parser.py         # python-docx
│   │   └── markdown_parser.py     # mistune
│   ├── chunking/
│   │   └── adaptive.py            # Content-type-aware chunking
│   ├── embeddings/
│   │   └── provider.py            # FastEmbed / OpenAI, singleton
│   ├── vectorstore/
│   │   └── chroma.py              # ChromaDB + BM25 + RRF
│   ├── agents/
│   │   ├── query_understanding.py # LLM query reformulation
│   │   ├── answer_synthesis.py    # LLM grounded answer + citations
│   │   └── pipeline.py            # Orchestrates understanding + retrieval + synthesis
│   └── docwise_mcp/
│       └── server.py              # FastMCP HTTP/SSE server
├── frontend/
│   └── app.py                     # Streamlit 3-tab UI
├── docker/
│   ├── Dockerfile.backend
│   ├── Dockerfile.frontend
│   └── docker-compose.yml
├── docs/
│   ├── REQUIREMENTS.md
│   ├── ARCHITECTURE.md
│   └── PLAN.md
└── samples/
    └── sample.md

Tech Stack

Layer             Technology
API               FastAPI + Uvicorn
PDF parsing       PyMuPDF + pdfplumber
DOCX parsing      python-docx
Markdown parsing  mistune
Embeddings        FastEmbed (BAAI/bge-small-en-v1.5)
Vector store      ChromaDB (embedded)
Keyword search    rank-bm25
LLM               OpenAI / Anthropic (configurable)
MCP               mcp (FastMCP, HTTP/SSE)
Web UI            Streamlit
Config            pydantic-settings
Logging           structlog
Containerisation  Docker + Compose

Supported Formats

Format    Parser
PDF       PyMuPDF + pdfplumber
DOCX      python-docx
Markdown  mistune

Each parser emits type-tagged elements (headings, tables, code blocks, lists, images, text); see Features for details.

REST API Reference

Method  Path                        Description
POST    /ingest/file                Upload and ingest a document
POST    /query                      Ask a question (full agentic pipeline)
POST    /query/search               Raw hybrid search (no LLM)
GET     /documents                  List all ingested documents
GET     /documents/{doc_id}         Document metadata + type counts
GET     /documents/{doc_id}/chunks  All chunks for a document
DELETE  /documents/{doc_id}         Delete document + its chunks
GET     /health                     Health check + chunk count

Full interactive docs at http://localhost:8000/docs.
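A minimal Python client for the query endpoint might look like the sketch below. The JSON field names ("question", "top_k") are assumptions based on the configuration table, not the confirmed schema; check the Swagger docs at /docs before relying on them.

```python
import json
from urllib.request import Request, urlopen

API = "http://localhost:8000"

def build_query(question: str, top_k: int = 10) -> dict:
    # Payload shape is an assumption; confirm the field names in the
    # interactive Swagger schema at /docs.
    return {"question": question, "top_k": top_k}

def ask(question: str) -> dict:
    # POST /query runs the full agentic pipeline and returns the
    # grounded answer with citations.
    req = Request(
        f"{API}/query",
        data=json.dumps(build_query(question)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req, timeout=60) as resp:
        return json.load(resp)
```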

About

RAG-in-a-Box — intelligent document understanding for PDF, DOCX, and Markdown. Adaptive parsing, hybrid search (semantic + BM25), and agentic Q&A with citations. Upload, ask questions, and explore stats via built-in Streamlit app, REST API, or Claude Desktop (MCP).
