PDF Dashboard With MCP

Upload PDFs, extract their text with PyMuPDF (falling back to GLM-OCR through Ollama for scanned pages), and chat with the document using a local LLM. Everything runs on your machine through Ollama — no API keys or internet connection required.

Features

  • PDF extraction — fast text-layer extraction via PyMuPDF; automatic GLM-OCR fallback for scanned/image-based PDFs
  • Per-document RAG — each uploaded PDF gets its own Chroma vector collection
  • Local LLM chat — agentic Q&A with inline citations powered by Ollama; pick any installed Ollama model from the dropdown
  • Markdown viewer — browse extracted text, preview chunks, and download the markdown
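
The PyMuPDF-to-OCR fallback boils down to a per-page decision: pages whose text layer is empty or nearly empty get routed to GLM-OCR. A minimal sketch of that routing, with a hypothetical character threshold (the real extract.py may use different heuristics):

```python
# Sketch of the text-layer-vs-OCR routing decision. The threshold is
# hypothetical; the project's extract.py may decide differently.
MIN_CHARS_PER_PAGE = 20  # below this, treat the page as scanned

def choose_extractor(page_texts):
    """For each page's text-layer output, return 'text' or 'ocr'."""
    routes = []
    for text in page_texts:
        if len(text.strip()) >= MIN_CHARS_PER_PAGE:
            routes.append("text")  # usable text layer: keep PyMuPDF output
        else:
            routes.append("ocr")   # empty/near-empty: send page to GLM-OCR
    return routes
```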

Prerequisites

  • Python 3.11+
  • uv — Python package manager
  • Ollama — local LLM runtime

Setup

1. Clone the repository

git clone https://github.com/dakshp26/PDFDashboardWithMCP.git
cd PDFDashboardWithMCP

2. Install dependencies

uv sync

3. Pull Ollama models

ollama pull qwen2.5:3b       # chat / agent (or any other chat model)
ollama pull nomic-embed-text # embeddings
ollama pull glm-ocr          # OCR fallback (scanned PDFs)

4. Run the app

uv run streamlit run app/main.py

Open http://localhost:8501 in your browser.

Usage

  1. Upload PDF — go to the Upload PDF page, select a PDF, and wait for the extraction pipeline to finish
  2. Chat — switch to the Chat page, pick your PDF and any installed Ollama model from the dropdowns, and ask questions

Project Structure

app/
├── main.py                       # Entry point, page navigation
├── app_pages/
│   ├── landing.py                # Home / welcome page
│   ├── process_pdf_upload.py     # Upload + pipeline UI
│   ├── pdf_library.py            # Browse uploaded PDFs (read-only viewer)
│   └── process_pdf.py            # Viewer + chat UI
└── process_pdf/
    ├── extract.py                 # PDF → Markdown (pymupdf4llm + GLM-OCR)
    ├── pipeline.py                # Extraction pipeline with live progress
    ├── rag.py                     # Chunking, embeddings, Chroma persistence
    └── agent.py                   # LangChain agent with retriever tool
mcp_server/
└── server.py                     # MCP server (list_documents, get_document)
data/                             # Runtime data (gitignored)
├── process_pdf/                  # Saved PDFs and extracted markdown
└── process_chroma/               # Chroma vector collections (one per PDF)
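
The chunking step in rag.py can be approximated as fixed-size character windows with overlap, so that a sentence cut at a boundary still appears whole in the neighbouring chunk. A dependency-free sketch (window sizes are illustrative, not the project's actual parameters):

```python
def chunk_text(text, chunk_size=500, overlap=100):
    """Split extracted markdown into overlapping character windows."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece:
            chunks.append(piece)
        if start + chunk_size >= len(text):
            break
    return chunks
```

Each chunk is then embedded with nomic-embed-text and written to the PDF's Chroma collection.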

Note

For a detailed breakdown of every file, execution order, and data flow, see APP_STRUCTURE.md.

Streamlit Dashboard

The app is a four-page Streamlit dashboard:

  • Home — welcome page with a quick-start overview
  • Upload PDF — select a PDF, watch the extraction pipeline run in real time (text layer → OCR fallback → chunking → embedding), then download the extracted markdown
  • PDF Library — browse all previously uploaded PDFs; view extracted markdown and chunk previews without re-running the pipeline
  • Chat — pick an indexed PDF and any installed Ollama model, ask questions, and get answers with inline source citations

The pipeline progress is shown live inside an st.status block. After a PDF is processed, its vector collection persists in data/process_chroma/, so the next session loads instantly without re-running the pipeline.
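
At query time the persisted collection performs standard nearest-neighbour search over the chunk embeddings. A dependency-free sketch of the top-k cosine retrieval that Chroma does internally (in the app, the vectors come from nomic-embed-text rather than these toy values):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, chunk_vecs, k=2):
    """Indices of the k chunks most similar to the query embedding."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]
```

The agent in agent.py exposes this lookup to the LLM as a retriever tool and cites the returned chunks inline.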

MCP Server

The included MCP server exposes the vector store to any MCP-compatible client (Claude Desktop, Cursor, etc.) with two tools:

  • list_documents — returns all indexed document collections
  • get_document(document, query) — searches a collection and returns relevant chunks

Claude Desktop

Add to claude_desktop_config.json (usually %APPDATA%\Claude\claude_desktop_config.json on Windows), or use .mcp.json in the project root to keep it project-scoped:

{
  "mcpServers": {
    "PDFDashboardWithMCP": {
      "command": "uv",
      "args": ["run", "--directory", "/absolute/path/to/PDFDashboardWithMCP", "mcp_server/server.py"]
    }
  }
}

Cursor

Add to .cursor/mcp.json in your project root (or the global ~/.cursor/mcp.json):

{
  "mcpServers": {
    "PDFDashboardWithMCP": {
      "command": "uv",
      "args": ["run", "--directory", "/absolute/path/to/PDFDashboardWithMCP", "mcp_server/server.py"]
    }
  }
}

Claude Code

The recommended approach is a project-scoped .mcp.json in the repository root so the server is only active for this project and doesn't pollute your global config:

{
  "mcpServers": {
    "PDFDashboardWithMCP": {
      "command": "uv",
      "args": ["run", "--directory", "/absolute/path/to/PDFDashboardWithMCP", "mcp_server/server.py"]
    }
  }
}

Claude Code picks up .mcp.json automatically when you open the project. No extra setup needed.

Replace /absolute/path/to/PDFDashboardWithMCP with the absolute path to your cloned repository.

Ollama must be running with nomic-embed-text pulled for the MCP server to load collections.

Tech Stack

  • UI — Streamlit
  • PDF extraction — langchain-pymupdf4llm, PyMuPDF
  • OCR fallback — Ollama glm-ocr
  • Embeddings — Ollama nomic-embed-text
  • Vector store — Chroma (langchain-chroma)
  • LLM / agent — any Ollama chat model (e.g. qwen2.5:3b), via LangChain
  • Package manager — uv
  • MCP server — mcp[cli]
