⬡ VecDock — PDF → Vector Database

A local, privacy-first Streamlit app that turns any PDF into a searchable vector database — no API keys required.



What it does

VecDock walks you through a 6-step wizard:

Upload PDF → Preview → Extract Text → Configure → Embed & Store → Search

Upload any PDF, extract its text (with automatic OCR fallback for scanned pages), split it into chunks, embed those chunks using a local sentence-transformer model, persist everything in ChromaDB, then run semantic or keyword searches against it — all from a clean dark-mode UI.


Screenshots

Step 2 — PDF Reader: single-page viewer with ◀ / ▶ navigation
Step 5 — Live Embed Log: real-time progress as the model loads and embeds
Step 6 — Search: ranked results with similarity scores

Features

  • 6-step guided workflow — each step is independent and resumable
  • Single-page PDF viewer — renders one page at a time, safe for 500+ page documents
  • Automatic OCR fallback — detects image-heavy pages and runs Tesseract automatically
  • 3 chunking strategies — sentence-aware, paragraph-aware, fixed window
  • Configurable overlap — adjustable chunk size (100–2000 chars) and overlap (0–400 chars)
  • Live embedding log — see exactly what's happening (model download, batch progress, ETA)
  • Background threading — embedding runs on a worker thread so the UI never freezes
  • 2 embedding providers — sentence-transformers (local) or Ollama (local LLM server)
  • 5 built-in models — from 90 MB fast models to 420 MB high-quality ones
  • 3 search modes — semantic (cosine), ANN (HNSW), keyword (exact match + vector rank)
  • Page filter — restrict search results to a specific page range
  • JSON export — download search results for use in other tools
  • Persistent storage — ChromaDB persists to disk, survives restarts

Requirements

  • Python 3.10 or newer
  • No external API keys — everything runs locally

Installation

1. Clone the repository

git clone https://github.com/tsejavhaa/chromadb.git
cd chromadb

2. Set up Python with pyenv

# Install the Python version you want (3.10+ required)
pyenv install 3.11.9

# Set it locally for this project
cd vecdock
pyenv local 3.11.9

# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate        # macOS / Linux
.venv\Scripts\activate           # Windows

Tip: After activation, confirm you're using the right Python:

python --version   # should show 3.11.9 (or whichever you set)
which python       # should point inside .venv/

3. Install dependencies

pip install -r requirements.txt

OCR support (optional — for scanned / image PDFs):

# 1. Uncomment the pytesseract / pdf2image lines in requirements.txt, then:
pip install -r requirements.txt

# 2. Install Tesseract on your OS:
#   macOS:   brew install tesseract
#   Ubuntu:  sudo apt install tesseract-ocr
#   Windows: https://github.com/UB-Mannheim/tesseract/wiki

Ollama support (optional — alternative embedding provider):

# Install Ollama from https://ollama.com, then pull a model:
ollama pull nomic-embed-text
# No extra pip package needed.

4. Streamlit config (optional but recommended)

Create .streamlit/config.toml in the project root to raise the file upload limit and disable telemetry:

mkdir -p .streamlit
cat > .streamlit/config.toml << 'EOF'
[server]
maxUploadSize = 500

[browser]
gatherUsageStats = false
EOF

Without this, Streamlit caps uploads at 200 MB — large PDFs will silently fail.

5. Run the app

streamlit run app.py

The app opens at http://localhost:8501.


Project structure

vecdock/
│
├── app.py                      # Entry point — page config, session state, router
│
├── utils/
│   ├── pdf_utils.py            # PDF page count, single-page renderer, text extraction, OCR
│   ├── chunker.py              # Chunk dataclass + 3 splitting strategies
│   └── vectordb.py             # Embedding loader, ChromaDB helpers, search, background worker
│
├── ui/
│   ├── styles.py               # All CSS (dark theme, slider fix, sidebar toggle)
│   └── sidebar.py              # Sidebar nav, DB status card, Start Over button
│
└── pages/
    ├── upload_preview.py       # Step 1: Upload  |  Step 2: PDF reader
    ├── extract_configure.py    # Step 3: Extract text  |  Step 4: Configure
    └── embed_search.py         # Step 5: Embed & Store  |  Step 6: Search

Walkthrough

Step 1 — Upload PDF

Drag and drop any .pdf file. File size and name are shown on confirmation. Uploading a new file automatically clears cached extraction and preview data.

Step 2 — Preview PDF

A single-page viewer renders one page at a time using PyMuPDF. Use ◀ / ▶ buttons or type a page number directly. This approach is memory-safe for documents with hundreds of pages.
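The navigation logic reduces to keeping the requested page in range. A minimal sketch (`clamp_page` is an illustrative helper, not the app's actual function name):

```python
def clamp_page(requested: int, n_pages: int) -> int:
    """Keep the viewer inside [1, n_pages], whatever the user types."""
    return max(1, min(requested, n_pages))
```

`clamp_page(999, 10)` returns 10, so clicking ▶ past the last page or typing an out-of-range number never reaches the renderer.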

Step 3 — Extract Text

Text is extracted with pdfplumber. Pages where the native text layer is sparse (< 50 characters) are automatically sent through Tesseract OCR. Each page shows its character count and source (TEXT or OCR), and empty pages are flagged.
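The OCR fallback rule amounts to a per-page character-count check. A minimal sketch (the function and constant names are illustrative; the 50-character cutoff is the threshold described above):

```python
OCR_MIN_CHARS = 50  # below this, the native text layer is considered sparse

def needs_ocr(page_text: str, min_chars: int = OCR_MIN_CHARS) -> bool:
    """True when a page's native text layer is too sparse to trust,
    so the page should be re-read with Tesseract OCR."""
    return len(page_text.strip()) < min_chars
```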

Step 4 — Configure

Embedding model:

| Provider | Model | Size | Notes |
|---|---|---|---|
| sentence-transformers | all-MiniLM-L6-v2 | ~90 MB | Fast, great general baseline |
| sentence-transformers | all-MiniLM-L12-v2 | ~120 MB | Better quality than L6 |
| sentence-transformers | all-mpnet-base-v2 | ~420 MB | High quality, 768-dim |
| sentence-transformers | paraphrase-multilingual-MiniLM-L12-v2 | ~470 MB | 50+ languages |
| sentence-transformers | BAAI/bge-small-en-v1.5 | ~130 MB | Strong retrieval benchmarks |
| Ollama | nomic-embed-text | local | Requires Ollama running |

Chunking strategies:

| Strategy | Description |
|---|---|
| sentence | Splits at sentence boundaries (. ! ?), fills the window up to the chunk size |
| paragraph | Groups whole paragraphs up to the chunk size, preserving structure |
| fixed | Rolling character window; fastest, no semantic awareness |

A live preview shows how your chosen strategy splits the first page, with estimated total chunk count.
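The `fixed` strategy is the simplest to reason about: successive windows of `chunk_size` characters, each starting `size - overlap` characters after the previous one. A minimal sketch (illustrative, not the app's actual `chunker.py`):

```python
def fixed_chunks(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Rolling character window: each chunk starts size - overlap characters
    after the previous one, so neighbors share `overlap` characters."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

For example, `fixed_chunks("abcdefghij", size=4, overlap=1)` yields `["abcd", "defg", "ghij"]`; each chunk repeats the last character of the one before it.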

Step 5 — Embed & Store

Click Embed & Store in ChromaDB. A live log terminal shows each stage as it happens:

✂️  Building text chunks…
✅  Created 3890 chunks (method=sentence, size=300, overlap=50)

─── Loading embedding model ───────────────
⬇️  Downloading / loading model 'all-MiniLM-L6-v2' from HuggingFace Hub…
    (first run ~90 MB download, subsequent runs load from cache)
✅  Model ready.

─── Setting up ChromaDB ───────────────────
💾  Opening ChromaDB at './chroma_db'…
🆕  Creating collection 'pdf_docs' with cosine similarity…

─── Embedding chunks ──────────────────────
🔢  Embedding and storing 3890 chunks in batches of 64…
     [  1%]  64/3890 chunks stored
     ...
✅  All 3890 chunks embedded and stored.
🎉  Done!

Embedding runs on a background thread — the page stays responsive and auto-refreshes every second.

Step 6 — Search

Enter a natural language query. Three search modes are available:

| Mode | How it works |
|---|---|
| semantic | Cosine similarity on embeddings; best for paraphrased or conceptual queries |
| ann | Approximate Nearest Neighbor via ChromaDB's HNSW index |
| keyword | Must contain the exact query string, then ranked by vector similarity |

Results show rank, page number, similarity score (%), and the matched text chunk. Export all results as JSON with one click.
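The keyword mode's filter-then-rank behavior can be sketched with plain cosine similarity (a toy illustration with hand-made vectors; the real app ranks against embeddings stored in ChromaDB):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def keyword_search(query: str, query_vec: list[float],
                   chunks: list[tuple[str, list[float]]]) -> list[tuple[str, float]]:
    """Keep only chunks containing the exact query string (case-insensitive),
    then rank the survivors by vector similarity to the query embedding."""
    hits = [(text, cosine(query_vec, vec))
            for text, vec in chunks
            if query.lower() in text.lower()]
    return sorted(hits, key=lambda h: h[1], reverse=True)
```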


Configuration

All settings are in-app — no config files needed. The defaults are:

| Setting | Default | Description |
|---|---|---|
| db_path | ./chroma_db | ChromaDB persist directory |
| collection_name | pdf_docs | ChromaDB collection name |
| embed_model | sentence-transformers | Embedding provider |
| st_model | all-MiniLM-L6-v2 | Default sentence-transformer model |
| chunk_method | sentence | Chunking strategy |
| chunk_size | 500 | Max characters per chunk |
| chunk_overlap | 50 | Character overlap between chunks |
| top_k | 5 | Default number of results to retrieve |

Architecture notes

Background threading

Model loading and embedding can take 30–120 seconds for large PDFs. Running this on Streamlit's main thread would block Tornado's WebSocket keepalive pings, disconnecting the browser. VecDock uses a ThreadPoolExecutor — the main thread submits the job and polls every second, keeping the connection alive.
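A stripped-down version of this pattern, with a stand-in job in place of the real model load (all names here are illustrative, not the app's actual code):

```python
import time
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)
log: list[str] = []

def embed_job(log_lines: list[str]) -> str:
    # Stand-in for model loading + batch embedding; the worker appends
    # progress lines that the main thread renders on each poll.
    log_lines.append("model ready")
    time.sleep(0.1)
    log_lines.append("chunks stored")
    return "done"

future = executor.submit(embed_job, log)

# The main thread stays free. In Streamlit this check runs once per 1 s
# auto-refresh instead of a blocking loop, so WebSocket pings keep flowing.
while not future.done():
    time.sleep(0.05)
```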

Single-page PDF rendering

Rendering all pages to base64 PNGs at startup crashes the WebSocket for large PDFs. Instead, render_single_page() opens the PDF, renders exactly one page, and closes it — no accumulation in memory.

Widget state and navigation

Streamlit's number_input with a key= parameter locks that slot in session_state. Writing to it after widget instantiation raises a StreamlitAPIException. The page navigator avoids this by using no key on the number input — Streamlit uses a positional auto-key and value=cur re-renders it correctly on every rerun.


Troubleshooting

ModuleNotFoundError: No module named 'fitz'

pip install pymupdf

ModuleNotFoundError: No module named 'pdfplumber'

pip install pdfplumber pypdf

OCR not working / blank text on scanned PDFs

pip install pytesseract pdf2image
# Then install Tesseract on your OS (see Installation above)

Ollama connection refused

Make sure Ollama is running before starting the app:

ollama serve

chromadb version conflicts

ChromaDB updates frequently. If you hit dependency issues:

pip install "chromadb>=0.4.0" --upgrade

Embedding takes very long on first run

The model is being downloaded from HuggingFace Hub (~90–420 MB depending on the model). Subsequent runs load from the local cache at ~/.cache/huggingface/. Check the live log in Step 5 for progress.


Dependencies

| Package | Purpose |
|---|---|
| streamlit | Web UI framework |
| chromadb | Local vector database with HNSW index |
| sentence-transformers | Local embedding models via HuggingFace |
| pymupdf (fitz) | Fast PDF rendering and page-by-page image export |
| pdfplumber | Accurate text extraction from PDFs |
| pypdf | Fallback PDF reader and page count |
| pytesseract | OCR for scanned / image-only pages (optional) |
| pdf2image | Converts PDF pages to PIL images for OCR (optional) |

License

MIT — use freely, modify freely, attribution appreciated.


Contributing

Pull requests welcome. The codebase is intentionally modular — each concern lives in its own file with no cross-dependencies except through the utils/ layer.

  1. Fork the repo
  2. Create a feature branch: git checkout -b feature/my-thing
  3. Commit your changes: git commit -m "add my thing"
  4. Push and open a PR

Built with Streamlit · ChromaDB · sentence-transformers

About

Experiments and implementations of vector databases for semantic search and RAG pipelines.
