A local, privacy-first Streamlit app that turns any PDF into a searchable vector database — no API keys required.
VecDock walks you through a 6-step wizard:
Upload PDF → Preview → Extract Text → Configure → Embed & Store → Search
Upload any PDF, extract its text (with automatic OCR fallback for scanned pages), split it into chunks, embed those chunks using a local sentence-transformer model, persist everything in ChromaDB, then run semantic or keyword searches against it — all from a clean dark-mode UI.
| Step 2 — PDF Reader | Step 5 — Live Embed Log | Step 6 — Search |
|---|---|---|
| Single-page viewer, ◀ ▶ navigation | Real-time progress as model loads & embeds | Ranked results with similarity scores |
- 6-step guided workflow — each step is independent and resumable
- Single-page PDF viewer — renders one page at a time, safe for 500+ page documents
- Automatic OCR fallback — detects image-heavy pages and runs Tesseract automatically
- 3 chunking strategies — sentence-aware, paragraph-aware, fixed window
- Configurable overlap — adjustable chunk size (100–2000 chars) and overlap (0–400 chars)
- Live embedding log — see exactly what's happening (model download, batch progress, ETA)
- Background threading — embedding runs on a worker thread so the UI never freezes
- 2 embedding providers — sentence-transformers (local) or Ollama (local LLM server)
- 5 built-in models — from a fast ~90 MB baseline to ~470 MB high-quality and multilingual options
- 3 search modes — semantic (cosine), ANN (HNSW), keyword (exact match + vector rank)
- Page filter — restrict search results to a specific page range
- JSON export — download search results for use in other tools
- Persistent storage — ChromaDB persists to disk, survives restarts
- Python 3.10 or newer
- No external API keys — everything runs locally
git clone https://github.com/tsejavhaa/chromadb.git
cd chromadb

# Install the Python version you want (3.10+ required)
pyenv install 3.11.9
# Set it locally for this project
cd vecdock
pyenv local 3.11.9
# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate # macOS / Linux
.venv\Scripts\activate   # Windows

Tip: After activation, confirm you're using the right Python:
python --version   # should show 3.11.9 (or whichever you set)
which python       # should point inside .venv/
pip install -r requirements.txt

OCR support (optional — for scanned / image PDFs):
# 1. Uncomment the pytesseract / pdf2image lines in requirements.txt, then:
pip install -r requirements.txt
# 2. Install Tesseract on your OS:
# macOS: brew install tesseract
# Ubuntu: sudo apt install tesseract-ocr
# Windows: https://github.com/UB-Mannheim/tesseract/wiki

Ollama support (optional — alternative embedding provider):
# Install Ollama from https://ollama.com, then pull a model:
ollama pull nomic-embed-text
# No extra pip package needed.

Create .streamlit/config.toml in the project root to raise the file upload limit and disable telemetry:
mkdir -p .streamlit
cat > .streamlit/config.toml << 'EOF'
[server]
maxUploadSize = 500
[browser]
gatherUsageStats = false
EOF

Without this, Streamlit caps uploads at 200 MB — large PDFs will silently fail.
streamlit run app.py

The app opens at http://localhost:8501.
vecdock/
│
├── app.py                     # Entry point — page config, session state, router
│
├── utils/
│   ├── pdf_utils.py           # PDF page count, single-page renderer, text extraction, OCR
│   ├── chunker.py             # Chunk dataclass + 3 splitting strategies
│   └── vectordb.py            # Embedding loader, ChromaDB helpers, search, background worker
│
├── ui/
│   ├── styles.py              # All CSS (dark theme, slider fix, sidebar toggle)
│   └── sidebar.py             # Sidebar nav, DB status card, Start Over button
│
└── pages/
    ├── upload_preview.py      # Step 1: Upload | Step 2: PDF reader
    ├── extract_configure.py   # Step 3: Extract text | Step 4: Configure
    └── embed_search.py        # Step 5: Embed & Store | Step 6: Search
Drag and drop any .pdf file. File size and name are shown on confirmation. Uploading a new file automatically clears cached extraction and preview data.
A single-page viewer renders one page at a time using PyMuPDF. Use ◀ / ▶ buttons or type a page number directly. This approach is memory-safe for documents with hundreds of pages.
Text is extracted with pdfplumber. Pages where the native text layer is sparse (< 50 characters) are automatically sent through Tesseract OCR. Each page shows its character count and source (TEXT or OCR), and empty pages are flagged.
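The fallback decision can be sketched in a few lines. This is an illustrative simplification, not the actual helper in utils/pdf_utils.py; the 50-character threshold matches the description above, and `ocr_fn` stands in for a Tesseract call:

```python
def needs_ocr(page_text: str, min_chars: int = 50) -> bool:
    """Treat a page as image-heavy when its native text layer is sparse."""
    return len(page_text.strip()) < min_chars

def extract_page(native_text: str, ocr_fn) -> tuple[str, str]:
    """Return (text, source) where source is 'OCR' or 'TEXT'."""
    if needs_ocr(native_text):
        # ocr_fn would run pytesseract on a rendered page image
        return ocr_fn(), "OCR"
    return native_text, "TEXT"
```

For example, `extract_page("", lambda: "scanned words")` returns `("scanned words", "OCR")`.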
Embedding model:
| Provider | Model | Size | Notes |
|---|---|---|---|
| sentence-transformers | all-MiniLM-L6-v2 | ~90 MB | Fast, great general baseline |
| sentence-transformers | all-MiniLM-L12-v2 | ~120 MB | Better quality than L6 |
| sentence-transformers | all-mpnet-base-v2 | ~420 MB | High quality, 768-dim |
| sentence-transformers | paraphrase-multilingual-MiniLM-L12-v2 | ~470 MB | 50+ languages |
| sentence-transformers | BAAI/bge-small-en-v1.5 | ~130 MB | Strong retrieval benchmarks |
| Ollama | nomic-embed-text | local | Requires Ollama running |
Chunking strategies:
| Strategy | Description |
|---|---|
| sentence | Splits at sentence boundaries (.!?), fills window up to chunk size |
| paragraph | Groups whole paragraphs up to chunk size, preserves structure |
| fixed | Rolling character window — fastest, no semantic awareness |
A live preview shows how your chosen strategy splits the first page, with estimated total chunk count.
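As an illustration, the fixed strategy reduces to a rolling character window. This is a simplified sketch, not the implementation in utils/chunker.py:

```python
def fixed_chunks(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Rolling window: each chunk starts (size - overlap) chars after the last."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]
```

For example, `fixed_chunks("abcdefghijkl", size=5, overlap=2)` yields `["abcde", "defgh", "ghijk", "jkl"]` — each chunk shares its last 2 characters with the start of the next.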
Click Embed & Store in ChromaDB. A live log terminal shows each stage as it happens:
✂️ Building text chunks…
✅ Created 3890 chunks (method=sentence, size=300, overlap=50)
─── Loading embedding model ───────────────
⬇️ Downloading / loading model 'all-MiniLM-L6-v2' from HuggingFace Hub…
(first run ~90 MB download, subsequent runs load from cache)
✅ Model ready.
─── Setting up ChromaDB ───────────────────
💾 Opening ChromaDB at './chroma_db'…
🆕 Creating collection 'pdf_docs' with cosine similarity…
─── Embedding chunks ──────────────────────
🔢 Embedding and storing 3890 chunks in batches of 64…
[ 1%] 64/3890 chunks stored
...
✅ All 3890 chunks embedded and stored.
🎉 Done!
Embedding runs on a background thread — the page stays responsive and auto-refreshes every second.
Enter a natural language query. Three search modes are available:
| Mode | How it works |
|---|---|
| semantic | Cosine similarity on embeddings — best for paraphrased or conceptual queries |
| ann | Approximate Nearest Neighbor via ChromaDB's HNSW index |
| keyword | Must contain the exact query string, then ranked by vector similarity |
Results show rank, page number, similarity score (%), and the matched text chunk. Export all results as JSON with one click.
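Conceptually, the semantic and keyword modes reduce to cosine ranking with an optional substring filter. A minimal pure-Python sketch (not the utils/vectordb.py implementation, which delegates to ChromaDB):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def search(query_vec, query_text, docs, mode="semantic", top_k=5):
    """docs: list of (text, embedding). keyword mode filters on the exact string first."""
    if mode == "keyword":
        docs = [d for d in docs if query_text in d[0]]
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]
```

In the app the query embedding comes from the same model used at indexing time; here it is supplied directly for illustration.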
All settings are in-app — no config files needed. The defaults are:
| Setting | Default | Description |
|---|---|---|
| db_path | ./chroma_db | ChromaDB persist directory |
| collection_name | pdf_docs | ChromaDB collection name |
| embed_model | sentence-transformers | Provider |
| st_model | all-MiniLM-L6-v2 | Default sentence-transformer model |
| chunk_method | sentence | Chunking strategy |
| chunk_size | 500 | Max characters per chunk |
| chunk_overlap | 50 | Character overlap between chunks |
| top_k | 5 | Default results to retrieve |
Model loading and embedding can take 30–120 seconds for large PDFs. Running this on Streamlit's main thread would block Tornado's WebSocket keepalive pings, disconnecting the browser. VecDock uses a ThreadPoolExecutor — the main thread submits the job and polls every second, keeping the connection alive.
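The submit-and-poll pattern described above is plain `concurrent.futures`. A stripped-down sketch (the real worker lives in utils/vectordb.py; `embed_job` is a stand-in for the model/ChromaDB calls):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def embed_job(chunks: list[str], batch_size: int = 64) -> int:
    """Stand-in for the real embed loop: process chunks batch by batch."""
    stored = 0
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        # model.encode(batch) and collection.add(...) would go here
        stored += len(batch)
    return stored

executor = ThreadPoolExecutor(max_workers=1)
future = executor.submit(embed_job, ["chunk"] * 200)

# The Streamlit page polls the future on each rerun instead of blocking:
while not future.done():
    time.sleep(0.05)  # in the app this is the 1-second auto-refresh
print(future.result())  # 200
```

Because the main thread never blocks for more than one poll interval, Tornado's keepalive pings keep flowing and the WebSocket stays connected.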
Rendering all pages to base64 PNGs at startup crashes the WebSocket for large PDFs. Instead, render_single_page() opens the PDF, renders exactly one page, and closes it — no accumulation in memory.
Streamlit's number_input with a key= parameter locks that slot in session_state. Writing to it after widget instantiation raises a StreamlitAPIException. The page navigator avoids this by using no key on the number input — Streamlit uses a positional auto-key and value=cur re-renders it correctly on every rerun.
ModuleNotFoundError: No module named 'fitz'
pip install pymupdf

ModuleNotFoundError: No module named 'pdfplumber'
pip install pdfplumber pypdf

OCR not working / blank text on scanned PDFs
pip install pytesseract pdf2image
# Then install Tesseract on your OS (see Installation above)

Ollama connection refused
Make sure Ollama is running before starting the app:
ollama serve

chromadb version conflicts
ChromaDB updates frequently. If you hit dependency issues:
pip install "chromadb>=0.4.0" --upgrade

Embedding takes very long on first run
The model is being downloaded from HuggingFace Hub (~90–420 MB depending on the model). Subsequent runs load from the local cache at ~/.cache/huggingface/. Check the live log in Step 5 for progress.
| Package | Purpose |
|---|---|
| streamlit | Web UI framework |
| chromadb | Local vector database with HNSW index |
| sentence-transformers | Local embedding models via HuggingFace |
| pymupdf (fitz) | Fast PDF rendering and page-by-page image export |
| pdfplumber | Accurate text extraction from PDFs |
| pypdf | Fallback PDF reader and page count |
| pytesseract | OCR for scanned / image-only pages (optional) |
| pdf2image | Convert PDF pages to PIL images for OCR (optional) |
MIT — use freely, modify freely, attribution appreciated.
Pull requests welcome. The codebase is intentionally modular — each concern lives in its own file with no cross-dependencies except through the utils/ layer.
- Fork the repo
- Create a feature branch: git checkout -b feature/my-thing
- Commit your changes: git commit -m "add my thing"
- Push and open a PR
Built with Streamlit · ChromaDB · sentence-transformers