⬡ VecDock — PDF → Vector Database

A local, privacy-first Streamlit app that turns any PDF into a searchable vector database — no API keys required.



What it does

VecDock walks you through a 6-step wizard:

Upload PDF → Preview → Extract Text → Configure → Embed & Store → Search

Upload any PDF, extract its text (with automatic OCR fallback for scanned pages), split it into chunks, embed those chunks using a local sentence-transformer model, persist everything in ChromaDB, then run semantic or keyword searches against it — all from a clean dark-mode UI.


Screenshots

Step 2 — PDF Reader: single-page viewer with ◀ / ▶ navigation
Step 5 — Live Embed Log: real-time progress as the model loads and embeds
Step 6 — Search: ranked results with similarity scores

Features

  • 6-step guided workflow — each step is independent and resumable
  • Single-page PDF viewer — renders one page at a time, safe for 500+ page documents
  • Automatic OCR fallback — detects image-heavy pages and runs Tesseract automatically
  • 3 chunking strategies — sentence-aware, paragraph-aware, fixed window
  • Configurable overlap — adjustable chunk size (100–2000 chars) and overlap (0–400 chars)
  • Live embedding log — see exactly what's happening (model download, batch progress, ETA)
  • Background threading — embedding runs on a worker thread so the UI never freezes
  • 2 embedding providers — sentence-transformers (local) or Ollama (local LLM server)
  • 5 built-in models — from 90 MB fast models to 420 MB high-quality ones
  • 3 search modes — semantic (cosine), ANN (HNSW), keyword (exact match + vector rank)
  • Page filter — restrict search results to a specific page range
  • JSON export — download search results for use in other tools
  • Persistent storage — ChromaDB persists to disk, survives restarts

Requirements

  • Python 3.10 or newer
  • No external API keys — everything runs locally

Installation

1. Clone the repository

git clone https://github.com/tsejavhaa/chromadb.git
cd chromadb

2. Set up Python with pyenv

# Install the Python version you want (3.10+ required)
pyenv install 3.11.9

# Set it locally for this project
cd vecdock
pyenv local 3.11.9

# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate        # macOS / Linux
.venv\Scripts\activate           # Windows

Tip: After activation, confirm you're using the right Python:

python --version   # should show 3.11.9 (or whichever you set)
which python       # should point inside .venv/

3. Install dependencies

pip install -r requirements.txt

OCR support (optional — for scanned / image PDFs):

# 1. Uncomment the pytesseract / pdf2image lines in requirements.txt, then:
pip install -r requirements.txt

# 2. Install Tesseract on your OS:
#   macOS:   brew install tesseract
#   Ubuntu:  sudo apt install tesseract-ocr
#   Windows: https://github.com/UB-Mannheim/tesseract/wiki

Ollama support (optional — alternative embedding provider):

# Install Ollama from https://ollama.com, then pull a model:
ollama pull nomic-embed-text
# No extra pip package needed.

4. Streamlit config (optional but recommended)

Create .streamlit/config.toml in the project root to raise the file upload limit and disable telemetry:

mkdir -p .streamlit
cat > .streamlit/config.toml << 'EOF'
[server]
maxUploadSize = 500

[browser]
gatherUsageStats = false
EOF

Without this, Streamlit caps uploads at 200 MB — large PDFs will silently fail.

5. Run the app

streamlit run app.py

The app opens at http://localhost:8501.


Project structure

vecdock/
│
├── app.py                      # Entry point — page config, session state, router
│
├── utils/
│   ├── pdf_utils.py            # PDF page count, single-page renderer, text extraction, OCR
│   ├── chunker.py              # Chunk dataclass + 3 splitting strategies
│   └── vectordb.py             # Embedding loader, ChromaDB helpers, search, background worker
│
├── ui/
│   ├── styles.py               # All CSS (dark theme, slider fix, sidebar toggle)
│   └── sidebar.py              # Sidebar nav, DB status card, Start Over button
│
└── pages/
    ├── upload_preview.py       # Step 1: Upload  |  Step 2: PDF reader
    ├── extract_configure.py    # Step 3: Extract text  |  Step 4: Configure
    └── embed_search.py         # Step 5: Embed & Store  |  Step 6: Search

Walkthrough

Step 1 — Upload PDF

Drag and drop any .pdf file. File size and name are shown on confirmation. Uploading a new file automatically clears cached extraction and preview data.

Step 2 — Preview PDF

A single-page viewer renders one page at a time using PyMuPDF. Use ◀ / ▶ buttons or type a page number directly. This approach is memory-safe for documents with hundreds of pages.
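The navigation logic reduces to keeping the requested page in range. A minimal sketch (`clamp_page` is an illustrative helper, not the app's actual function name):

```python
def clamp_page(requested: int, n_pages: int) -> int:
    """Keep the viewer inside [1, n_pages], whatever the user types."""
    return max(1, min(requested, n_pages))
```

`clamp_page(999, 10)` returns 10, so clicking ▶ past the last page or typing an out-of-range number never reaches the renderer.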

Step 3 — Extract Text

Text is extracted with pdfplumber. Pages where the native text layer is sparse (< 50 characters) are automatically sent through Tesseract OCR. Each page shows its character count and source (TEXT or OCR), and empty pages are flagged.
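The OCR fallback rule amounts to a per-page character-count check. A minimal sketch (the function and constant names are illustrative; the 50-character cutoff is the threshold described above):

```python
OCR_MIN_CHARS = 50  # below this, the native text layer is considered sparse

def needs_ocr(page_text: str, min_chars: int = OCR_MIN_CHARS) -> bool:
    """True when a page's native text layer is too sparse to trust,
    so the page should be re-read with Tesseract OCR."""
    return len(page_text.strip()) < min_chars
```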

Step 4 — Configure

Embedding model:

| Provider | Model | Size | Notes |
|---|---|---|---|
| sentence-transformers | all-MiniLM-L6-v2 | ~90 MB | Fast, great general baseline |
| sentence-transformers | all-MiniLM-L12-v2 | ~120 MB | Better quality than L6 |
| sentence-transformers | all-mpnet-base-v2 | ~420 MB | High quality, 768-dim |
| sentence-transformers | paraphrase-multilingual-MiniLM-L12-v2 | ~470 MB | 50+ languages |
| sentence-transformers | BAAI/bge-small-en-v1.5 | ~130 MB | Strong retrieval benchmarks |
| Ollama | nomic-embed-text | local | Requires Ollama running |

Chunking strategies:

| Strategy | Description |
|---|---|
| sentence | Splits at sentence boundaries (. ! ?), fills the window up to the chunk size |
| paragraph | Groups whole paragraphs up to the chunk size, preserving structure |
| fixed | Rolling character window; fastest, no semantic awareness |

A live preview shows how your chosen strategy splits the first page, with estimated total chunk count.
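The `fixed` strategy is the simplest to reason about: successive windows of `chunk_size` characters, each starting `size - overlap` characters after the previous one. A minimal sketch (illustrative, not the app's actual `chunker.py`):

```python
def fixed_chunks(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Rolling character window: each chunk starts size - overlap characters
    after the previous one, so neighbors share `overlap` characters."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

For example, `fixed_chunks("abcdefghij", size=4, overlap=1)` yields `["abcd", "defg", "ghij"]`; each chunk repeats the last character of the one before it.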

Step 5 — Embed & Store

Click Embed & Store in ChromaDB. A live log terminal shows each stage as it happens:

✂️  Building text chunks…
✅  Created 3890 chunks (method=sentence, size=300, overlap=50)

─── Loading embedding model ───────────────
⬇️  Downloading / loading model 'all-MiniLM-L6-v2' from HuggingFace Hub…
    (first run ~90 MB download, subsequent runs load from cache)
✅  Model ready.

─── Setting up ChromaDB ───────────────────
💾  Opening ChromaDB at './chroma_db'…
🆕  Creating collection 'pdf_docs' with cosine similarity…

─── Embedding chunks ──────────────────────
🔢  Embedding and storing 3890 chunks in batches of 64…
     [  1%]  64/3890 chunks stored
     ...
✅  All 3890 chunks embedded and stored.
🎉  Done!

Embedding runs on a background thread — the page stays responsive and auto-refreshes every second.

Step 6 — Search

Enter a natural language query. Three search modes are available:

| Mode | How it works |
|---|---|
| semantic | Cosine similarity on embeddings; best for paraphrased or conceptual queries |
| ann | Approximate Nearest Neighbor via ChromaDB's HNSW index |
| keyword | Must contain the exact query string, then ranked by vector similarity |

Results show rank, page number, similarity score (%), and the matched text chunk. Export all results as JSON with one click.
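The keyword mode's filter-then-rank behavior can be sketched with plain cosine similarity (a toy illustration with hand-made vectors; the real app ranks against embeddings stored in ChromaDB):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def keyword_search(query: str, query_vec: list[float],
                   chunks: list[tuple[str, list[float]]]) -> list[tuple[str, float]]:
    """Keep only chunks containing the exact query string (case-insensitive),
    then rank the survivors by vector similarity to the query embedding."""
    hits = [(text, cosine(query_vec, vec))
            for text, vec in chunks
            if query.lower() in text.lower()]
    return sorted(hits, key=lambda h: h[1], reverse=True)
```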


Configuration

All settings are in-app — no config files needed. The defaults are:

| Setting | Default | Description |
|---|---|---|
| db_path | ./chroma_db | ChromaDB persist directory |
| collection_name | pdf_docs | ChromaDB collection name |
| embed_model | sentence-transformers | Embedding provider |
| st_model | all-MiniLM-L6-v2 | Default sentence-transformer model |
| chunk_method | sentence | Chunking strategy |
| chunk_size | 500 | Max characters per chunk |
| chunk_overlap | 50 | Character overlap between chunks |
| top_k | 5 | Default number of results to retrieve |

Architecture notes

Background threading

Model loading and embedding can take 30–120 seconds for large PDFs. Running this on Streamlit's main thread would block Tornado's WebSocket keepalive pings, disconnecting the browser. VecDock uses a ThreadPoolExecutor — the main thread submits the job and polls every second, keeping the connection alive.
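A stripped-down version of this pattern, with a stand-in job in place of the real model load (all names here are illustrative, not the app's actual code):

```python
import time
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)
log: list[str] = []

def embed_job(log_lines: list[str]) -> str:
    # Stand-in for model loading + batch embedding; the worker appends
    # progress lines that the main thread renders on each poll.
    log_lines.append("model ready")
    time.sleep(0.1)
    log_lines.append("chunks stored")
    return "done"

future = executor.submit(embed_job, log)

# The main thread stays free. In Streamlit this check runs once per 1 s
# auto-refresh instead of a blocking loop, so WebSocket pings keep flowing.
while not future.done():
    time.sleep(0.05)
```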

Single-page PDF rendering

Rendering all pages to base64 PNGs at startup crashes the WebSocket for large PDFs. Instead, render_single_page() opens the PDF, renders exactly one page, and closes it — no accumulation in memory.

Widget state and navigation

Streamlit's number_input with a key= parameter locks that slot in session_state. Writing to it after widget instantiation raises a StreamlitAPIException. The page navigator avoids this by using no key on the number input — Streamlit uses a positional auto-key and value=cur re-renders it correctly on every rerun.


Troubleshooting

ModuleNotFoundError: No module named 'fitz'

pip install pymupdf

ModuleNotFoundError: No module named 'pdfplumber'

pip install pdfplumber pypdf

OCR not working / blank text on scanned PDFs

pip install pytesseract pdf2image
# Then install Tesseract on your OS (see Installation above)

Ollama connection refused

Make sure Ollama is running before starting the app:

ollama serve

chromadb version conflicts

ChromaDB updates frequently. If you hit dependency issues:

pip install "chromadb>=0.4.0" --upgrade

Embedding takes very long on first run

The model is being downloaded from HuggingFace Hub (~90–420 MB depending on the model). Subsequent runs load from the local cache at ~/.cache/huggingface/. Check the live log in Step 5 for progress.


Dependencies

| Package | Purpose |
|---|---|
| streamlit | Web UI framework |
| chromadb | Local vector database with HNSW index |
| sentence-transformers | Local embedding models via HuggingFace |
| pymupdf (fitz) | Fast PDF rendering and page-by-page image export |
| pdfplumber | Accurate text extraction from PDFs |
| pypdf | Fallback PDF reader and page count |
| pytesseract | OCR for scanned / image-only pages (optional) |
| pdf2image | Converts PDF pages to PIL images for OCR (optional) |

License

MIT — use freely, modify freely, attribution appreciated.


Contributing

Pull requests welcome. The codebase is intentionally modular — each concern lives in its own file with no cross-dependencies except through the utils/ layer.

  1. Fork the repo
  2. Create a feature branch: git checkout -b feature/my-thing
  3. Commit your changes: git commit -m "add my thing"
  4. Push and open a PR

Built with Streamlit · ChromaDB · sentence-transformers

About

Experiments and implementations of vector databases for semantic search and RAG pipelines.
