darylalim/embedding-pipeline

Embedding Pipeline

A Streamlit web app that generates vector embeddings from PDF documents and images, then searches over them, using Nomic's ColNomic Embed Multimodal 3B model.

Features

  • Multi-PDF and image upload (PNG, JPG, JPEG, WebP) with batch or incremental embedding
  • PDF page rendering at configurable DPI (72, 150, 300) via PyMuPDF
  • Multi-vector embeddings with ColNomic Embed Multimodal 3B
  • Cross-document text search with top-K and score threshold filtering
  • Optional per-document search filtering
  • Automatic device selection (MPS > CUDA > CPU)
  • Per-document and combined JSON downloads with embeddings, DPI, and timing
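
The search features above (multi-vector embeddings, top-K, score threshold, per-document filter) can be illustrated with a minimal sketch. It assumes the late-interaction (MaxSim) scoring commonly used with multi-vector models; the app's actual scoring code is not shown in this README. The function names, the `pages` record layout (`doc`, `page`, `vectors`), and the parameter names are all hypothetical.

```python
from typing import Optional, Sequence

Vector = Sequence[float]


def maxsim_score(query_vecs: Sequence[Vector], page_vecs: Sequence[Vector]) -> float:
    """Assumed late-interaction score: for each query vector, take its best
    dot product against any page vector, then sum over the query vectors."""
    return sum(
        max(sum(q * p for q, p in zip(qv, pv)) for pv in page_vecs)
        for qv in query_vecs
    )


def search(
    query_vecs: Sequence[Vector],
    pages: Sequence[dict],
    top_k: int = 5,
    min_score: float = 0.0,
    doc_filter: Optional[str] = None,
) -> list:
    """Rank pages across documents; `pages` uses hypothetical keys
    'doc', 'page', and 'vectors' (one list of vectors per page)."""
    hits = []
    for page in pages:
        # Optional per-document filtering.
        if doc_filter is not None and page["doc"] != doc_filter:
            continue
        score = maxsim_score(query_vecs, page["vectors"])
        # Score threshold filtering.
        if score >= min_score:
            hits.append((score, page["doc"], page["page"]))
    # Highest score first, then keep the top K.
    hits.sort(key=lambda h: h[0], reverse=True)
    return hits[:top_k]


# Toy data: 2-dimensional vectors standing in for real embeddings.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = [
    {"doc": "a.pdf", "page": 1, "vectors": [[1.0, 0.0], [0.0, 1.0]]},
    {"doc": "a.pdf", "page": 2, "vectors": [[0.5, 0.5]]},
    {"doc": "b.pdf", "page": 1, "vectors": [[0.0, 0.1]]},
]
results = search(query, pages, top_k=2, min_score=0.5)
```

With this toy data, `results` keeps only the two `a.pdf` pages: the `b.pdf` page falls below the threshold. Passing `doc_filter="b.pdf"` with `min_score=0.0` would restrict the search to that document instead.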

Setup

uv sync
uv run streamlit run streamlit_app.py

Development

uv run ruff check .   # lint
uv run ruff format .  # format
uv run ty check       # typecheck
uv run pytest         # test
