A tiny media search engine. It ingests videos, extracts 1 fps frames and audio, transcribes speech to text, embeds visuals and text, builds a vector index, and serves semantic search.
Bonus: segment-level retrieval (30-second windows) and a Streamlit viewer with thumbnails, jump-to-time, and GIF export.
Top segments (GIFs)
- HTTP docs page: https://<PUBLIC_API_URL>/docs
- Segment viewer: https://<PUBLIC_VIEWER_URL>:8501
On Lightning Studio, set ports 8000 (service) and 8501 (viewer) to Public.
- Preprocess: ffmpeg writes frames to `artifacts/frames/<asset>/*.jpg` and mono 16 kHz audio.
- Speech-to-text: Whisper outputs `artifacts/transcripts/<asset>.json`.
- Embeddings: OpenCLIP image (frames) and OpenCLIP text (transcript), fused by weight `alpha`, then L2-normalized.
- Indexing: Faiss inner-product (cosine on normalized vectors).
  - Asset level: one vector per video → `artifacts/faiss_hnsw.index`, `artifacts/meta.jsonl`.
  - Segment level: 30 s windows → `artifacts/faiss_segments.index`, `artifacts/meta_segments.jsonl`.
- Serving: FastAPI endpoints for whole assets and segments.
- Viewing: Streamlit app to type a query, preview top segments, jump to timestamps, and export GIFs.
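The fuse-and-normalize step above can be sketched in a few lines. This is a minimal sketch, not the project's actual code: the array shapes, the mean-pooling of frame embeddings, and the `alpha` default are illustrative assumptions, and plain NumPy stands in for `faiss.IndexFlatIP`, which computes the same inner product on unit vectors.

```python
import numpy as np

def fuse(frame_vecs: np.ndarray, text_vec: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Blend pooled frame embeddings with the transcript embedding,
    then L2-normalize so inner product equals cosine similarity."""
    visual = frame_vecs.mean(axis=0)              # pool the 1 fps frame embeddings
    fused = alpha * visual + (1.0 - alpha) * text_vec
    return fused / np.linalg.norm(fused)

rng = np.random.default_rng(0)
dim = 512                                         # ViT-B/32 output size
db = np.stack([
    fuse(rng.normal(size=(30, dim)), rng.normal(size=dim))
    for _ in range(4)                             # 4 fake "videos"
])

q = db[2]                                         # query with a known vector
scores = db @ q                                   # what faiss.IndexFlatIP computes
best = int(np.argmax(scores))                     # the query matches itself
```

Because every stored vector is unit length, the inner-product index returns cosine similarity directly, which is why the pipeline normalizes after fusion.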
Run the HTTP service:

```bash
export PYTHONPATH=.
uvicorn src.serve.app:app --host 0.0.0.0 --port 8000
# then open http://127.0.0.1:8000/docs (or your Studio public URL)
```
Send search requests:

```bash
curl -s http://127.0.0.1:8000/healthz | python -m json.tool
curl -sG --data-urlencode 'q=bunny in the forest' http://127.0.0.1:8000/query | python -m json.tool
curl -sG --data-urlencode 'q=spaceship with robots' http://127.0.0.1:8000/query_segments | python -m json.tool
```
Run the viewer:

```bash
streamlit run apps/segments_viewer.py --server.address 0.0.0.0 --server.port 8501
```
- GET `/healthz` → `{ ok, assets, asset_index_ntotal, segments, segment_index_ntotal }`
- GET `/query?q=...&k=...` → top-K whole-asset matches with similarity scores
- GET `/query_segments?q=...&k=...` → top-K segment matches with start/end timestamps
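For illustration, a segment response might be assembled like this. This is a sketch only: the field names (`asset`, `start`, `end`, `score`), the `format_segment_hits` helper, and the sample metadata are assumptions standing in for whatever `artifacts/meta_segments.jsonl` actually contains.

```python
from typing import Dict, List

def format_segment_hits(ids: List[int], scores: List[float],
                        meta: List[Dict]) -> List[Dict]:
    """Shape raw index hits (row ids + similarities) into a JSON-friendly
    list of segment records, one per hit, in ranked order."""
    return [
        {
            "asset": meta[i]["asset"],
            "start": meta[i]["start"],
            "end": meta[i]["end"],
            "score": round(float(s), 4),
        }
        for i, s in zip(ids, scores)
    ]

# Fake metadata: one record per segment vector, mirroring meta_segments.jsonl.
meta = [
    {"asset": "bunny.mp4", "start": 0.0, "end": 30.0},
    {"asset": "bunny.mp4", "start": 30.0, "end": 60.0},
    {"asset": "robots.mp4", "start": 0.0, "end": 30.0},
]
hits = format_segment_hits(ids=[2, 0], scores=[0.71, 0.55], meta=meta)
```

The key point is that the Faiss index only returns row ids and scores; the metadata sidecar file is what maps each row back to an asset and a timestamp range.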
- `src/util/` - ffmpeg, preprocessing, whisper
- `src/embed/` - CLIP embedder
- `src/index/` - builders for asset and segment indexes
- `src/serve/app.py` - FastAPI service (healthz, whole-asset search, segment search)
- `apps/` - Streamlit viewer
- `artifacts/` - frames, audio, transcripts, meta files, Faiss indexes
- `docs/` - screenshots, contact sheets, GIFs, latency chart
- `alpha` (vision vs. text): higher favors visuals; lower favors transcript
- `seg_seconds`: 15–60 s windows (smaller = sharper matches, larger index)
- `frame_step`: 2–3 to subsample frames inside windows for speed
- CLIP backbone: ViT-B/32 (faster) ↔ ViT-L/14 (higher quality)
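How `seg_seconds` and `frame_step` interact can be seen in a small windowing sketch. The `segment_windows` helper is hypothetical, assuming 1 fps frames as in the pipeline above, so frame indices coincide with whole seconds.

```python
from typing import Iterator, List, Tuple

def segment_windows(duration_s: float, seg_seconds: int = 30,
                    frame_step: int = 2) -> Iterator[Tuple[float, float, List[int]]]:
    """Yield (start, end, frame_indices) per window.
    frame_step subsamples the frames inside each window to cut embedding cost;
    the final window is clipped to the video duration."""
    start = 0.0
    while start < duration_s:
        end = min(start + seg_seconds, duration_s)
        frames = list(range(int(start), int(end), frame_step))
        yield start, end, frames
        start = end

# A 75 s video with 30 s windows and every 3rd frame:
wins = list(segment_windows(75.0, seg_seconds=30, frame_step=3))
```

Smaller `seg_seconds` yields more windows (sharper matches but a larger index), while a larger `frame_step` embeds fewer frames per window, trading recall for speed.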


