A tiny media search engine. It ingests videos, extracts 1 fps frames and audio, transcribes speech to text, embeds visuals and text, builds a vector index, and serves semantic search.
Bonus: segment-level retrieval (30-second windows) and a Streamlit viewer with thumbnails, jump-to-time, and GIF export.
Top segments (GIFs)
- HTTP docs page: https://<PUBLIC_API_URL>/docs
- Segment viewer: https://<PUBLIC_VIEWER_URL>:8501
On Lightning Studio, set ports 8000 (service) and 8501 (viewer) to Public.
- Preprocess: ffmpeg writes frames to `artifacts/frames/<asset>/*.jpg` and mono 16 kHz audio.
- Speech-to-text: Whisper outputs `artifacts/transcripts/<asset>.json`.
- Embeddings: OpenCLIP image (frames) and OpenCLIP text (transcript), fused by weight `alpha`, then L2-normalized.
- Indexing: Faiss inner-product (cosine on normalized vectors).
  - Asset level: one vector per video → `artifacts/faiss_hnsw.index`, `artifacts/meta.jsonl`.
  - Segment level: 30 s windows → `artifacts/faiss_segments.index`, `artifacts/meta_segments.jsonl`.
- Serving: FastAPI endpoints for whole assets and segments.
- Viewing: Streamlit app to type a query, preview top segments, jump to timestamps, and export GIFs.
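The fuse-and-normalize step above can be sketched in a few lines. This is a minimal sketch, not the project's actual code: the array shapes, the mean-pooling of frame embeddings, and the `alpha` default are illustrative assumptions, and plain NumPy stands in for `faiss.IndexFlatIP`, which computes the same inner product on unit vectors.

```python
import numpy as np

def fuse(frame_vecs: np.ndarray, text_vec: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Blend pooled frame embeddings with the transcript embedding,
    then L2-normalize so inner product equals cosine similarity."""
    visual = frame_vecs.mean(axis=0)              # pool the 1 fps frame embeddings
    fused = alpha * visual + (1.0 - alpha) * text_vec
    return fused / np.linalg.norm(fused)

rng = np.random.default_rng(0)
dim = 512                                         # ViT-B/32 output size
db = np.stack([
    fuse(rng.normal(size=(30, dim)), rng.normal(size=dim))
    for _ in range(4)                             # 4 fake "videos"
])

q = db[2]                                         # query with a known vector
scores = db @ q                                   # what faiss.IndexFlatIP computes
best = int(np.argmax(scores))                     # the query matches itself
```

Because every stored vector is unit length, the inner-product index returns cosine similarity directly, which is why the pipeline normalizes after fusion.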
Run the HTTP service:

```bash
export PYTHONPATH=.
uvicorn src.serve.app:app --host 0.0.0.0 --port 8000
# then open http://127.0.0.1:8000/docs (or your Studio public URL)
```
Send search requests:

```bash
curl -s http://127.0.0.1:8000/healthz | python -m json.tool
curl -sG --data-urlencode 'q=bunny in the forest' http://127.0.0.1:8000/query | python -m json.tool
curl -sG --data-urlencode 'q=spaceship with robots' http://127.0.0.1:8000/query_segments | python -m json.tool
```
Run the viewer:

```bash
streamlit run apps/segments_viewer.py --server.address 0.0.0.0 --server.port 8501
```
- GET `/healthz` → `{ ok, assets, asset_index_ntotal, segments, segment_index_ntotal }`
- GET `/query?q=...&k=...` → top-K whole-asset matches with similarity scores
- GET `/query_segments?q=...&k=...` → top-K segment matches with start/end timestamps
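For illustration, a segment response might be assembled like this. This is a sketch only: the field names (`asset`, `start`, `end`, `score`), the `format_segment_hits` helper, and the sample metadata are assumptions standing in for whatever `artifacts/meta_segments.jsonl` actually contains.

```python
from typing import Dict, List

def format_segment_hits(ids: List[int], scores: List[float],
                        meta: List[Dict]) -> List[Dict]:
    """Shape raw index hits (row ids + similarities) into a JSON-friendly
    list of segment records, one per hit, in ranked order."""
    return [
        {
            "asset": meta[i]["asset"],
            "start": meta[i]["start"],
            "end": meta[i]["end"],
            "score": round(float(s), 4),
        }
        for i, s in zip(ids, scores)
    ]

# Fake metadata: one record per segment vector, mirroring meta_segments.jsonl.
meta = [
    {"asset": "bunny.mp4", "start": 0.0, "end": 30.0},
    {"asset": "bunny.mp4", "start": 30.0, "end": 60.0},
    {"asset": "robots.mp4", "start": 0.0, "end": 30.0},
]
hits = format_segment_hits(ids=[2, 0], scores=[0.71, 0.55], meta=meta)
```

The key point is that the Faiss index only returns row ids and scores; the metadata sidecar file is what maps each row back to an asset and a timestamp range.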
- `src/util/` - ffmpeg, preprocessing, whisper
- `src/embed/` - CLIP embedder
- `src/index/` - builders for asset and segment indexes
- `src/serve/app.py` - FastAPI service (healthz, whole-asset search, segment search)
- `apps/` - Streamlit viewer
- `artifacts/` - frames, audio, transcripts, meta files, Faiss indexes
- `docs/` - screenshots, contact sheets, GIFs, latency chart
- `alpha` (vision vs. text): higher favors visuals; lower favors transcript
- `seg_seconds`: 15–60 s windows (smaller = sharper matches, larger index)
- `frame_step`: 2–3 to subsample frames inside windows for speed
- CLIP backbone: ViT-B/32 (faster) ↔ ViT-L/14 (higher quality)
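How `seg_seconds` and `frame_step` interact can be seen in a small windowing sketch. The `segment_windows` helper is hypothetical, assuming 1 fps frames as in the pipeline above, so frame indices coincide with whole seconds.

```python
from typing import Iterator, List, Tuple

def segment_windows(duration_s: float, seg_seconds: int = 30,
                    frame_step: int = 2) -> Iterator[Tuple[float, float, List[int]]]:
    """Yield (start, end, frame_indices) per window.
    frame_step subsamples the frames inside each window to cut embedding cost;
    the final window is clipped to the video duration."""
    start = 0.0
    while start < duration_s:
        end = min(start + seg_seconds, duration_s)
        frames = list(range(int(start), int(end), frame_step))
        yield start, end, frames
        start = end

# A 75 s video with 30 s windows and every 3rd frame:
wins = list(segment_windows(75.0, seg_seconds=30, frame_step=3))
```

Smaller `seg_seconds` yields more windows (sharper matches but a larger index), while a larger `frame_step` embeds fewer frames per window, trading recall for speed.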


