A fully local, privacy-first app that helps you semantically search through videos using natural language. Simply upload a video, and we’ll do the heavy lifting to process frames, generate descriptions using AI, embed them, and allow fast search.
- 🔍 Semantic search: Find scenes by describing them in plain English
- 📤 Drag-and-drop video upload
- 🖼️ Instant timestamps & thumbnails for search results
- ⚡ Real-time resumable processing updates
- 🔒 100% local, privacy-first architecture
- Land on a clean dashboard with a sidebar listing all uploaded videos.
- Browse or search through your video collection easily.
- Drag and drop a video file for processing.
- The backend kicks off a background job to:
  - Extract frames using `ffmpeg`
  - Generate descriptions via LLaVA
  - Generate vector embeddings of those descriptions for semantic search
- Real-time progress updates:
  - Updates are streamed live via WebSockets.
  - Progress persists across reloads, using Redis Pub/Sub for state sync.
- Enter a query like "Where is the elephant?" or "Chef chopping onions"
- The app performs a vector similarity search against frame captions.
- You’ll get timestamps + thumbnails of the best matching moments in the video.
Video Query AI follows a modular architecture built with:
- Frontend: React + TypeScript with Vite and React Router
- Backend: FastAPI serving REST and WebSocket endpoints
- Job Queue: Redis + RQ for background processing
- Embedding Store: ChromaDB for vector search
- Real-time Updates: WebSockets with Redis Pub/Sub for progress tracking and resumable streams
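
To make the wiring concrete, here is a minimal sketch of how the upload endpoint could hand work off to the queue. The route, file paths, and the `process_video` worker function are illustrative assumptions, not the project's exact code:

```python
# Illustrative sketch only -- route, paths, and worker name are assumptions.
from pathlib import Path

from fastapi import FastAPI, File, UploadFile
from redis import Redis
from rq import Queue

from worker import process_video  # hypothetical background job function

app = FastAPI()
queue = Queue("video-processing", connection=Redis())

UPLOAD_DIR = Path("videos")
UPLOAD_DIR.mkdir(exist_ok=True)


@app.post("/videos")
async def upload_video(file: UploadFile = File(...)):
    # Save the raw file to disk; metadata also goes into ChromaDB in the real flow.
    dest = UPLOAD_DIR / file.filename
    dest.write_bytes(await file.read())

    # Queue the processing job; an RQ worker picks it up asynchronously.
    job = queue.enqueue(process_video, str(dest))
    return {"video": file.filename, "job_id": job.id}
```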
Here’s what happens behind the scenes when you upload a video:
- Upload
  - The file is saved to disk and its metadata is stored in ChromaDB.
- Job Queuing
  - A video processing job is pushed to a Redis queue and handled asynchronously by an RQ worker.
- Frame Extraction
  - Frames are extracted from the video using `ffmpeg`.
- Frame Analysis
  - Each frame is sent through LLaVA (via Ollama) to generate a description.
  - The description is embedded into a vector using a sentence transformer.
- Storage
  - Vector embeddings + metadata are stored in ChromaDB.
- Progress Updates
  - Real-time progress is sent to the frontend via WebSockets + Redis Pub/Sub.
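
Putting these steps together, an RQ worker for this pipeline might look roughly like the sketch below. The model names (`llava`, `all-MiniLM-L6-v2`), the one-frame-per-second sampling rate, and the Redis key/channel names are assumptions for illustration:

```python
# Illustrative sketch of the processing job -- names and parameters are assumptions.
import json
import subprocess
from pathlib import Path

import chromadb
import ollama
from redis import Redis
from sentence_transformers import SentenceTransformer

redis_conn = Redis()
chroma = chromadb.PersistentClient(path="chroma_db")
frames = chroma.get_or_create_collection("frames")
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model


def process_video(video_path: str) -> None:
    video_id = Path(video_path).stem
    frame_dir = Path("frames") / video_id
    frame_dir.mkdir(parents=True, exist_ok=True)

    # 1. Frame extraction: sample one frame per second with ffmpeg.
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-vf", "fps=1", str(frame_dir / "%05d.jpg")],
        check=True,
    )
    frame_paths = sorted(frame_dir.glob("*.jpg"))

    for i, frame in enumerate(frame_paths):
        # 2. Frame analysis: ask LLaVA (served by Ollama) for a caption.
        resp = ollama.generate(
            model="llava",
            prompt="Describe this image in one or two sentences.",
            images=[str(frame)],
        )
        caption = resp["response"]

        # 3. Embedding + storage: vectorize the caption and store it in ChromaDB
        #    along with the metadata the search results need.
        embedding = embedder.encode(caption).tolist()
        frames.add(
            ids=[f"{video_id}:{i}"],
            embeddings=[embedding],
            documents=[caption],
            metadatas=[{
                "video_id": video_id,
                "timestamp": float(i),  # fps=1, so frame index ~= seconds
                "thumbnail": str(frame),
            }],
        )

        # 4. Progress updates: persist state for resumability, broadcast live.
        progress = {"done": i + 1, "total": len(frame_paths)}
        redis_conn.hset(f"job:{video_id}", mapping=progress)
        redis_conn.publish(f"progress:{video_id}", json.dumps(progress))
```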
Users can search across:
- All uploaded videos
- A single selected video
When a query is made:
- The backend embeds the query using the same embedding model.
- A vector similarity search is performed in ChromaDB.
- Top 10 closest matches (timestamps + thumbnails) are returned.
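
A hedged sketch of that query path, assuming a `/search` route and the same illustrative collection and model names as above:

```python
# Illustrative sketch of the query flow -- route and field names are assumptions.
from typing import Optional

import chromadb
from fastapi import FastAPI
from sentence_transformers import SentenceTransformer

app = FastAPI()
chroma = chromadb.PersistentClient(path="chroma_db")
frames = chroma.get_or_create_collection("frames")
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # must match the indexing model


@app.get("/search")
def search(q: str, video_id: Optional[str] = None):
    # Embed the query with the same model used for the frame captions.
    query_embedding = embedder.encode(q).tolist()

    # Optionally restrict the similarity search to one selected video.
    where = {"video_id": video_id} if video_id else None

    # Top 10 nearest captions; metadata carries the timestamp and thumbnail path.
    results = frames.query(
        query_embeddings=[query_embedding],
        n_results=10,
        where=where,
    )
    return {
        "results": [
            {"timestamp": meta["timestamp"], "thumbnail": meta["thumbnail"],
             "caption": doc, "video_id": meta["video_id"]}
            for meta, doc in zip(results["metadatas"][0], results["documents"][0])
        ]
    }
```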
Even if the user refreshes the page mid-processing:
- The frontend reconnects via WebSocket.
- The backend reads the current job state from Redis and resumes updates seamlessly.
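
A minimal sketch of how that reconnect could work on the backend, assuming the worker persists progress in a Redis hash and publishes live updates on a Pub/Sub channel (key and channel names are illustrative):

```python
# Illustrative sketch of resumable progress streaming -- key/channel names are assumptions.
import json

from fastapi import FastAPI, WebSocket
from redis.asyncio import Redis

app = FastAPI()


@app.websocket("/ws/progress/{video_id}")
async def progress_stream(websocket: WebSocket, video_id: str):
    await websocket.accept()
    redis_conn = Redis(decode_responses=True)

    # On (re)connect, replay the last state the worker persisted to Redis,
    # so a refreshed page immediately shows where processing left off.
    state = await redis_conn.hgetall(f"job:{video_id}")
    if state:
        await websocket.send_json({k: int(v) for k, v in state.items()})

    # Then stream live updates from the Pub/Sub channel the worker publishes to.
    # (Production code would also handle client disconnects and clean up.)
    pubsub = redis_conn.pubsub()
    await pubsub.subscribe(f"progress:{video_id}")
    async for message in pubsub.listen():
        if message["type"] == "message":
            await websocket.send_json(json.loads(message["data"]))
```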