Egocentric videos are long and unstructured, which makes retrieving specific information from them challenging. This project extends natural-language temporal localization by also generating textual answers from the localized video segments, so queries can be answered directly instead of only returning a time span, and the model's predictions become easier to interpret.
✔ Two Model Architectures
- VSLBase (a simplified baseline) and VSLNet (adds Query-Guided Highlighting; see the sketch after this list)
- Supports both Omnivore and EgoVLP pre-extracted video features
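A minimal sketch of the Query-Guided Highlighting idea behind VSLNet, assuming pre-extracted clip features and a pooled query vector; the module name, tensor shapes, and layer choices here are illustrative assumptions, not the exact implementation in this repo:

```python
import torch
import torch.nn as nn

class QueryGuidedHighlight(nn.Module):
    """Scores each video feature by its relevance to the query, then
    re-weights the features so later layers focus on highlighted moments.
    Illustrative sketch; dimensions and layers are assumptions."""

    def __init__(self, dim: int):
        super().__init__()
        # Projects [video_feature ; query_summary] -> scalar highlight score
        self.scorer = nn.Linear(2 * dim, 1)

    def forward(self, video: torch.Tensor, query: torch.Tensor):
        # video: (batch, time, dim) clip features (e.g. Omnivore or EgoVLP)
        # query: (batch, dim) pooled sentence representation of the query
        q = query.unsqueeze(1).expand_as(video)  # broadcast query over time
        scores = torch.sigmoid(self.scorer(torch.cat([video, q], dim=-1)))
        highlighted = video * scores             # suppress irrelevant moments
        return highlighted, scores.squeeze(-1)   # features + per-step scores
```

VSLBase omits this module and passes the video features through unweighted, which is what makes it the simpler of the two architectures.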
✔ End-to-End Pipeline
- Localizes relevant video segments
- Generates textual answers
- Evaluates generated answers with ROUGE and METEOR metrics (see the sketch below)
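To illustrate the evaluation step, here is a minimal sketch using the Hugging Face `evaluate` library as one possible backend (this repo may compute the metrics differently); the prediction and reference strings are hypothetical:

```python
import evaluate  # pip install evaluate rouge_score nltk

# Hypothetical example: generated answers vs. ground-truth answers.
predictions = ["the keys are on the kitchen counter"]
references = ["the keys were left on the kitchen counter"]

rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")  # requires NLTK's WordNet data

print(rouge.compute(predictions=predictions, references=references))
# -> {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}
print(meteor.compute(predictions=predictions, references=references))
# -> {'meteor': ...}
```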
✔ Optimized for Egocentric Videos
- Handles long, unstructured first-person recordings
- Focuses computational resources on the localized key moments rather than the full video (see the sketch below)
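A sketch of how localization narrows the work for answer generation: only the features inside the predicted span are passed downstream. The span-to-index conversion and the `answer_generator` call are hypothetical placeholders, not this repo's API:

```python
import torch

def focus_on_span(features: torch.Tensor, start_s: float, end_s: float,
                  fps: float) -> torch.Tensor:
    """Keep only the clip features inside the predicted [start_s, end_s] span.
    features: (time, dim) pre-extracted features sampled at `fps` per second."""
    lo = max(0, int(start_s * fps))
    hi = min(features.size(0), int(end_s * fps) + 1)
    return features[lo:hi]

# Hypothetical usage: localize first, then answer from the short segment only.
# segment = focus_on_span(video_features, start_s=12.0, end_s=18.5, fps=1.87)
# answer = answer_generator(segment, query)  # placeholder for the generator
```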