This repository contains the official implementation for the paper: "From Moments to Meanings: Egocentric Temporal Localization and Answer Generation with VSLNet and Video-LLaVA on Ego4D".
Authors: Arian Mohammadi, Hassine El Ghazel, Hadi Abdallah, Ali Ayoub.
Egocentric vision focuses on understanding videos captured from a first-person perspective, posing unique challenges such as rapid camera motion and a restricted field of view. The Ego4D dataset provides a comprehensive benchmark for this domain through the Natural Language Queries (NLQ) task, which requires retrieving relevant video segments based on textual queries.
In this work, we evaluate the Video Span Localizing Network (VSLNet) and its simplified variant, VSLBase, under various configurations of video and text feature extractors—including Omnivore and EgoVLP for visual features, and GloVe and BERT for language embeddings. To extend egocentric video understanding beyond temporal localization, we propose a two-stage pipeline that combines segment retrieval with Video-LLaVA for natural language answer generation, enabling both fine-grained localization and semantic reasoning. We then evaluate how well Video-LLaVA performs compared to other multimodal natural language models.
Our core contribution is a two-stage pipeline that integrates temporal localization with generative question answering:
- Stage 1: Temporal Localization with VSLNet
- Given a long, untrimmed egocentric video and a natural language query, we use VSLNet to predict the start and end timestamps of the most relevant segment.
- We experiment with different feature extractors to find the optimal combination for this task.
- Stage 2: Answer Generation with Video-LLaVA
- The best 200 localized video segments from Stage 1 are fed into Video-LLaVA.
- This powerful multimodal model then generates a descriptive, human-like answer to the initial query based on the content of the clip.
- Comparison of Video-LLaVA with other commercial multimodal models.

A visual representation of our two-stage approach.
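The two stages can be sketched as a simple composition: localize a span, then answer over the trimmed clip. The function names below (`localize`, `generate_answer`, `answer_query`) are illustrative placeholders, not the repository's actual API; the model calls are stubbed out.

```python
# Illustrative sketch of the two-stage pipeline (names and return values
# are placeholders, not the repository's actual API).

def localize(video_id: str, query: str) -> tuple[float, float]:
    """Stage 1 stub: VSLNet would predict (start_sec, end_sec) here."""
    return (12.0, 18.5)  # placeholder prediction

def generate_answer(video_id: str, span: tuple[float, float], query: str) -> str:
    """Stage 2 stub: Video-LLaVA would answer over the trimmed clip here."""
    start, end = span
    return f"Answer for '{query}' from {video_id}[{start:.1f}s-{end:.1f}s]"

def answer_query(video_id: str, query: str) -> str:
    """Glue: Stage 1 output feeds directly into Stage 2."""
    span = localize(video_id, query)
    return generate_answer(video_id, span, query)

print(answer_query("clip_0001", "Where did I put the keys?"))
```

In the real pipeline, Stage 1 runs over precomputed video features and Stage 2 receives the decoded frames of the predicted span.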
- Robust Temporal Localization: Implementation of VSLNet and VSLBase for the Ego4D NLQ task.
- Modular Feature Extraction: Easily configurable backbones, supporting:
- Visual Features: Omnivore, EgoVLP
- Text Features: GloVe, BERT
- Generative Q&A: Integration with Video-LLaVA to move beyond simple localization and provide semantic understanding.
- Comprehensive Evaluation: Detailed analysis of various model and feature combinations on the Ego4D benchmark.
This project uses the Ego4D dataset, specifically focusing on the Natural Language Queries (NLQ) challenge.
You will need to download the dataset, including the videos and annotations, from the official website: ego4d-data.org.
Place the downloaded data in a data/ directory or update the paths in the configuration files accordingly.
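A quick sanity check like the following can confirm the layout before training; the sub-directory names `videos/` and `annotations/` are assumptions, so adjust them to whatever paths your configuration files use.

```python
# Minimal sanity check for the expected data layout. The sub-directory
# names ("videos", "annotations") are assumptions -- edit them to match
# the paths referenced in your configuration files.
from pathlib import Path

def check_data_layout(root: str = "data") -> list[str]:
    """Return the expected sub-directories that are missing under root."""
    expected = ["videos", "annotations"]
    base = Path(root)
    return [name for name in expected if not (base / name).is_dir()]

missing = check_data_layout("data")
if missing:
    print(f"Missing under data/: {missing}")
```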
Our experiments evaluate the performance of both stages of our pipeline: Temporal Localization and Answer Generation.
We tested various combinations of visual (EgoVLP, Omnivore) and textual (BERT, GloVe) feature extractors with VSLNet and its simpler variant, VSLBase.
Key Findings:
- BERT vs. GloVe: Across all model configurations, using BERT embeddings consistently and significantly outperforms GloVe. This highlights the importance of contextualized word embeddings for understanding natural language queries in this task.
- EgoVLP vs. Omnivore: The EgoVLP visual features consistently yield better results than Omnivore, demonstrating its superior capability in capturing the nuances of egocentric video.
- Best Performing Model: The combination of VSLNet + EgoVLP with BERT embeddings achieves the highest scores across all metrics, making it our top-performing model for temporal localization.
Below is a summary of our results:
| Model | Embedding | Rank1@0.3 | Rank1@0.5 | Rank3@0.5 | mIoU |
|---|---|---|---|---|---|
| EgoVLP + VSLNet | BERT | 8.57 | 5.16 | 9.09 | 6.65 |
| EgoVLP + VSLNet | GloVe | 5.24 | 3.28 | 6.04 | 4.32 |
| EgoVLP + VSLBase | BERT | 6.12 | 3.87 | 6.50 | 4.98 |
| EgoVLP + VSLBase | GloVe | 4.78 | 2.97 | 5.21 | 3.71 |
| Omnivore + VSLNet | BERT | 6.43 | 3.74 | 6.38 | 4.96 |
| Omnivore + VSLNet | GloVe | 4.21 | 2.27 | 4.49 | 3.52 |
| Omnivore + VSLBase | BERT | 5.50 | 3.33 | 6.09 | 4.65 |
| Omnivore + VSLBase | GloVe | 3.51 | 1.81 | 3.77 | 3.05 |
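The Rank{k}@{θ} and mIoU metrics in the table above are built on temporal intersection-over-union between predicted and ground-truth spans. The sketch below shows that core computation; it is a minimal illustration, not the official Ego4D NLQ evaluation script.

```python
# Temporal IoU underlying the Rank{k}@{theta} and mIoU metrics.
# A prediction is correct at threshold theta if one of its top-k
# spans overlaps the ground-truth span with IoU >= theta.

def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Intersection-over-union of two [start, end] intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def rank_k_at_theta(preds: list[tuple[float, float]],
                    gt: tuple[float, float],
                    k: int, theta: float) -> bool:
    """True if any of the top-k predicted spans reaches IoU >= theta."""
    return any(temporal_iou(p, gt) >= theta for p in preds[:k])

print(temporal_iou((2.0, 6.0), (4.0, 8.0)))  # 2 / 6 ≈ 0.333
```

mIoU is then the mean of `temporal_iou` over all queries, using each query's top-1 prediction.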
For the second stage, we evaluated the quality of answers generated by different models. We compared Video-LLaVA against several strong baselines, including variants of Google's Gemini (1.5 Flash, 1.5 Pro, 2.5 Flash, 2.5 Pro, abbreviated G1.5F, G1.5P, G2.5F, G2.5P) and GPT-4o.
Key Findings:
- The models show varied performance across different metrics.
- The G1.5P variant achieves the highest BLEU-3, BLEU-4, ROUGE-1, ROUGE-L, and ROUGE-Lsum scores, indicating stronger recall and higher-order n-gram overlap with ground-truth answers.
- G2.5P achieves the highest BLEU-1, BLEU-2, ROUGE-2, and BERTScore F1 scores.
- Our integrated Video-LLaVA model demonstrates a strong balance, achieving the highest BERTScore Precision, which measures semantic similarity, indicating that its generated answers are contextually very relevant.
| Metric | Variant | LLaVA | G1.5F | G1.5P | G2.5F | G2.5P | GPT-4o |
|---|---|---|---|---|---|---|---|
| BLEU | BLEU-1 | 0.2558 | 0.2684 | 0.3115 | 0.2889 | 0.3256 | 0.2785 |
| | BLEU-2 | 0.1607 | 0.1677 | 0.1992 | 0.1741 | 0.2058 | 0.1691 |
| | BLEU-3 | 0.0901 | 0.1023 | 0.1214 | 0.0995 | 0.1191 | 0.0955 |
| | BLEU-4 | 0.0333 | 0.0419 | 0.0524 | 0.0449 | 0.0507 | 0.0434 |
| ROUGE | R-1 | 0.2825 | 0.2852 | 0.3449 | 0.3097 | 0.3420 | 0.3073 |
| | R-2 | 0.1280 | 0.1047 | 0.1373 | 0.1065 | 0.1442 | 0.1047 |
| | R-L | 0.2799 | 0.2819 | 0.3439 | 0.3080 | 0.3359 | 0.3030 |
| | R-Lsum | 0.2800 | 0.2798 | 0.3456 | 0.3062 | 0.3374 | 0.3031 |
| BERTScore | Precision | 0.8964 | 0.8871 | 0.8906 | 0.8859 | 0.8938 | 0.8824 |
| | Recall | 0.8851 | 0.8923 | 0.8996 | 0.8910 | 0.8975 | 0.8952 |
| | F1 | 0.8904 | 0.8893 | 0.8946 | 0.8880 | 0.8952 | 0.8883 |
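To make the BLEU-1 column concrete, the toy sketch below computes clipped unigram precision, which is the core of BLEU-1. The reported scores additionally use smoothing, a brevity penalty, and corpus-level aggregation, so this is an illustration rather than the evaluation code we used.

```python
# Toy illustration of the clipped unigram precision at the core of
# BLEU-1. Real BLEU adds smoothing, a brevity penalty, and
# corpus-level aggregation; this shows only the counting step.
from collections import Counter

def bleu1_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision of candidate against one reference."""
    cand = candidate.lower().split()
    if not cand:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    # Each candidate word is credited at most as often as it appears
    # in the reference ("clipping").
    clipped = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    return clipped / len(cand)

print(bleu1_precision("the person picks up the cup",
                      "the person picked up a cup"))  # 4/6 ≈ 0.667
```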
This project builds upon the fantastic work of several research teams. We would like to thank:
- The creators of the Ego4D dataset.
- The authors of [VSLNet](https://github.com/26hzhang/VSLNet).
- The developers of Video-LLaVA.
- The teams behind Omnivore, EgoVLP, BERT, and GloVe.