This repository contains the official implementation for the paper: "From Moments to Meanings: Egocentric Temporal Localization and Answer Generation with VSLNet and Video-LLaVA on Ego4D".
Authors: Arian Mohammadi, Hassine El Ghazel, Hadi Abdallah, Ali Ayoub.
Egocentric vision focuses on understanding videos captured from a first-person perspective, posing unique challenges such as rapid camera motion and a restricted field of view. The Ego4D dataset provides a comprehensive benchmark for this domain through the Natural Language Queries (NLQ) task, which requires retrieving relevant video segments based on textual queries.
In this work, we evaluate the Video Span Localizing Network (VSLNet) and its simplified variant, VSLBase, under various configurations of video and text feature extractors—including Omnivore and EgoVLP for visual features, and GloVe and BERT for language embeddings. To extend egocentric video understanding beyond temporal localization, we propose a two-stage pipeline that combines segment retrieval with Video-LLaVA for natural language answer generation, enabling both fine-grained localization and semantic reasoning. We then evaluate how well Video-LLaVA performs compared to other multimodal natural language models.
Our core contribution is a two-stage pipeline that integrates temporal localization with generative question answering:
- Stage 1: Temporal Localization with VSLNet
- Given a long, untrimmed egocentric video and a natural language query, we use VSLNet to predict the start and end timestamps of the most relevant segment.
- We experiment with different feature extractors to find the optimal combination for this task.
- Stage 2: Answer Generation with Video-LLaVA
- The best 200 localized video segments from Stage 1 are fed into Video-LLaVA.
- This powerful multimodal model then generates a descriptive, human-like answer to the initial query based on the content of the clip.
- Comparison of Video-LLaVA with other commercial multimodal models.

A visual representation of our two-stage approach.
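The two stages can be sketched as a simple composition: localize a span, then answer over the trimmed clip. The function names below (`localize`, `generate_answer`, `answer_query`) are illustrative placeholders, not the repository's actual API; the model calls are stubbed out.

```python
# Illustrative sketch of the two-stage pipeline (names and return values
# are placeholders, not the repository's actual API).

def localize(video_id: str, query: str) -> tuple[float, float]:
    """Stage 1 stub: VSLNet would predict (start_sec, end_sec) here."""
    return (12.0, 18.5)  # placeholder prediction

def generate_answer(video_id: str, span: tuple[float, float], query: str) -> str:
    """Stage 2 stub: Video-LLaVA would answer over the trimmed clip here."""
    start, end = span
    return f"Answer for '{query}' from {video_id}[{start:.1f}s-{end:.1f}s]"

def answer_query(video_id: str, query: str) -> str:
    """Glue: Stage 1 output feeds directly into Stage 2."""
    span = localize(video_id, query)
    return generate_answer(video_id, span, query)

print(answer_query("clip_0001", "Where did I put the keys?"))
```

In the real pipeline, Stage 1 runs over precomputed video features and Stage 2 receives the decoded frames of the predicted span.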
- Robust Temporal Localization: Implementation of VSLNet and VSLBase for the Ego4D NLQ task.
- Modular Feature Extraction: Easily configurable backbones, supporting:
- Visual Features: Omnivore, EgoVLP
- Text Features: GloVe, BERT
- Generative Q&A: Integration with Video-LLaVA to move beyond simple localization and provide semantic understanding.
- Comprehensive Evaluation: Detailed analysis of various model and feature combinations on the Ego4D benchmark.
This project uses the Ego4D dataset, specifically focusing on the Natural Language Queries (NLQ) challenge.
You will need to download the dataset, including the videos and annotations, from the official website: ego4d-data.org.
Place the downloaded data in a data/ directory or update the paths in the configuration files accordingly.
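A quick sanity check like the following can confirm the layout before training; the sub-directory names `videos/` and `annotations/` are assumptions, so adjust them to whatever paths your configuration files use.

```python
# Minimal sanity check for the expected data layout. The sub-directory
# names ("videos", "annotations") are assumptions -- edit them to match
# the paths referenced in your configuration files.
from pathlib import Path

def check_data_layout(root: str = "data") -> list[str]:
    """Return the expected sub-directories that are missing under root."""
    expected = ["videos", "annotations"]
    base = Path(root)
    return [name for name in expected if not (base / name).is_dir()]

missing = check_data_layout("data")
if missing:
    print(f"Missing under data/: {missing}")
```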
Our experiments evaluate the performance of both stages of our pipeline: Temporal Localization and Answer Generation.
We tested various combinations of visual (EgoVLP, Omnivore) and textual (BERT, GloVe) feature extractors with VSLNet and its simpler variant, VSLBase.
Key Findings:
- BERT vs. GloVe: Across all model configurations, using BERT embeddings consistently and significantly outperforms GloVe. This highlights the importance of contextualized word embeddings for understanding natural language queries in this task.
- EgoVLP vs. Omnivore: The EgoVLP visual features consistently yield better results than Omnivore, demonstrating its superior capability in capturing the nuances of egocentric video.
- Best Performing Model: The combination of VSLNet + EgoVLP with BERT embeddings achieves the highest scores across all metrics, making it our top-performing model for temporal localization.
Below is a summary of our results:
| Model | Embedding | Rank1@0.3 | Rank1@0.5 | Rank3@0.5 | mIoU |
|---|---|---|---|---|---|
| EgoVLP + VSLNet | BERT | 8.57 | 5.16 | 9.09 | 6.65 |
| EgoVLP + VSLNet | GloVe | 5.24 | 3.28 | 6.04 | 4.32 |
| EgoVLP + VSLBase | BERT | 6.12 | 3.87 | 6.50 | 4.98 |
| EgoVLP + VSLBase | GloVe | 4.78 | 2.97 | 5.21 | 3.71 |
| Omnivore + VSLNet | BERT | 6.43 | 3.74 | 6.38 | 4.96 |
| Omnivore + VSLNet | GloVe | 4.21 | 2.27 | 4.49 | 3.52 |
| Omnivore + VSLBase | BERT | 5.50 | 3.33 | 6.09 | 4.65 |
| Omnivore + VSLBase | GloVe | 3.51 | 1.81 | 3.77 | 3.05 |
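The Rank{k}@{θ} and mIoU metrics in the table above are built on temporal intersection-over-union between predicted and ground-truth spans. The sketch below shows that core computation; it is a minimal illustration, not the official Ego4D NLQ evaluation script.

```python
# Temporal IoU underlying the Rank{k}@{theta} and mIoU metrics.
# A prediction is correct at threshold theta if one of its top-k
# spans overlaps the ground-truth span with IoU >= theta.

def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Intersection-over-union of two [start, end] intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def rank_k_at_theta(preds: list[tuple[float, float]],
                    gt: tuple[float, float],
                    k: int, theta: float) -> bool:
    """True if any of the top-k predicted spans reaches IoU >= theta."""
    return any(temporal_iou(p, gt) >= theta for p in preds[:k])

print(temporal_iou((2.0, 6.0), (4.0, 8.0)))  # 2 / 6 ≈ 0.333
```

mIoU is then the mean of `temporal_iou` over all queries, using each query's top-1 prediction.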
For the second stage, we evaluated the quality of answers generated by different models. We compared Video-LLaVA against several strong baselines, including variants of Google's Gemini (1.5 Flash, 1.5 Pro, 2.5 Flash, 2.5 Pro, abbreviated G1.5F, G1.5P, G2.5F, G2.5P) and GPT-4o.
Key Findings:
- The models show varied performance across different metrics.
- The G1.5P variant achieves the highest BLEU-3, BLEU-4, ROUGE-1, ROUGE-L, and ROUGE-Lsum scores, indicating stronger recall and higher-order n-gram overlap with ground-truth answers.
- G2.5P achieves the highest BLEU-1, BLEU-2, ROUGE-2, and BERTScore F1 scores.
- Our integrated Video-LLaVA model demonstrates a strong balance, achieving the highest BERTScore Precision, which measures semantic similarity, indicating that its generated answers are contextually very relevant.
| Metric | Variant | LLaVA | G1.5F | G1.5P | G2.5F | G2.5P | GPT-4o |
|---|---|---|---|---|---|---|---|
| BLEU | BLEU-1 | 0.2558 | 0.2684 | 0.3115 | 0.2889 | 0.3256 | 0.2785 |
| | BLEU-2 | 0.1607 | 0.1677 | 0.1992 | 0.1741 | 0.2058 | 0.1691 |
| | BLEU-3 | 0.0901 | 0.1023 | 0.1214 | 0.0995 | 0.1191 | 0.0955 |
| | BLEU-4 | 0.0333 | 0.0419 | 0.0524 | 0.0449 | 0.0507 | 0.0434 |
| ROUGE | R-1 | 0.2825 | 0.2852 | 0.3449 | 0.3097 | 0.3420 | 0.3073 |
| | R-2 | 0.1280 | 0.1047 | 0.1373 | 0.1065 | 0.1442 | 0.1047 |
| | R-L | 0.2799 | 0.2819 | 0.3439 | 0.3080 | 0.3359 | 0.3030 |
| | R-Lsum | 0.2800 | 0.2798 | 0.3456 | 0.3062 | 0.3374 | 0.3031 |
| BERTScore | Precision | 0.8964 | 0.8871 | 0.8906 | 0.8859 | 0.8938 | 0.8824 |
| | Recall | 0.8851 | 0.8923 | 0.8996 | 0.8910 | 0.8975 | 0.8952 |
| | F1 | 0.8904 | 0.8893 | 0.8946 | 0.8880 | 0.8952 | 0.8883 |
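To make the BLEU-1 column concrete, the toy sketch below computes clipped unigram precision, which is the core of BLEU-1. The reported scores additionally use smoothing, a brevity penalty, and corpus-level aggregation, so this is an illustration rather than the evaluation code we used.

```python
# Toy illustration of the clipped unigram precision at the core of
# BLEU-1. Real BLEU adds smoothing, a brevity penalty, and
# corpus-level aggregation; this shows only the counting step.
from collections import Counter

def bleu1_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision of candidate against one reference."""
    cand = candidate.lower().split()
    if not cand:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    # Each candidate word is credited at most as often as it appears
    # in the reference ("clipping").
    clipped = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    return clipped / len(cand)

print(bleu1_precision("the person picks up the cup",
                      "the person picked up a cup"))  # 4/6 ≈ 0.667
```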
This project builds upon the fantastic work of several research teams. We would like to thank:
- The creators of the Ego4D dataset.
- The authors of [VSLNet](https://github.com/26hzhang/VSLNet).
- The developers of Video-LLaVA.
- The teams behind Omnivore, EgoVLP, BERT, and GloVe.