
ArrowGEV: Grounding Events in Video via Learning the Arrow of Time

We study the temporal directionality problem in grounding events in videos. Specifically, we enable Vision-Language Models to capture the intrinsic temporal structure of events by distinguishing time-sensitive from time-insensitive semantics. We use a reinforcement learning framework to optimize the model's policy and design a temporal-directionality reward that drives the model to discriminate event validity between forward and reversed videos.

(Figure: ArrowGEV overview)

Contents

  • Overview
  • Setup
  • Dataset
  • Training
  • Citation

Overview

ArrowGEV builds on Qwen2.5-VL and optimizes a policy with Group Relative Policy Optimization (GRPO) using a temporal directionality reward. For each training sample, we generate predictions on both the forward video and its reversed counterpart and compute:

  • An IoU reward that measures how well the predicted window matches the ground-truth window.
  • A temporal-directionality reward that penalizes windows that remain valid under time reversal for time-sensitive events, and rewards consistent windows for time-insensitive events. Sensitivity is taken directly from the pre-annotated sensitive field in the training data.
  • A format reward that enforces the <think>...</think><answer>start to end</answer> output structure.

The trainer is implemented in src/arrowgev/rl/arrowgev_trainer.py.
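The sketch below shows, in plain Python, how these three reward terms and GRPO's group-relative advantage could be computed. The function names and the exact directionality formula are assumptions for illustration; they are not the implementation in arrowgev_trainer.py.

import re

def iou_reward(pred, gt):
    # Temporal IoU between predicted and ground-truth (start, end) windows.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def directionality_reward(iou_forward, iou_reversed, sensitive):
    # Time-sensitive events should become invalid under reversal;
    # time-insensitive events should ground consistently in both directions.
    if sensitive:
        return iou_forward - iou_reversed
    return 1.0 - abs(iou_forward - iou_reversed)

def format_reward(completion):
    # 1.0 iff the output follows <think>...</think><answer>start to end</answer>.
    pattern = r"<think>.*?</think>\s*<answer>.*?\bto\b.*?</answer>"
    return 1.0 if re.fullmatch(pattern, completion.strip(), re.DOTALL) else 0.0

def group_relative_advantages(rewards):
    # GRPO: normalize each completion's reward within its sampling group.
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

Under GRPO, each sampled completion's total reward is normalized against its group's mean and standard deviation, so only relative quality within a group drives the policy update.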

Setup

Clone the repository and create a fresh environment:

git clone https://github.com/Yu-Fangxu/ArrowGEV.git
cd ArrowGEV

conda create -n ArrowGEV python=3.10.12 -y
conda activate ArrowGEV

pip install -r requirements.txt

Key pinned versions (CUDA 12.4): torch==2.6.0, transformers==4.51.1, vllm==0.8.4, trl==0.17.0, numba==0.61.2. These versions matter for both training and vLLM inference; see docs/INSTALL.md for details.
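An optional post-install sanity check against the pins above (the expected values in the comments follow from the requirements):

import torch
import transformers

print("torch:", torch.__version__)                 # expect 2.6.0
print("CUDA:", torch.version.cuda)                 # expect 12.4
print("transformers:", transformers.__version__)   # expect 4.51.1
print("GPU available:", torch.cuda.is_available())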

Dataset

Follow docs/DATA.md to download and organize the training data. The default layout expected by the scripts is:

dataset/
└── ArrowGEV/
    ├── annotations/train_2k5.json
    └── videos/arrowgev_data/

Annotations and videos are published as ParadiseYu/ArrowGEV-Data; they can also be reassembled from the original source datasets (VTG-IT, TimeIT, HTStep, LongVid).
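For orientation, the annotation file can be loaded as ordinary JSON. The record shape in the comment is a guess based on the reward description (in particular the sensitive field); it is not the published schema.

import json

with open("dataset/ArrowGEV/annotations/train_2k5.json") as f:
    samples = json.load(f)

# Hypothetical record shape (field names are illustrative only):
# {"video": "arrowgev_data/xxx.mp4",
#  "query": "the person opens the door",
#  "timestamps": [12.3, 18.7],
#  "sensitive": true}
print(len(samples), "training samples")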

Reversed-video copies are required for the temporal directionality reward. Generate them once with:

python reverse_video.py \
    --input_folder  dataset/ArrowGEV/videos/arrowgev_data \
    --output_folder dataset/ArrowGEV/videos/arrowgev_data
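reverse_video.py's internals aren't reproduced here; a minimal per-file equivalent using ffmpeg's reverse filter might look like the following. The _reversed suffix and the choice to drop audio are assumptions, not the script's actual behavior.

import subprocess
from pathlib import Path

def reverse_one(src: Path, out_dir: Path) -> Path:
    # Write a frame-reversed copy; drop audio (-an) since only the visual
    # arrow of time matters for the reward. Note that ffmpeg's `reverse`
    # filter buffers the whole clip in memory, so it suits short videos.
    dst = out_dir / f"{src.stem}_reversed{src.suffix}"
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src), "-vf", "reverse", "-an", str(dst)],
        check=True,
    )
    return dst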

Training

Launch GRPO post-training with:

bash scripts/posttrain/train.sh
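For reference, wiring custom reward functions into TRL's GRPOTrainer (trl==0.17.0) looks roughly like the sketch below. The actual configuration lives in scripts/posttrain/train.sh and src/arrowgev/rl/arrowgev_trainer.py; the model name, prompt, and zero-reward stub here are placeholders.

from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

def combined_reward(completions, **kwargs):
    # Placeholder: ArrowGEV combines IoU, temporal-directionality,
    # and format rewards per completion.
    return [0.0 for _ in completions]

train_dataset = Dataset.from_list(
    [{"prompt": "Locate the event: the person opens the door."}]
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-VL-7B-Instruct",  # placeholder checkpoint name
    reward_funcs=combined_reward,
    args=GRPOConfig(output_dir="checkpoints/arrowgev", num_generations=8),
    train_dataset=train_dataset,
)
trainer.train()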

Citation

If you find our work useful, please consider citing:

@article{yu2026arrowgev,
  title={ArrowGEV: Grounding Events in Video via Learning the Arrow of Time},
  author={Yu, Fangxu and Lu, Ziyao and Niu, Liqiang and Meng, Fandong and Zhou, Jie},
  journal={arXiv preprint arXiv:2601.06559},
  year={2026}
}

About

[Findings of ACL 2026] ArrowGEV: Grounding Events in Video via Learning the Arrow of Time
