We study the temporal directionality problem in Grounding Events in Videos. Specifically, we enable Vision-Language Models to capture the intrinsic temporal structure of events by distinguishing between time-sensitive and time-insensitive semantics. We use a reinforcement learning framework to optimize the model's policy, with a temporal directionality reward that teaches the model to discriminate event validity between forward and reversed videos.
ArrowGEV builds on Qwen2.5-VL and optimizes a policy with Group Relative Policy Optimization (GRPO) using a temporal directionality reward. For each training sample, we generate predictions on both the forward video and its reversed counterpart and compute:
- An IoU reward that measures how well the predicted window matches the ground-truth window.
- A temporal-directionality reward that penalizes windows that remain valid under time reversal for time-sensitive events, and rewards consistent windows for time-insensitive events. Sensitivity is taken directly from the pre-annotated `sensitive` field in the training data.
- A format reward that enforces the `<think>...</think><answer>start to end</answer>` output structure.
The trainer is implemented in src/arrowgev/rl/arrowgev_trainer.py.
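The reward combination above can be sketched as follows. This is a minimal illustration, not the trainer's actual code: the IoU helper, the timeline-remapping of the reversed prediction, and the exact consistency test are assumptions.

```python
import re

def iou(pred, gt):
    """Temporal IoU between two (start, end) windows in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def directionality_reward(fwd_pred, rev_pred, duration, sensitive):
    """Sketch of the temporal-directionality reward.

    fwd_pred / rev_pred: (start, end) windows predicted on the forward
    and on the reversed video; `sensitive` comes from the annotation.
    """
    # Map the reversed-video prediction back onto the forward timeline.
    rev_on_fwd = (duration - rev_pred[1], duration - rev_pred[0])
    overlap = iou(rev_on_fwd, fwd_pred)
    if sensitive:
        # A time-sensitive event should NOT stay valid under reversal:
        # penalize agreement between the two predictions.
        return 1.0 - overlap
    # A time-insensitive event should yield a consistent window.
    return overlap

def format_reward(text):
    """1.0 iff the output follows <think>...</think><answer>s to e</answer>."""
    pattern = r"<think>.*</think>\s*<answer>.*to.*</answer>"
    return 1.0 if re.fullmatch(pattern, text.strip(), re.DOTALL) else 0.0
```

The total reward would then be some weighted sum of `iou(fwd_pred, gt)`, the directionality term, and the format term; the weights are not specified here.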
Clone the repository and create a fresh environment:
```shell
git clone https://github.com/Yu-Fangxu/ArrowGEV.git
cd ArrowGEV
conda create -n ArrowGEV python=3.10.12 -y
conda activate ArrowGEV
pip install -r requirements.txt
```

Key pinned versions (CUDA 12.4): `torch==2.6.0`, `transformers==4.51.1`, `vllm==0.8.4`, `trl==0.17.0`, `numba==0.61.2`. See docs/INSTALL.md for details; these versions matter for both training and vLLM inference.
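A quick way to confirm the environment matches the pins is a small version check; the helper below is a sketch, and the expected versions simply restate the pins above.

```python
from importlib.metadata import version, PackageNotFoundError

def check_versions(expected):
    """Return {package: (wanted, installed, matches)} for each pin."""
    results = {}
    for pkg, want in expected.items():
        try:
            have = version(pkg)
        except PackageNotFoundError:
            have = "not installed"
        results[pkg] = (want, have, have == want)
    return results

# Pins from docs/INSTALL.md (CUDA 12.4).
PINS = {
    "torch": "2.6.0",
    "transformers": "4.51.1",
    "vllm": "0.8.4",
    "trl": "0.17.0",
    "numba": "0.61.2",
}

if __name__ == "__main__":
    for pkg, (want, have, ok) in check_versions(PINS).items():
        print(f"{pkg:14s} want {want:10s} have {have:14s} {'OK' if ok else 'MISMATCH'}")
```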
Follow docs/DATA.md to download and organize the training data. The default layout expected by the scripts is:
```
dataset/
└── ArrowGEV/
    ├── annotations/train_2k5.json
    └── videos/arrowgev_data/
```
Annotations and videos are published as ParadiseYu/ArrowGEV-Data, and can also be reassembled from the original source datasets (VTG-IT, TimeIT, HTStep, LongVid).
Reversed-video copies are required for the temporal directionality reward. Generate them once with:
```shell
python reverse_video.py \
    --input_folder dataset/ArrowGEV/videos/arrowgev_data \
    --output_folder dataset/ArrowGEV/videos/arrowgev_data
```

Then launch post-training:

```shell
bash scripts/posttrain/train.sh
```

If you find our work useful, please consider citing:
```bibtex
@article{yu2026arrowgev,
  title={ArrowGEV: Grounding Events in Video via Learning the Arrow of Time},
  author={Yu, Fangxu and Lu, Ziyao and Niu, Liqiang and Meng, Fandong and Zhou, Jie},
  journal={arXiv preprint arXiv:2601.06559},
  year={2026}
}
```