Abstract: Given a task and a set of steps composing it, Video Step Grounding (VSG) aims to detect which steps are performed in a video. Standard approaches for this task require a labeled training set (e.g., with step-level annotations or narrations), which may be costly to collect. Moreover, they process the full video offline, limiting their applications for scenarios requiring online decisions. Thus, in this work, we explore how to perform VSG online and without training. We achieve this by exploiting the zero-shot capabilities of recent Large Multimodal Models (LMMs). In particular, we use LMMs to predict the step associated with a restricted set of frames, without access to the whole video. We show that this online strategy without task-specific tuning outperforms offline and training-based models. Motivated by this finding, we develop Bayesian Grounding with Large Multimodal Models (BaGLM), further injecting knowledge of past frames into the LMM-based predictions. BaGLM exploits Bayesian filtering principles, modeling step transitions via (i) a dependency matrix extracted through large language models and (ii) an estimation of step progress. Experiments on three datasets show superior performance of BaGLM over state-of-the-art training-based offline methods.
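For intuition only, a discrete Bayesian filter alternates a prediction step, which propagates the belief over steps through a transition matrix, and an update step, which reweights the predicted belief by the per-step evidence from the LMM. The sketch below illustrates this generic recursion, not BaGLM's exact implementation; all names and shapes are assumptions.

```python
import numpy as np

def filter_update(belief, transition, evidence):
    """One predict/update cycle of a discrete Bayesian filter (illustrative sketch).

    belief:     (K,) current probability over the K steps
    transition: (K, K) row-stochastic matrix, transition[i, j] ~ P(next step j | current step i)
    evidence:   (K,) per-step scores from the LMM for the current frames
    """
    predicted = transition.T @ belief      # predict: propagate the belief through step transitions
    posterior = predicted * evidence       # update: reweight by the LMM-based observation scores
    return posterior / posterior.sum()     # renormalize into a probability distribution
```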
We recommend using a Linux machine with CUDA-compatible GPUs. All experiments were run on a single NVIDIA H100 64GB GPU, except for LLaMA3-70B-Instruct, which requires 4× H100 GPUs. We provide a Conda environment to configure the required libraries. The code was tested with Python 3.13.
Clone the repo with:
git clone https://github.com/lucazanella/baglm.git
cd baglm

Create the Conda environment and install the dependencies:

conda create -n baglm python=3.13 -y
conda activate baglm
pip install --upgrade pip setuptools wheel
# PyTorch + CUDA
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu126
pip install torchcodec==0.2.1+cu126 --index-url=https://download.pytorch.org/whl/cu126
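# (Optional check, not part of the original instructions) verify that PyTorch sees the GPU
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"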
# Other dependencies
pip install -r requirements.txt
python -m spacy download en_core_web_sm

BaGLM uses ffmpeg for GPU-accelerated video decoding. We recommend installing the prebuilt shared build from BtbN/FFmpeg-Builds.
# 1) Download
wget https://github.com/BtbN/FFmpeg-Builds/releases/download/latest/ffmpeg-n7.1-latest-linux64-gpl-shared-7.1.tar.xz
# 2) Uncompress and move it to $HOME/opt
mkdir -p $HOME/opt
tar -xf ffmpeg-n7.1-latest-linux64-gpl-shared-7.1.tar.xz
mv ffmpeg-n7.1-latest-linux64-gpl-shared-7.1 $HOME/opt/
# 3) Create wrappers so ffmpeg/ffprobe use the correct libraries
mkdir -p $HOME/.local/bin
tee $HOME/.local/bin/ffmpeg > /dev/null << 'EOF'
#!/bin/bash
LD_LIBRARY_PATH=$HOME/opt/ffmpeg-n7.1-latest-linux64-gpl-shared-7.1/lib:$LD_LIBRARY_PATH \
$HOME/opt/ffmpeg-n7.1-latest-linux64-gpl-shared-7.1/bin/ffmpeg "$@"
EOF
chmod +x $HOME/.local/bin/ffmpeg
tee $HOME/.local/bin/ffprobe > /dev/null << 'EOF'
#!/bin/bash
LD_LIBRARY_PATH=$HOME/opt/ffmpeg-n7.1-latest-linux64-gpl-shared-7.1/lib:$LD_LIBRARY_PATH \
$HOME/opt/ffmpeg-n7.1-latest-linux64-gpl-shared-7.1/bin/ffprobe "$@"
EOF
chmod +x $HOME/.local/bin/ffprobe
export PATH="$HOME/.local/bin:$PATH"
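# (Optional, not part of the original instructions) persist the PATH change for future shells
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc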
# 4) Set up Conda hooks
# Path to the baglm Conda environment; adjust if your Conda installation lives elsewhere
CONDA_PREFIX="$HOME/miniconda3/envs/baglm"
mkdir -p "$CONDA_PREFIX/etc/conda/activate.d" "$CONDA_PREFIX/etc/conda/deactivate.d"
tee "$CONDA_PREFIX/etc/conda/activate.d/env_vars.sh" > /dev/null << 'EOF'
export LD_LIBRARY_PATH=$HOME/opt/ffmpeg-n7.1-latest-linux64-gpl-shared-7.1/lib:$LD_LIBRARY_PATH
EOF
tee "$CONDA_PREFIX/etc/conda/deactivate.d/env_vars.sh" > /dev/null << 'EOF'
unset LD_LIBRARY_PATH
EOF
# 5) Reactivate your environment
conda deactivate && conda activate baglm

You can verify the installation with:
ffmpeg -version
ffprobe -version

Our method is implemented using InternVL2.5-8B as the large multimodal model (LMM). We also provide support for InternVL3-8B, LLaVA-OV-7B, and Qwen2.5-VL-7B.
Note: Models are downloaded automatically the first time you use them if your machine has internet access. For offline environments, download them manually from Hugging Face using snapshot_download.
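For example, a minimal offline download of the default InternVL2.5-8B checkpoint could look like the sketch below; the Hugging Face repository id is the public OpenGVLab one, and the local directory is an arbitrary choice, so adjust it to wherever your setup expects the weights.

```python
from huggingface_hub import snapshot_download

# Illustrative example: fetch the default LMM checkpoint for offline use.
# local_dir is an assumption; point it to wherever your setup expects the weights.
snapshot_download(
    repo_id="OpenGVLab/InternVL2_5-8B",
    local_dir="checkpoints/InternVL2_5-8B",
)
```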
We evaluate our method on three public datasets: CrossTask, HT-Step, and Ego4D Goal-Step. We also include support for the COIN dataset. Please refer to the official repositories for instructions on how to download the videos and annotations.
Note: For COIN, an example script to download videos from YouTube using yt-dlp is provided at src/download_videos_yt_dlp.py.
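If you want to fetch a single video manually, a basic yt-dlp call (illustrative only; the provided script may use different options and output paths) looks like:

yt-dlp --output "videos/%(id)s.%(ext)s" "https://www.youtube.com/watch?v=VIDEO_ID"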
Please download the data, including video step grounding scores, progress scores, and prerequisite probabilities, for the datasets below:
| Dataset | Link |
|---|---|
| HT-Step | Google Drive |
| CrossTask | Google Drive |
| Ego4D Goal-Step | Google Drive |
| COIN | Google Drive |
and place them in the respective dataset folders.
We provide scripts to generate predictions and to evaluate them with BaGLM in the scripts folder.
These scripts generate the LMM/LLM-based predictions: video step grounding scores, progress scores, and prerequisite probabilities. You can skip this step if you use the precomputed data provided in the Google Drive folders.
| Dataset | Prediction Type | Script |
|---|---|---|
| HT-Step | Video step grounding | htstep_run_lmm_vsg.sh |
| HT-Step | Progress | htstep_run_lmm_prog.sh |
| HT-Step | Prerequisite probabilities | htstep_run_llm_prereq.sh |
| CrossTask | Video step grounding | crosstask_run_lmm_vsg.sh |
| CrossTask | Progress | crosstask_run_lmm_prog.sh |
| CrossTask | Prerequisite probabilities | crosstask_run_llm_prereq.sh |
| Ego4D Goal-Step | Video step grounding | ego4d_goalstep_run_lmm_vsg.sh |
| Ego4D Goal-Step | Progress | ego4d_goalstep_run_lmm_prog.sh |
| Ego4D Goal-Step | Prerequisite probabilities | ego4d_goalstep_run_llm_prereq.sh |
| COIN | Video step grounding | coin_run_lmm_vsg.sh |
| COIN | Progress | coin_run_lmm_prog.sh |
| COIN | Prerequisite probabilities | coin_run_llm_prereq.sh |
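For example, assuming the scripts are launched from the repository root (check each script for dataset paths and model options before running), generating the HT-Step video step grounding scores would look like:

bash scripts/htstep_run_lmm_vsg.sh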
These scripts run BaGLM on the LMM/LLM predictions to compute the final metrics reported in the paper.
| Dataset | Script |
|---|---|
| HT-Step | htstep_run_bayes_filter.sh |
| CrossTask | crosstask_run_bayes_filter.sh |
| Ego4D Goal-Step | ego4d_goalstep_run_bayes_filter.sh |
| COIN | coin_run_bayes_filter.sh |
Note: These scripts require the LMM/LLM predictions, either downloaded from Google Drive or generated with the scripts above.
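As an example, and again assuming invocation from the repository root with the required data in place, the HT-Step evaluation would be launched with:

bash scripts/htstep_run_bayes_filter.sh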
This repository builds upon the t2v_metrics codebase. Huge thanks to the authors!
