vision-language model training data example for videos

Hi, thank you for your cool work!

I've read data.md but still don't understand how to build a dataset for training the vision-language model on videos.

Could anyone kindly share an example format of the training dataset?

Thanks