Hi, thank you for your cool work! I've read data.md but still don't understand how to build a dataset for training the vision-language model on videos. Could anyone kindly share an example of the training dataset format? Thanks!