The technical report for EgoVideo on Ego4D NLQ says that the ViT-1B of EgoVideo is used to extract a video feature for each snippet, where each snippet contains s = 16 consecutive frames with stride = 16. However, the released 4-frame model does not appear to encode 16 frames at once. Could you elaborate on how exactly the feature extraction was done?
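For reference, here is the interpretation I was guessing at: a minimal sketch, assuming the 16-frame snippet is split into four consecutive 4-frame clips and the clip features are mean-pooled. `encode_4_frames` is a dummy stand-in, not the released model's actual API:

```python
import numpy as np

def encode_4_frames(clip: np.ndarray) -> np.ndarray:
    """Stand-in for the 4-frame video encoder: (4, H, W, 3) -> (D,).
    Dummy implementation; the real model would produce a learned embedding."""
    assert clip.shape[0] == 4
    return clip.reshape(4, -1).mean(axis=1)  # dummy feature of dim 4

def snippet_feature(snippet: np.ndarray) -> np.ndarray:
    """Hypothetical: feature for a 16-frame snippet via four
    non-overlapping 4-frame clips, mean-pooled."""
    assert snippet.shape[0] == 16
    clips = snippet.reshape(4, 4, *snippet.shape[1:])  # (4 clips, 4 frames, H, W, 3)
    feats = np.stack([encode_4_frames(c) for c in clips])
    return feats.mean(axis=0)

snippet = np.random.rand(16, 32, 32, 3).astype(np.float32)
feat = snippet_feature(snippet)
print(feat.shape)
```

Is this roughly what was done, or was the 16-frame input handled differently (e.g. temporal subsampling to 4 frames, or an interpolated positional embedding)?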