The technical report for EgoVideo on Ego4D NLQ says that the ViT-1B of EgoVideo is used to extract a video feature for each snippet, where each snippet contains s = 16 consecutive frames with stride = 16. However, the released 4-frame model does not appear to encode 16 frames at once. Could you elaborate on how exactly the feature extraction was done?
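For reference, here is the interpretation I was guessing at: a minimal sketch, assuming the 16-frame snippet is split into four consecutive 4-frame clips and the clip features are mean-pooled. `encode_4_frames` is a dummy stand-in, not the released model's actual API:

```python
import numpy as np

def encode_4_frames(clip: np.ndarray) -> np.ndarray:
    """Stand-in for the 4-frame video encoder: (4, H, W, 3) -> (D,).
    Dummy implementation; the real model would produce a learned embedding."""
    assert clip.shape[0] == 4
    return clip.reshape(4, -1).mean(axis=1)  # dummy feature of dim 4

def snippet_feature(snippet: np.ndarray) -> np.ndarray:
    """Hypothetical: feature for a 16-frame snippet via four
    non-overlapping 4-frame clips, mean-pooled."""
    assert snippet.shape[0] == 16
    clips = snippet.reshape(4, 4, *snippet.shape[1:])  # (4 clips, 4 frames, H, W, 3)
    feats = np.stack([encode_4_frames(c) for c in clips])
    return feats.mean(axis=0)

snippet = np.random.rand(16, 32, 32, 3).astype(np.float32)
feat = snippet_feature(snippet)
print(feat.shape)
```

Is this roughly what was done, or was the 16-frame input handled differently (e.g. temporal subsampling to 4 frames, or an interpolated positional embedding)?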