Hi, I am wondering what the reasoning is behind the evaluation implemented in `evaluateFromListSave`. As far as I can tell, it loads two audio files, runs the audio feature extractor on each, and computes the feature-wise cosine distance between the resulting features. Where is the video pipeline in this? How can this be a good evaluation metric without using the visual stream at all?
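To make sure I'm reading the code correctly, here is a rough sketch of the comparison I believe is happening (this is my own illustrative code, not the repo's implementation; `cosine_distance` is a hypothetical helper):

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Frame-wise cosine distance between two feature matrices,
    each of shape (num_frames, feat_dim)."""
    num = np.sum(a * b, axis=1)
    denom = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return 1.0 - num / np.maximum(denom, 1e-12)  # guard against zero norms

# Toy stand-ins for extracted audio features of two files:
feats_a = np.array([[1.0, 0.0], [0.0, 1.0]])
feats_b = np.array([[1.0, 0.0], [1.0, 0.0]])
print(cosine_distance(feats_a, feats_b))  # identical frame -> 0, orthogonal frame -> 1
```

If that sketch matches the intent, then the metric only measures audio-to-audio feature similarity, which is the source of my confusion.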