Hello,
I have a question about the computation of xsim and xsim++ for speech using the FLEURS dataset, which you report in your SONAR paper (Table 6).
When preparing the FLEURS test set, we noticed that there are many wav files for the same textual sentence (X-eng pairs). I am wondering how you computed the xsim and xsim++ scores, since the xsim code expects X and Y (source and target) to have the same shape. Did you randomly pick just one wav file per sentence, so that you end up with exactly one wav file per sentence? If not, how would you choose the right audio when multiple recordings share the same English translation?
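For concreteness, here is a minimal sketch of what we mean by keeping one wav per sentence. It assumes a simple list of (wav path, transcript) pairs and keeps the first recording seen for each transcript; the data layout and the "keep first" choice are our assumptions, not something from your pipeline:

```python
# Hypothetical sketch: deduplicate FLEURS-style (wav, transcript) pairs so each
# sentence keeps exactly one audio file, making X (speech) and Y (text) the
# same length before computing xsim. The "keep the first wav" policy is an
# assumption for illustration only.
def one_wav_per_sentence(pairs):
    """Keep the first wav seen for each transcript, preserving order."""
    seen = set()
    deduped = []
    for wav_path, transcript in pairs:
        if transcript not in seen:
            seen.add(transcript)
            deduped.append((wav_path, transcript))
    return deduped

pairs = [
    ("utt_001.wav", "the cat sat"),
    ("utt_002.wav", "the cat sat"),   # second recording of the same sentence
    ("utt_003.wav", "a dog barked"),
]
print(one_wav_per_sentence(pairs))
# → [('utt_001.wav', 'the cat sat'), ('utt_003.wav', 'a dog barked')]
```

Is this roughly what you did, or did you use a different selection criterion (e.g. random sampling, or averaging embeddings over the duplicate recordings)?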
Thank you !!