Hi, team.
I am very grateful that you provided the code and data splits for your CPC audio paper (https://arXiv.org/abs/2002.02848).
First, I pretrained Mod. CPC on Libri-100 and froze the features for the Common Voice 1-hour ASR task. I got an average PER of 45.2% over 5 languages (es, fr, it, ru, tt), versus the 43.9% reported in your paper (Table 3). I think my result is close to yours (1.3% absolute worse), which seems reasonable.
But when I tested the pretrained features on the 5-hour Common Voice ASR tasks (es, fr, it, ru, tt), I only got an average PER of 42.5% with frozen features, a large gap of 3.7% absolute from the reported PER (38.8%, Table 5 in the paper). When fine-tuning the features, the gap was even bigger: the average PER was 37.2%, while the paper reports 31.0%.
Unfortunately, the 5-hour Common Voice ASR experiments also perform badly when training from scratch: an average PER of 43.2%, far behind the 38.3% reported in your paper.
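To make the comparison easier to scan, here is a small sketch that tabulates the absolute PER gaps from the numbers quoted above. The setting labels and the helper name `per_gaps` are only illustrative; the PER values are the ones given in this message.

```python
# (setting, my average PER %, paper average PER %), copied from the text above
results = [
    ("1h frozen", 45.2, 43.9),        # paper: Table 3
    ("5h frozen", 42.5, 38.8),        # paper: Table 5
    ("5h fine-tuned", 37.2, 31.0),
    ("5h from scratch", 43.2, 38.3),
]


def per_gaps(rows):
    """Return {setting: absolute PER gap}, positive when my PER is worse."""
    return {name: round(mine - paper, 1) for name, mine, paper in rows}


gaps = per_gaps(results)
for name, gap in gaps.items():
    print(f"{name}: {gap:+.1f}% absolute PER vs. paper")
```

The 1-hour gap is small, while all three 5-hour settings (including training from scratch) are off by several points, which is why I suspect a hyper-parameter or architecture difference rather than a problem with the pretrained features.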
I would be very thankful if you could provide more detailed hyper-parameters to help me reproduce your results.
In particular, I noticed there is an optional argument --LSTM in ./eval/common_voices_eval.py that adds an LSTM layer before the linear softmax layer. I think it would significantly increase the model capacity and may lead to better performance. Did you use it?
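As a rough illustration of the capacity concern, the sketch below counts trainable parameters for a linear phone classifier with and without a preceding LSTM layer. It assumes a single-layer, unidirectional LSTM with the PyTorch `nn.LSTM` parameter layout (two bias vectors), and the dimensions (256-dimensional features, 256 hidden units, 40 phone classes) are placeholder values of mine, not numbers taken from the repository or the paper.

```python
def lstm_param_count(input_dim, hidden_dim):
    """Parameters of one unidirectional LSTM layer (PyTorch nn.LSTM layout).

    4 gates, each with input weights, recurrent weights, and two bias vectors.
    """
    weights = 4 * hidden_dim * (input_dim + hidden_dim)
    biases = 2 * 4 * hidden_dim
    return weights + biases


def linear_param_count(input_dim, n_classes):
    """Parameters of a linear layer: weight matrix plus bias."""
    return input_dim * n_classes + n_classes


# Placeholder dimensions (assumptions for illustration only).
FEATURE_DIM = 256
HIDDEN_DIM = 256
N_PHONES = 40

linear_only = linear_param_count(FEATURE_DIM, N_PHONES)
with_lstm = lstm_param_count(FEATURE_DIM, HIDDEN_DIM) + linear_param_count(HIDDEN_DIM, N_PHONES)

print(f"linear head only: {linear_only} parameters")
print(f"LSTM + linear head: {with_lstm} parameters")
```

Under these assumed sizes the LSTM head has far more parameters than the linear head alone, so whether --LSTM was enabled could plausibly matter for the reported numbers.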
Thank you very much!
For now, I have used the default hyper-parameters for the Common Voice ASR transfer experiments:
--batchSize 8
--lr 2e-4
--nEpoch 30
--kernelSize 8
......