Hi,
I was trying to replicate the finetuning of the Qwen2.5-3B-Instruct model. After training, I evaluated on the MGSM dataset and saw a performance gap of 14 points between the released model and the model I trained.
I used the same setup as the one shared in the sft.sh script.
I was wondering if you used a different hyperparameter setup when training the smaller released models; if so, I'd appreciate it if you could share it with me.
Thanks