I tried "GRPO_From_Scratch" and learned a lot. Thanks!
A small issue: during training/inference, Qwen1.5 keeps generating text even after it has reached the answer, for example:
. . . <answer>66</answer>Human: In a classroom there are 30 students who all need individual attention from the teacher due to special needs. The school has two types of chairs available - standard . . .
I saw this during training/inference on math tasks, where it usually had no impact. But for some tasks, the extra text after </answer> might affect the reward calculation.
Has anyone considered how to prevent this?
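
One workaround I can imagine, as a minimal sketch only: cut the completion at the first closing answer tag before it is passed to the reward function. The helper name `truncate_after_answer` and the `end_tag` parameter are hypothetical and not part of GRPO_From_Scratch; I am also assuming the reward function takes the completion as a plain string.

```python
def truncate_after_answer(completion: str, end_tag: str = "</answer>") -> str:
    """Keep text up to and including the first end_tag.

    Hypothetical helper, not from the repo. If end_tag never appears,
    the completion is returned unchanged.
    """
    idx = completion.find(end_tag)
    if idx == -1:
        return completion
    return completion[: idx + len(end_tag)]


sample = "<answer>66</answer>Human: In a classroom there are 30 students"
print(truncate_after_answer(sample))
```

This only cleans up what the reward function sees. The model still spends tokens on the trailing text, so actually stopping generation would need a stop string or stopping criterion in whatever generation call is used.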