Description
Thanks for your excellent work and for open‑sourcing the code to the community!
I have a few questions regarding the AIME24/25 and AMC23 benchmark results in Table 2 of your paper “Parallel-R1: Towards Parallel Thinking via Reinforcement Learning.”
- Regarding Qwen3‑4B (released 2025‑04): this model provides both a thinking mode and a non‑thinking mode. Which mode did you use during evaluation and training?
- It seems that your reported results differ significantly from those in the Qwen3 technical report (https://arxiv.org/pdf/2505.09388): it reports AIME24 and AIME25 scores of 25.0% and 19.1%, much higher than the values in your table (2.9% and 1.3%). Could this discrepancy be due to your evaluation settings (for example, a limited generation context length) or another factor?
- Based on the above, and given that your code uses `max_token = 3000`, I suspect you may have evaluated and trained Qwen3‑4B in thinking mode but with a generation length too short for its reasoning traces. Could this be the cause of the large performance gap? See the sketch after this list for what I mean.
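For concreteness, here is a minimal sketch of how Qwen3's mode switch interacts with the generation budget, using the `enable_thinking` flag documented on the Qwen3 model card. The model name is the public Hugging Face checkpoint; the prompt and token budgets are illustrative assumptions, not your actual evaluation config:

```python
# Minimal sketch (assumes the public Qwen/Qwen3-4B checkpoint; prompt and budgets are illustrative).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-4B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Solve the competition problem ..."}]  # placeholder problem

# Thinking mode: the model first emits a long <think>...</think> trace, so a
# 3000-token budget can truncate generation before the final answer ever appears.
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
    enable_thinking=True,  # Qwen3's documented mode switch
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32768)  # large budget, as the model card recommends for thinking mode

# Non-thinking mode: no reasoning trace, so ~3000 new tokens is usually sufficient.
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
    enable_thinking=False,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=3000)
```

If the evaluation ran with thinking mode enabled but capped at `max_token = 3000`, most outputs would be cut off inside the thinking trace, which would be consistent with the large gap from the technical report's numbers.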
Thanks again for your contributions!