
Questions about your benchmark result in the paper #10


Description

@ArthurZhang02

Thanks for your excellent work and for open‑sourcing the code to the community!

I have a few questions regarding the AIME24/25 and AMC23 benchmark results in Table 2 of your paper “Parallel-R1: Towards Parallel Thinking via Reinforcement Learning.”

  1. Regarding Qwen3‑4B (released 2025‑04): This model provides both a thinking mode and a non‑thinking mode. Which mode did you use during your evaluation and training?

  2. It seems that your reported results differ significantly from those in the Qwen3 technical report (https://arxiv.org/pdf/2505.09388). Their reported AIME24 and AIME25 scores are 25.0% and 19.1%, which are much higher than the values in your table (2.9% and 1.3%). Could this discrepancy be due to your evaluation settings (for example, a limited generation context length) or another factor?

  3. Based on the above, and considering that your code uses max_token = 3000, I suspect that you may have evaluated and trained Qwen3‑4B in thinking mode but with a generation budget that is too short for its reasoning traces (a minimal sketch of this setup is included after this list). Could this be the cause of the large performance gap?
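
For reference, here is a minimal sketch of what I mean by questions 1 and 3. It assumes the Hugging Face transformers API for Qwen3, where `apply_chat_template` exposes an `enable_thinking` flag; the model name, prompt, and token budget are just illustrative, not your actual evaluation setup.

```python
# Sketch (assumes transformers with Qwen3 support) of how the thinking /
# non-thinking modes are toggled and why a small generation budget can
# truncate the reasoning trace.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-4B"  # hypothetical local choice for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "What is 17 * 24?"}]

# enable_thinking=True (the default) makes the model emit a <think>...</think>
# trace before the final answer; enable_thinking=False suppresses it.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# With only ~3000 new tokens, a long AIME-style thinking trace may be cut off
# before the model reaches its final boxed answer, which would depress scores.
outputs = model.generate(**inputs, max_new_tokens=3000)
print(
    tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
    )
)
```

If the evaluation indeed ran with thinking mode on and a ~3000-token cap, that alone could explain much of the gap relative to the Qwen3 technical report numbers.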

Thanks again for your contributions!
