Description
Hi Niklas (@Muennighoff),
I'm trying to reproduce the s1/s1.1 evaluation results on newer vLLM versions so I can evaluate newer instruct and base models that aren't supported by the original repo's vLLM (~0.6.x). To do this, I modified `causal_vllms` and the corresponding utils files in the tasks directory.
Two problems:
- AIME scores deviate from published values.
- Budget forcing appears to have no effect in most configurations.
I can reproduce the exact published scores using the original unmodified repository, so this is specific to my vLLM migration.
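For context, the budget-forcing loop I'm trying to port is conceptually the following. This is a model-free sketch under my assumptions: `generate` is a stub standing in for the vLLM call, and the end-of-thinking marker and "Wait" strings are illustrative, not necessarily the repo's exact tokens:

```python
def budget_force(prompt, generate, num_ignores=1, wait_str="Wait",
                 end_think="<|im_end|>"):
    """Suppress the end-of-thinking marker `num_ignores` times, appending
    `wait_str` and re-generating so the model keeps reasoning."""
    completion = generate(prompt)
    for _ in range(num_ignores):
        stripped = completion.rstrip()
        if stripped.endswith(end_think):
            # Drop the end-of-thinking marker so decoding can continue.
            completion = stripped[: -len(end_think)]
        completion += wait_str
        # Continue decoding from the extended context.
        completion += generate(prompt + completion)
    return completion
```

With `num_ignores=0` this reduces to plain generation, so whenever the first generation actually terminates with the marker, Ignore 1 must produce a different (longer) output than Auto.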
### Published reference scores from the paper
| Model | MATH500 | GPQA | AIME24 | AIME25 |
|---|---|---|---|---|
| s1 w/o BF | 92.6 | 56.6 | 50.0 | 26.7 |
| s1 BF "Wait" 1x | 92.8 | 59.6 | 53.3 | 30.0 |
| s1 BF "Wait" 2x | 93.0 | 59.6 | 53.3 | 33.3 |
| s1.1 w/o BF | 94.4 | 60.6 | 56.7 | 50.0 |
| s1.1 BF "Wait" 1x | 95.4 | 62.6 | 56.7 | 50.0 |
| s1.1 BF "Wait" 2x | 95.4 | 63.6 | 56.7 | 50.0 |
### My s1.1 results (modified vLLM)
| Config | AIME24 | AIME25 | GPQA | MATH500 |
|---|---|---|---|---|
| Paper w/o BF | 56.7 | 50.0 | 60.6 | 94.4 |
| Paper BF 1x | 56.7 | 50.0 | 62.6 | 95.4 |
| Paper BF 2x | 56.7 | 50.0 | 63.6 | 95.4 |
| My Auto (no BF) | 70.0 | 43.3 | 63.1 | 94.2 |
| My Ignore 1 wait | 70.0 | 43.3 | 63.1 | 94.2 |
| My Ignore 2 wait | 53.3 | 43.3 | 61.6 | 94.4 |
Key observations:
- Auto and Ignore 1 wait produce identical results, suggesting BF 1x is not actually changing the generation.
- AIME24 is 70.0 without BF (paper: 56.7), then drops to 53.3 with BF 2x — the opposite direction from expected.
- AIME25 consistently undershoots the paper's 50.0 at 43.3.
- GPQA and MATH are reasonably close to published values.
### My s1 results (modified vLLM)
| Config | AIME24 | AIME25 | GPQA | MATH500 |
|---|---|---|---|---|
| Paper w/o BF | 50.0 | 26.7 | 56.6 | 92.6 |
| Paper BF 1x | 53.3 | 30.0 | 59.6 | 92.8 |
| Paper BF 2x | 53.3 | 33.3 | 59.6 | 93.0 |
| My Auto (no BF) | 40.0 | 26.7 | 56.1 | 92.4 |
| My Ignore 1 wait | 40.0 | 26.7 | 56.1 | 92.6 |
| My Ignore 2 wait | 40.0 | 26.7 | 56.1 | 92.4 |
Key observations:
- All three configs produce essentially identical scores; budget forcing has zero effect, which strongly suggests the "Wait" injection is not functioning.
- AIME24 is 40.0 across all configs, 10 points below the paper's w/o BF (50.0).
- AIME25 matches the paper's w/o BF (26.7) but doesn't improve with BF as expected.
- GPQA and MATH are close to but slightly below published values.
### What I've investigated so far
To debug this, I looked at the s2 branch, which uses a local simple-verify library instead of the original chat-style extraction logic. Suspecting my budget-forcing implementation might be incorrect, I tried bypassing it and using the original chat-style extraction approach directly, but I still get incorrect AIME scores. This suggests the issue isn't solely in the extraction pipeline and likely also involves how generation itself differs on newer vLLM versions.
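To separate generation drift from extraction bugs, my next step is to dump the raw completions from both setups and diff them per problem. A small helper along these lines (hypothetical file layout, one JSON record per line; the record fields are my own naming, not the repo's):

```python
import json

def dump_generations(path, records):
    """Write one {"id": ..., "output": ...} JSON record per line."""
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

def diff_generations(path_a, path_b):
    """Return ids whose raw outputs differ between the two runs."""
    def load(path):
        with open(path) as f:
            return {r["id"]: r["output"]
                    for r in (json.loads(line) for line in f)}
    a, b = load(path_a), load(path_b)
    return sorted(i for i in a if a[i] != b.get(i))
```

If the old-vLLM and new-vLLM runs diverge already at the raw-output level (with greedy decoding and identical prompts), the extraction logic can be ruled out for those problems.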
### What I'm looking for
Given the symptoms, there appear to be two distinct issues:
1. Budget forcing is non-functional: The identical scores across Auto/Ignore 1/Ignore 2 (especially for s1) strongly suggest the "Wait" injection is not actually modifying the generation. Any guidance on how the end-of-thinking token suppression and wait appending should interact with newer vLLM versions would be helpful.
2. AIME score divergence even without BF: The s1 AIME24 score (40.0) is 10 points below the paper's w/o BF (50.0), while for s1.1 it overshoots (70.0 vs 56.7). MATH and GPQA are close in both cases. This points to either a generation-level difference in how newer vLLM handles these models, or a subtle difference in the AIME answer extraction for integer answers (000–999).
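For (1), the quickest falsification I can think of is to count the injected string in the saved raw outputs of each run: if BF 1x works, the Ignore-1 outputs should contain roughly one more "Wait" than the corresponding Auto outputs. A sketch (a heuristic, not a proof, since the model can also emit "Wait" naturally and sampling differs across runs):

```python
def wait_counts(outputs, wait_str="Wait"):
    """Per-output occurrence counts of the injected string."""
    return [text.count(wait_str) for text in outputs]

def injection_effective(auto_outputs, bf_outputs, wait_str="Wait"):
    """True if every BF output has strictly more `wait_str` hits
    than its paired no-BF output."""
    return all(b > a for a, b in zip(wait_counts(auto_outputs, wait_str),
                                     wait_counts(bf_outputs, wait_str)))
```

The identical Auto / Ignore-1 scores in my s1 and s1.1 tables above suggest this check would currently fail on essentially every problem.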
Any tips on debugging either of these would be appreciated. Since the unmodified repo reproduces the published scores exactly, the root cause must be somewhere in my vLLM migration.
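For reference on (2), the integer-extraction heuristic I'm comparing against is conceptually this (my own sketch, not the repo's actual extractor; since AIME answers are integers in 000–999, leading zeros must normalize to the same value):

```python
import re

def extract_aime_answer(text):
    """Prefer the last 1-3 digit integer inside \\boxed{...}; otherwise
    fall back to the last standalone 1-3 digit run. Returns int or None."""
    boxed = re.findall(r"\\boxed\{\s*(\d{1,3})\s*\}", text)
    if boxed:
        return int(boxed[-1])  # int() normalizes "042" -> 42
    loose = re.findall(r"(?<!\d)(\d{1,3})(?!\d)", text)
    return int(loose[-1]) if loose else None
```

The lookarounds deliberately skip runs of four or more digits, which fall outside the AIME answer range.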