Description
Hi Niklas (@Muennighoff),
I'm trying to reproduce the s1/s1.1 evaluation results on newer vLLM versions so I can evaluate newer instruct and base models that aren't supported by the original repo's vLLM (~0.6.x). To do this, I modified `causal_vllms` and the corresponding utils files in the tasks directory.
Two problems:
- AIME scores deviate from published values.
- Budget forcing appears to have no effect in most configurations.
I can reproduce the exact published scores using the original unmodified repository, so this is specific to my vLLM migration.
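For context, the budget-forcing loop I'm trying to port is conceptually the following. This is a model-free sketch under my assumptions: `generate` is a stub standing in for the vLLM call, and the end-of-thinking marker and "Wait" strings are illustrative, not necessarily the repo's exact tokens:

```python
def budget_force(prompt, generate, num_ignores=1, wait_str="Wait",
                 end_think="<|im_end|>"):
    """Suppress the end-of-thinking marker `num_ignores` times, appending
    `wait_str` and re-generating so the model keeps reasoning."""
    completion = generate(prompt)
    for _ in range(num_ignores):
        stripped = completion.rstrip()
        if stripped.endswith(end_think):
            # Drop the end-of-thinking marker so decoding can continue.
            completion = stripped[: -len(end_think)]
        completion += wait_str
        # Continue decoding from the extended context.
        completion += generate(prompt + completion)
    return completion
```

With `num_ignores=0` this reduces to plain generation, so whenever the first generation actually terminates with the marker, Ignore 1 must produce a different (longer) output than Auto.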
### Published reference scores from the paper
| Model | MATH500 | GPQA | AIME24 | AIME25 |
|---|---|---|---|---|
| s1 w/o BF | 92.6 | 56.6 | 50.0 | 26.7 |
| s1 BF "Wait" 1x | 92.8 | 59.6 | 53.3 | 30.0 |
| s1 BF "Wait" 2x | 93.0 | 59.6 | 53.3 | 33.3 |
| s1.1 w/o BF | 94.4 | 60.6 | 56.7 | 50.0 |
| s1.1 BF "Wait" 1x | 95.4 | 62.6 | 56.7 | 50.0 |
| s1.1 BF "Wait" 2x | 95.4 | 63.6 | 56.7 | 50.0 |
### My s1.1 results (modified vLLM)
| Config | AIME24 | AIME25 | GPQA | MATH500 |
|---|---|---|---|---|
| Paper w/o BF | 56.7 | 50.0 | 60.6 | 94.4 |
| Paper BF 1x | 56.7 | 50.0 | 62.6 | 95.4 |
| Paper BF 2x | 56.7 | 50.0 | 63.6 | 95.4 |
| My Auto (no BF) | 70.0 | 43.3 | 63.1 | 94.2 |
| My Ignore 1 wait | 70.0 | 43.3 | 63.1 | 94.2 |
| My Ignore 2 wait | 53.3 | 43.3 | 61.6 | 94.4 |
Key observations:
- Auto and Ignore 1 wait produce identical results, suggesting BF 1x is not actually changing the generation.
- AIME24 is 70.0 without BF (paper: 56.7), then drops to 53.3 with BF 2x — the opposite direction from expected.
- AIME25 consistently undershoots the paper's 50.0 at 43.3.
- GPQA and MATH are reasonably close to published values.
### My s1 results (modified vLLM)
| Config | AIME24 | AIME25 | GPQA | MATH500 |
|---|---|---|---|---|
| Paper w/o BF | 50.0 | 26.7 | 56.6 | 92.6 |
| Paper BF 1x | 53.3 | 30.0 | 59.6 | 92.8 |
| Paper BF 2x | 53.3 | 33.3 | 59.6 | 93.0 |
| My Auto (no BF) | 40.0 | 26.7 | 56.1 | 92.4 |
| My Ignore 1 wait | 40.0 | 26.7 | 56.1 | 92.6 |
| My Ignore 2 wait | 40.0 | 26.7 | 56.1 | 92.4 |
Key observations:
- All three configs produce essentially identical scores; budget forcing has zero effect, which strongly suggests the "Wait" injection is not functioning.
- AIME24 is 40.0 across all configs, 10 points below the paper's w/o BF (50.0).
- AIME25 matches the paper's w/o BF (26.7) but doesn't improve with BF as expected.
- GPQA and MATH are close to but slightly below published values.
### What I've investigated so far
To debug this, I looked at the s2 branch, which uses a local simple-verify library instead of the original chat-style extraction logic. Suspecting my budget-forcing implementation might be incorrect, I tried bypassing it and using the original chat-style extraction approach directly, but I still get incorrect AIME scores. This suggests the issue isn't solely in the extraction pipeline and likely also involves how generation itself differs on newer vLLM versions.
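To separate generation drift from extraction bugs, my next step is to dump the raw completions from both setups and diff them per problem. A small helper along these lines (hypothetical file layout, one JSON record per line; the record fields are my own naming, not the repo's):

```python
import json

def dump_generations(path, records):
    """Write one {"id": ..., "output": ...} JSON record per line."""
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

def diff_generations(path_a, path_b):
    """Return ids whose raw outputs differ between the two runs."""
    def load(path):
        with open(path) as f:
            return {r["id"]: r["output"]
                    for r in (json.loads(line) for line in f)}
    a, b = load(path_a), load(path_b)
    return sorted(i for i in a if a[i] != b.get(i))
```

If the old-vLLM and new-vLLM runs diverge already at the raw-output level (with greedy decoding and identical prompts), the extraction logic can be ruled out for those problems.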
### What I'm looking for
Given the symptoms, there appear to be two distinct issues:
1. Budget forcing is non-functional: The identical scores across Auto/Ignore 1/Ignore 2 (especially for s1) strongly suggest the "Wait" injection is not actually modifying the generation. Any guidance on how the end-of-thinking token suppression and wait appending should interact with newer vLLM versions would be helpful.
2. AIME score divergence even without BF: The s1 AIME24 score (40.0) is 10 points below the paper's w/o BF (50.0), while for s1.1 it overshoots (70.0 vs 56.7). MATH and GPQA are close in both cases. This points to either a generation-level difference in how newer vLLM handles these models, or a subtle difference in the AIME answer extraction for integer answers (000–999).
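For (1), the quickest falsification I can think of is to count the injected string in the saved raw outputs of each run: if BF 1x works, the Ignore-1 outputs should contain roughly one more "Wait" than the corresponding Auto outputs. A sketch (a heuristic, not a proof, since the model can also emit "Wait" naturally and sampling differs across runs):

```python
def wait_counts(outputs, wait_str="Wait"):
    """Per-output occurrence counts of the injected string."""
    return [text.count(wait_str) for text in outputs]

def injection_effective(auto_outputs, bf_outputs, wait_str="Wait"):
    """True if every BF output has strictly more `wait_str` hits
    than its paired no-BF output."""
    return all(b > a for a, b in zip(wait_counts(auto_outputs, wait_str),
                                     wait_counts(bf_outputs, wait_str)))
```

The identical Auto / Ignore-1 scores in my s1 and s1.1 tables above suggest this check would currently fail on essentially every problem.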
Any tips on debugging either of these would be appreciated. Since the unmodified repo reproduces the published scores exactly, the root cause must be somewhere in my vLLM migration.
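For reference on (2), the integer-extraction heuristic I'm comparing against is conceptually this (my own sketch, not the repo's actual extractor; since AIME answers are integers in 000–999, leading zeros must normalize to the same value):

```python
import re

def extract_aime_answer(text):
    """Prefer the last 1-3 digit integer inside \\boxed{...}; otherwise
    fall back to the last standalone 1-3 digit run. Returns int or None."""
    boxed = re.findall(r"\\boxed\{\s*(\d{1,3})\s*\}", text)
    if boxed:
        return int(boxed[-1])  # int() normalizes "042" -> 42
    loose = re.findall(r"(?<!\d)(\d{1,3})(?!\d)", text)
    return int(loose[-1]) if loose else None
```

The lookarounds deliberately skip runs of four or more digits, which fall outside the AIME answer range.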