Add fixed-step and fixed-compute evaluation benchmarks for hardware accessibility

# [Feature Request] Add fixed-step and fixed-compute evaluation benchmarks for hardware accessibility

## Motivation

Currently, Parameter Golf evaluation depends primarily on H100 GPUs, with a 10-minute wall-clock time limit. This creates significant barriers for participants who:

1. **Cannot access H100 GPUs** - These are expensive and scarce resources not available to many researchers
2. **Use alternative hardware** - H20, A100, or consumer GPUs have different performance characteristics
3. **Want to validate strategies** - Need reproducible benchmarks that don't depend on specific hardware timing

### Current Challenge

A participant training on H20/A100 cannot reliably predict whether their model will pass the H100 evaluation because:
- Different GPU architectures have different compute/memory bandwidth ratios
- Wall-clock time varies significantly across hardware (H100 vs A100 vs H20)
- Optimization strategies that work well on A100 may underperform on H100 and vice versa

## Proposed Solution

Add **two complementary evaluation modes** alongside the existing wall-clock benchmark:

### 1. Fixed-Step Benchmark
- **Constraint**: Train for exactly N gradient steps (e.g., 20,000 steps)
- **Metric**: Final validation perplexity
- **Hardware-agnostic**: Anyone can run 20,000 steps regardless of GPU
- **Reproducible**: Same step count and seed should give closely matching models, up to kernel-level non-determinism across GPUs

```bash
python train_gpt.py --mode fixed_steps --max_steps 20000
```

### 2. Fixed-Compute Benchmark
- **Constraint**: Stop once cumulative training compute reaches X FLOPs (e.g., 1e18 FLOPs)
- **Metric**: Final validation perplexity
- **Fair comparison**: Normalize for hardware differences
- **Measures efficiency**: How well do you use your compute budget?

```bash
python train_gpt.py --mode fixed_flops --max_flops 1e18
```
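One way the two modes could share a command-line interface is sketched below. The flag names mirror the commands above; `parse_budget_args` and the `wallclock` default are hypothetical, and nothing here reflects what `train_gpt.py` accepts today.

```python
import argparse

def parse_budget_args(argv=None):
    """Parse the proposed budget flags (hypothetical; not part of train_gpt.py today)."""
    parser = argparse.ArgumentParser(description="Budget-mode sketch")
    parser.add_argument(
        "--mode",
        choices=["wallclock", "fixed_steps", "fixed_flops"],
        default="wallclock",  # assumed default: keep the existing behavior
    )
    parser.add_argument("--max_steps", type=int, default=None)
    parser.add_argument("--max_flops", type=float, default=None)
    args = parser.parse_args(argv)

    # Each fixed-budget mode needs its matching limit.
    if args.mode == "fixed_steps" and args.max_steps is None:
        parser.error("--mode fixed_steps requires --max_steps")
    if args.mode == "fixed_flops" and args.max_flops is None:
        parser.error("--mode fixed_flops requires --max_flops")
    return args

args = parse_budget_args(["--mode", "fixed_flops", "--max_flops", "1e18"])
print(args.mode, args.max_flops)
```

Rejecting a mode without its limit at parse time keeps a misconfigured run from silently falling back to an unbounded loop.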

## Benefits

### For Participants
- ✅ Validate strategies on accessible hardware (RTX 4090, A100, etc.)
- ✅ Iterate faster without waiting for H100 access
- ✅ Compare results with others using different GPUs
- ✅ Focus on algorithmic improvements rather than hardware tuning

### For the Competition
- ✅ More inclusive - Lowers barrier to entry
- ✅ More scientific - Separates algorithm quality from hardware-specific optimization
- ✅ Complementary - Doesn't replace wall-clock benchmark, adds dimension
- ✅ Reproducible - Step and FLOPs budgets are hardware-independent, unlike a wall-clock limit

## Implementation Sketch

### Option A: Separate Leaderboards
```
Leaderboards:
1. H100 10-minute (existing)
2. Fixed-Step 20K (new)
3. Fixed-Compute 1e18 FLOPs (new)
```

### Option B: Multi-Dimensional Scoring
```
Score = α × perplexity_wallclock + β × perplexity_steps + γ × perplexity_flops
```
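As a rough illustration of Option B, the weighted sum could be computed as follows. The function name `combined_score`, the equal default weights, and the requirement that the weights sum to 1 are assumptions made for this sketch, not part of the proposal. Because each term is a perplexity, a lower score is better.

```python
def combined_score(ppl_wallclock, ppl_steps, ppl_flops,
                   alpha=1 / 3, beta=1 / 3, gamma=1 / 3):
    """Weighted sum of the three validation perplexities (lower is better)."""
    weights = (alpha, beta, gamma)
    # Assumption for this sketch: weights are non-negative and sum to 1,
    # so the score stays on the same scale as a single perplexity.
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    return alpha * ppl_wallclock + beta * ppl_steps + gamma * ppl_flops

# Hypothetical perplexities, with the wall-clock result weighted most heavily.
score = combined_score(9.0, 8.5, 8.0, alpha=0.5, beta=0.25, gamma=0.25)
print(score)
```

Any concrete choice of α, β, and γ would need community agreement, since it decides how much the H100 result still dominates the ranking.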

### Option C: Tiered System
```
Bronze: Fixed-step benchmark (anyone can try)
Silver: Fixed-compute benchmark (intermediate)
Gold: H100 wall-clock benchmark (final validation)
```

## Technical Considerations

### FLOPs Counting
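Can leverage existing tools:
- `torch.profiler.profile(with_flops=True)` (estimates FLOPs for supported operators such as matrix multiplication and 2D convolution)
- Manual calculation for transformers: `FLOPs ≈ 6 × params × tokens` (forward plus backward pass, ignoring attention-specific terms)
- Track cumulative FLOPs and stop when the budget is reached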
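The manual estimate and the cumulative budget check can be sketched together as below. The model size and batch size are hypothetical numbers chosen only for illustration, and the helper names are not taken from the existing codebase.

```python
def flops_per_step(n_params: int, tokens_per_step: int) -> float:
    """Approximate training FLOPs for one step: 6 * params * tokens."""
    return 6.0 * n_params * tokens_per_step

def steps_within_budget(max_flops: float, n_params: int, tokens_per_step: int) -> int:
    """Largest whole number of steps whose cumulative FLOPs stay within max_flops."""
    return int(max_flops // flops_per_step(n_params, tokens_per_step))

# Hypothetical run: 20M parameters, 524,288 tokens per step, 1e18 FLOPs budget.
n_params = 20_000_000
tokens_per_step = 524_288
max_flops = 1e18

cumulative_flops = 0.0
global_step = 0
step_cost = flops_per_step(n_params, tokens_per_step)
# Stop before the step that would push the run over budget.
while cumulative_flops + step_cost <= max_flops:
    # A real loop would run one training step here.
    cumulative_flops += step_cost
    global_step += 1

print(global_step, steps_within_budget(max_flops, n_params, tokens_per_step))
```

Because steps are discrete, the run ends slightly under the budget rather than exactly on it, so the budget is best read as an upper bound.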

### Step Counting
Simple counter:
```python
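max_steps = 20_000  # fixed-step budget (example value from this proposal)
def train_step():
    pass  # placeholder: one forward/backward pass plus optimizer update

global_step = 0
while global_step < max_steps:
    train_step()
    global_step += 1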
```

## Related Work

Similar approaches are used by:
- **MLPerf Training**: Measures time-to-train to a fixed quality target across hardware platforms
- **GPT-3 paper**: Reports total training compute (in petaflop/s-days) for each model size
- **Chinchilla paper**: Derives compute-optimal scaling laws under a fixed FLOPs budget

## Example Use Case

Researcher with only A100 access:
1. Trains model on A100 for 20K steps
2. Submits to fixed-step leaderboard → Gets 8.5 perplexity
3. Sees H100 requirement → Requests evaluation  
4. Organizers run on H100 → Validates within time limit
5. Gets added to main leaderboard

Without a fixed-step benchmark, steps 1-3 are impossible to validate.

## Questions for Discussion

1. Should fixed-step/compute be **optional** or **required** submissions?
2. What are reasonable step/FLOPs budgets? (20K steps? 1e18 FLOPs?)
3. Should we have separate leaderboards or combine scores?
4. How to handle non-determinism (e.g., different CUDA kernels)?

## Conclusion

Adding fixed-step/compute benchmarks would:
- **Lower barriers** for participants without H100 access
- **Improve reproducibility** by removing the dependence on hardware speed
- **Complement** existing wall-clock benchmark
- **Maintain rigor** while increasing accessibility

Would love to hear community thoughts on this! Happy to help with implementation if there's interest.

---

**cc:** @0hq (maintainer)

**Related issues:** #280 (hardware availability), #402 (evaluation fairness)


Add fixed-step and fixed-compute evaluation benchmarks for hardware accessibility #519

[Feature Request] Add fixed-step and fixed-compute evaluation benchmarks for hardware accessibility

Motivation

Current Challenge

Proposed Solution

1. Fixed-Step Benchmark

2. Fixed-Compute Benchmark

Benefits

For Participants

For the Competition

Implementation Sketch

Option A: Separate Leaderboards

Option B: Multi-Dimensional Scoring

Option C: Tiered System

Technical Considerations

FLOPs Counting

Step Counting

Related Work

Example Use Case

Questions for Discussion

Conclusion

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Add fixed-step and fixed-compute evaluation benchmarks for hardware accessibility #519

Description

[Feature Request] Add fixed-step and fixed-compute evaluation benchmarks for hardware accessibility

Motivation

Current Challenge

Proposed Solution

1. Fixed-Step Benchmark

2. Fixed-Compute Benchmark

Benefits

For Participants

For the Competition

Implementation Sketch

Option A: Separate Leaderboards

Option B: Multi-Dimensional Scoring

Option C: Tiered System

Technical Considerations

FLOPs Counting

Step Counting

Related Work

Example Use Case

Questions for Discussion

Conclusion

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions