Add fixed-step and fixed-compute evaluation benchmarks for hardware accessibility #519

@yhy19

[Feature Request] Add fixed-step and fixed-compute evaluation benchmarks for hardware accessibility

Motivation

Currently, Parameter Golf evaluation primarily relies on H100 GPU availability with a 10-minute wall-clock time limit. This creates significant barriers for participants who:

  1. Cannot access H100 GPUs - These are expensive and scarce resources not available to many researchers
  2. Use alternative hardware - H20, A100, or consumer GPUs have different performance characteristics
  3. Want to validate strategies - Need reproducible benchmarks that don't depend on specific hardware timing

Current Challenge

A participant training on H20/A100 cannot reliably predict whether their model will pass the H100 evaluation because:

  • Different GPU architectures have different compute/memory bandwidth ratios
  • Wall-clock time varies significantly across hardware (H100 vs A100 vs H20)
  • Optimization strategies that work well on A100 may underperform on H100 and vice versa

Proposed Solution

Add two complementary evaluation modes alongside the existing wall-clock benchmark:

1. Fixed-Step Benchmark

  • Constraint: Train for exactly N gradient steps (e.g., 20,000 steps)
  • Metric: Final validation perplexity
  • Hardware-agnostic: Anyone can run 20,000 steps regardless of GPU
  • Reproducible: Same steps and seed give the same model, up to kernel-level non-determinism
python train_gpt.py --mode fixed_steps --max_steps 20000

2. Fixed-Compute Benchmark

  • Constraint: Use exactly X FLOPs (e.g., 1e18 FLOPs)
  • Metric: Final validation perplexity
  • Fair comparison: Normalize for hardware differences
  • Measures efficiency: How well do you use your compute budget?
python train_gpt.py --mode fixed_flops --max_flops 1e18
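
The two commands above could be wired up with a small argument parser. This is a minimal sketch, not the actual train_gpt.py interface: the flags `--mode`, `--max_steps`, and `--max_flops` are the ones proposed in this issue, while the `wallclock` mode name and the `build_parser` / `parse_args` helpers are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI for the evaluation modes proposed in this issue."""
    parser = argparse.ArgumentParser(description="Sketch of proposed benchmark modes")
    parser.add_argument(
        "--mode",
        choices=["wallclock", "fixed_steps", "fixed_flops"],
        default="wallclock",  # assumed name for the existing 10-minute benchmark
    )
    parser.add_argument("--max_steps", type=int, default=None)
    # type=float so that scientific notation such as 1e18 parses
    parser.add_argument("--max_flops", type=float, default=None)
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # Each fixed-budget mode needs its matching budget flag.
    if args.mode == "fixed_steps" and args.max_steps is None:
        parser.error("--mode fixed_steps requires --max_steps")
    if args.mode == "fixed_flops" and args.max_flops is None:
        parser.error("--mode fixed_flops requires --max_flops")
    return args
```

Rejecting a mode without its budget flag keeps a misconfigured run from silently falling back to an unbounded loop.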

Benefits

For Participants

  • ✅ Validate strategies on accessible hardware (RTX 4090, A100, etc.)
  • ✅ Iterate faster without waiting for H100 access
  • ✅ Compare results with others using different GPUs
  • ✅ Focus on algorithmic improvements rather than hardware tuning

For the Competition

  • ✅ More inclusive - Lowers barrier to entry
  • ✅ More scientific - Separates algorithm quality from hardware-specific optimization
  • ✅ Complementary - Adds a dimension without replacing the wall-clock benchmark
  • ✅ Reproducible - Fixed step/FLOPs budgets do not depend on hardware speed

Implementation Sketch

Option A: Separate Leaderboards

Leaderboards:
1. H100 10-minute (existing)
2. Fixed-Step 20K (new)
3. Fixed-Compute 1e18 FLOPs (new)

Option B: Multi-Dimensional Scoring

Score = α × perplexity_wallclock + β × perplexity_steps + γ × perplexity_flops
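
As a rough illustration of Option B, the weighted sum can be computed as below. The function name and the example weights are hypothetical; the issue does not fix α, β, or γ. Since every term is a perplexity, a lower score is better.

```python
def combined_score(ppl_wallclock: float, ppl_steps: float, ppl_flops: float,
                   alpha: float, beta: float, gamma: float) -> float:
    """Weighted sum of the three validation perplexities (lower is better)."""
    if min(alpha, beta, gamma) < 0:
        raise ValueError("weights must be non-negative")
    return alpha * ppl_wallclock + beta * ppl_steps + gamma * ppl_flops


# Illustrative weights only: half the weight on the existing wall-clock benchmark.
score = combined_score(8.0, 9.0, 10.0, alpha=0.5, beta=0.25, gamma=0.25)
```

If the weights sum to 1, the score stays on the same scale as a single perplexity, which makes it easier to compare against the existing leaderboard.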

Option C: Tiered System

Bronze: Fixed-step benchmark (anyone can try)
Silver: Fixed-compute benchmark (intermediate)
Gold: H100 wall-clock benchmark (final validation)

Technical Considerations

FLOPs Counting

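Can leverage existing tools:

  • torch.profiler.profile(with_flops=True), which estimates FLOPs for a subset of operators (matrix multiplication and 2D convolution)
  • Manual estimate for transformers: training FLOPs ≈ 6 × params × tokens (forward plus backward pass, ignoring attention over the sequence)
  • Track cumulative FLOPs and stop when the budget is reached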
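
The manual estimate and the cumulative tracking can be sketched together as follows. This assumes the 6 × params × tokens approximation from the list above; the `FlopsBudget` class, its field names, and the example model and batch sizes are hypothetical and not taken from the existing training script.

```python
from dataclasses import dataclass


@dataclass
class FlopsBudget:
    """Tracks estimated training FLOPs against a fixed budget."""
    n_params: int          # trainable parameter count (hypothetical input)
    max_flops: float       # total budget, e.g. 1e18
    flops_used: float = 0.0

    def step_cost(self, tokens_in_batch: int) -> float:
        # Approximation: 6 FLOPs per parameter per token (forward + backward).
        return 6.0 * self.n_params * tokens_in_batch

    def can_afford(self, tokens_in_batch: int) -> bool:
        return self.flops_used + self.step_cost(tokens_in_batch) <= self.max_flops

    def charge(self, tokens_in_batch: int) -> None:
        self.flops_used += self.step_cost(tokens_in_batch)


def steps_within_budget(n_params: int, tokens_per_step: int, max_flops: float) -> int:
    """Count how many full steps fit in the budget without exceeding it."""
    budget = FlopsBudget(n_params=n_params, max_flops=max_flops)
    steps = 0
    while budget.can_afford(tokens_per_step):
        budget.charge(tokens_per_step)  # a real loop would run the training step here
        steps += 1
    return steps


# Example with made-up sizes: a 10M-parameter model, 500K tokens per step.
n_steps = steps_within_budget(10_000_000, 500_000, 1e18)
```

Checking the budget before each step means a run never overshoots it, at the cost of leaving a small unused remainder. Equivalently, the budget fixes the token count as tokens ≈ FLOPs / (6 × params).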

Step Counting

Simple counter:

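# Assumes max_steps (int) and train_step() (one optimizer update) are defined by the training script.
global_step = 0
while global_step < max_steps:  # stops after exactly max_steps updates
    train_step()
    global_step += 1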
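
If all three modes share one training loop, the stopping rule could be isolated in a single function. This is a sketch under assumptions: the mode names match the proposed `--mode` values, `wallclock` is a hypothetical name for the existing benchmark, and the defaults mirror the budgets mentioned in this issue (20,000 steps, 1e18 FLOPs, 10 minutes).

```python
def should_stop(mode: str, step: int, flops_used: float, elapsed_s: float,
                max_steps: int = 20_000, max_flops: float = 1e18,
                time_limit_s: float = 600.0) -> bool:
    """Return True once the budget for the selected mode is exhausted."""
    if mode == "fixed_steps":
        return step >= max_steps
    if mode == "fixed_flops":
        return flops_used >= max_flops
    if mode == "wallclock":
        return elapsed_s >= time_limit_s
    raise ValueError(f"unknown mode: {mode!r}")
```

Only the quantity that belongs to the active mode is consulted, so a slow GPU cannot end a fixed-step run early and a fast one cannot extend it.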

Related Work

Similar approaches used by:

  • MLPerf Training: Ranks systems by time to train to a target quality
  • GPT-3 paper: Reports total training compute per model in petaflop/s-days
  • Chinchilla paper: Derives compute-optimal model and data sizes for a fixed FLOPs budget

Example Use Case

Researcher with only A100 access:

  1. Trains model on A100 for 20K steps
  2. Submits to fixed-step leaderboard → Gets 8.5 perplexity
  3. Sees H100 requirement → Requests evaluation
  4. Organizers run on H100 → Validates within time limit
  5. Gets added to main leaderboard

Without a fixed-step benchmark, steps 1-3 are impossible to validate.

Questions for Discussion

  1. Should fixed-step/compute be optional or required submissions?
  2. What are reasonable step/FLOPs budgets? (20K steps? 1e18 FLOPs?)
  3. Should we have separate leaderboards or combine scores?
  4. How to handle non-determinism (e.g., different CUDA kernels)?

Conclusion

Adding fixed-step/compute benchmarks would:

  • Lower barriers for participants without H100 access
  • Improve reproducibility by removing hardware variance
  • Complement existing wall-clock benchmark
  • Maintain rigor while increasing accessibility

Would love to hear community thoughts on this! Happy to help with implementation if there's interest.


cc: @0hq (maintainer)

Related issues: #280 (hardware availability), #402 (evaluation fairness)
