[Feature Request] Add fixed-step and fixed-compute evaluation benchmarks for hardware accessibility
Motivation
Currently, Parameter Golf evaluation primarily relies on H100 GPU availability with a 10-minute wall-clock time limit. This creates significant barriers for participants who:
- Cannot access H100 GPUs - These are expensive and scarce resources not available to many researchers
- Use alternative hardware - H20, A100, or consumer GPUs have different performance characteristics
- Want to validate strategies - Need reproducible benchmarks that don't depend on specific hardware timing
Current Challenge
A participant training on H20/A100 cannot reliably predict whether their model will pass the H100 evaluation because:
- Different GPU architectures have different compute/memory bandwidth ratios
- Wall-clock time varies significantly across hardware (H100 vs A100 vs H20)
- Optimization strategies that work well on A100 may underperform on H100 and vice versa
Proposed Solution
Add two complementary evaluation modes alongside the existing wall-clock benchmark:
1. Fixed-Step Benchmark
- Constraint: Train for exactly N gradient steps (e.g., 20,000 steps)
- Metric: Final validation perplexity
- Hardware-agnostic: Anyone can run 20,000 steps regardless of GPU
- Reproducible: Same steps = same model (given fixed seed)
```
python train_gpt.py --mode fixed_steps --max_steps 20000
```
2. Fixed-Compute Benchmark
- Constraint: Use exactly X FLOPs (e.g., 1e18 FLOPs)
- Metric: Final validation perplexity
- Fair comparison: Normalize for hardware differences
- Measures efficiency: How well do you use your compute budget?
```
python train_gpt.py --mode fixed_flops --max_flops 1e18
```
Benefits
For Participants
- ✅ Validate strategies on accessible hardware (RTX 4090, A100, etc.)
- ✅ Iterate faster without waiting for H100 access
- ✅ Compare results with others using different GPUs
- ✅ Focus on algorithmic improvements rather than hardware tuning
For the Competition
- ✅ More inclusive - Lowers barrier to entry
- ✅ More scientific - Separates algorithm quality from hardware-specific optimization
- ✅ Complementary - Doesn't replace the wall-clock benchmark; adds a new dimension
- ✅ Reproducible - Fixed steps/FLOPs are deterministic
Implementation Sketch
Option A: Separate Leaderboards
Leaderboards:
1. H100 10-minute (existing)
2. Fixed-Step 20K (new)
3. Fixed-Compute 1e18 FLOPs (new)
Option B: Multi-Dimensional Scoring
Score = α × perplexity_wallclock + β × perplexity_steps + γ × perplexity_flops
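A minimal sketch of how Option B's combined score could be computed. The weights and the perplexity inputs below are purely illustrative; no official values for α, β, γ are proposed here, and `combined_score` is a hypothetical name.

```python
def combined_score(ppl_wallclock, ppl_steps, ppl_flops,
                   alpha=0.5, beta=0.25, gamma=0.25):
    """Weighted sum of the three benchmark perplexities.

    Lower is better, matching perplexity's orientation. The default
    weights are placeholders, not a proposed official scoring rule.
    """
    return alpha * ppl_wallclock + beta * ppl_steps + gamma * ppl_flops

# Example: a run measured under all three benchmarks.
score = combined_score(8.2, 8.5, 8.4)  # ≈ 8.325 with the placeholder weights
```

One design question this raises: the three perplexities may sit on different scales (a 10-minute run sees far more tokens than a 20K-step run), so normalizing each term before weighting may be fairer than a raw sum.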
Option C: Tiered System
Bronze: Fixed-step benchmark (anyone can try)
Silver: Fixed-compute benchmark (intermediate)
Gold: H100 wall-clock benchmark (final validation)
Technical Considerations
FLOPs Counting
Can leverage existing tools:
- `torch.profiler.profile(with_flops=True)`
- Manual calculation for transformers: FLOPs ≈ 6 × params × tokens
- Track cumulative FLOPs and stop when the budget is reached
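The cumulative-FLOPs stopping rule above can be sketched in pure Python using the standard 6 × params × tokens approximation. All names here (`run_fixed_flops`, `n_params`, `tokens_per_step`, `train_step`) are illustrative, not part of any existing training script.

```python
def run_fixed_flops(n_params, tokens_per_step, max_flops, train_step):
    """Run train_step() until the estimated FLOPs budget is exhausted.

    Uses FLOPs ≈ 6 * params * tokens per training step; stops *before*
    a step that would exceed the budget, so the run never goes over.
    """
    flops_per_step = 6 * n_params * tokens_per_step
    cumulative_flops = 0
    steps = 0
    while cumulative_flops + flops_per_step <= max_flops:
        train_step()
        cumulative_flops += flops_per_step
        steps += 1
    return steps, cumulative_flops

# Example: a 124M-parameter model, ~0.5M tokens per step, 1e18 FLOPs budget.
steps, used = run_fixed_flops(124e6, 524_288, 1e18, train_step=lambda: None)
```

Stopping before the budget-exceeding step (rather than after it) makes runs comparable: every submission uses at most `max_flops`, regardless of batch size.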
Step Counting
Simple counter:
```
global_step = 0
while global_step < max_steps:
    train_step()
    global_step += 1
```
Related Work
Similar approaches used by:
- MLPerf Training: Uses both time-to-accuracy and FLOPs-to-accuracy
- GPT-3 paper: Reports FLOPs alongside wall-clock time
- Chinchilla paper: FLOPs-optimal scaling laws
Example Use Case
Researcher with only A100 access:
1. Trains model on A100 for 20K steps
2. Submits to fixed-step leaderboard → gets 8.5 perplexity
3. Sees H100 requirement → requests evaluation
4. Organizers run on H100 → validates within the time limit
5. Gets added to main leaderboard
Without the fixed-step benchmark, steps 1-3 would be impossible to validate.
Questions for Discussion
- Should fixed-step/compute be optional or required submissions?
- What are reasonable step/FLOPs budgets? (20K steps? 1e18 FLOPs?)
- Should we have separate leaderboards or combine scores?
- How to handle non-determinism (e.g., different CUDA kernels)?
Conclusion
Adding fixed-step/compute benchmarks would:
- Lower barriers for participants without H100 access
- Improve reproducibility by removing hardware variance
- Complement existing wall-clock benchmark
- Maintain rigor while increasing accessibility
Would love to hear community thoughts on this! Happy to help with implementation if there's interest.
cc: @0hq (maintainer)
Related issues: #280 (hardware availability), #402 (evaluation fairness)