
Experiment: Muon optimizer tuning — momentum, Newton-Schulz, ablation #7

@Jason-Adam

Objective

Tune Muon-specific parameters and ablate the hybrid optimizer design.

Current Muon Config

| Parameter | Value | Location |
|---|---|---|
| Momentum | 0.85→0.95 over 300 steps | get_muon_momentum() |
| ns_steps | 5 | Newton-Schulz iterations |
| beta2 | 0.95 | NorMuon variance reduction |

Momentum Schedule Experiments

| ID | Change | Priority |
|---|---|---|
| MM-1 | Start 0.9 (narrower ramp 0.9→0.95) | MEDIUM |
| MM-2 | End 0.99 (wider ramp 0.85→0.99) | MEDIUM |
| MM-3 | Faster ramp (150 steps instead of 300) | MEDIUM |
| MM-4 | Slower ramp (600 steps) | LOW |
| MM-5 | Constant 0.95 (no ramp) | MEDIUM |
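
For reference, the schedule variants above can be expressed as overrides to one ramp function. This is a minimal sketch that assumes the ramp is linear and then held constant; the issue does not state the ramp shape, and `muon_momentum` and `VARIANTS` are hypothetical names, not the actual `get_muon_momentum()` implementation.

```python
def muon_momentum(step, start=0.85, end=0.95, ramp_steps=300):
    """Ramp momentum from `start` to `end` over `ramp_steps`, then hold at `end`.

    Assumption: the ramp is linear. The defaults mirror the current config.
    """
    if ramp_steps <= 0:
        return end
    frac = min(step / ramp_steps, 1.0)
    return (1.0 - frac) * start + frac * end


# Each experiment ID maps to the keyword overrides it needs (hypothetical layout).
VARIANTS = {
    "baseline": {},
    "MM-1": {"start": 0.90},      # narrower ramp 0.9 -> 0.95
    "MM-2": {"end": 0.99},        # wider ramp 0.85 -> 0.99
    "MM-3": {"ramp_steps": 150},  # faster ramp
    "MM-4": {"ramp_steps": 600},  # slower ramp
    "MM-5": {"start": 0.95},      # constant 0.95, since start == end
}

if __name__ == "__main__":
    for run_id, overrides in VARIANTS.items():
        values = [round(muon_momentum(s, **overrides), 4) for s in (0, 150, 300, 600)]
        print(run_id, values)
```

Keeping every variant as a small override dict makes it easy to confirm that each run changes exactly one aspect of the schedule.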

Newton-Schulz Iterations

| ID | ns_steps | Priority | Rationale |
|---|---|---|---|
| NS-1 | 3 | MEDIUM | Faster per step → more training steps in 5 min |
| NS-2 | 7 | LOW | Better orthogonalization but slower |

Key insight: ns_steps=3 might win on wall-clock time. Each step is cheaper, so more training steps fit in the 5-minute budget, even if each orthogonalization is slightly less accurate.
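
To make the accuracy side of this trade-off concrete, here is a small sketch that measures how far the result is from orthogonal after 3, 5, and 7 iterations. It uses the classic cubic Newton-Schulz update `X ← 1.5·X − 0.5·X·Xᵀ·X` in pure Python for clarity. That is an assumption for illustration only: the Muon implementation in this repo may use a different polynomial and coefficients, and the test matrix `G` is made up.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz(g, ns_steps):
    """Approximately orthogonalize `g` with the classic cubic Newton-Schulz iteration."""
    # Normalize by the Frobenius norm so all singular values are at most 1,
    # which keeps the iteration inside its convergence region.
    norm = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / norm for v in row] for row in g]
    for _ in range(ns_steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(ra, rb)] for ra, rb in zip(x, xxt_x)]
    return x


def orthogonality_error(x):
    """Frobenius norm of (X Xᵀ − I); zero for an exactly orthogonal matrix."""
    xxt = matmul(x, transpose(x))
    n = len(xxt)
    return math.sqrt(
        sum((xxt[i][j] - (1.0 if i == j else 0.0)) ** 2 for i in range(n) for j in range(n))
    )


# Made-up full-rank matrix standing in for a gradient/momentum buffer.
G = [[2.0, 0.5, -1.0], [0.3, 1.5, 0.2], [-0.7, 0.4, 1.0]]
errors = {k: orthogonality_error(newton_schulz(G, k)) for k in (3, 5, 7)}

if __name__ == "__main__":
    for k, err in errors.items():
        print(f"ns_steps={k}: orthogonality error = {err:.4f}")
```

The error shrinks as `ns_steps` grows, while each extra iteration costs additional matrix multiplies. Whether the cheaper `ns_steps=3` wins overall depends on measured step time, which this sketch does not model.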

Muon Beta2

| ID | beta2 | Priority |
|---|---|---|
| MB-1 | 0.9 | LOW |
| MB-2 | 0.99 | LOW |
| MB-3 | 0.999 | LOW |
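
One way to read these values: if beta2 is the decay rate of an exponential moving average of squared updates (an assumption here, by analogy with Adam-style second-moment estimates), its effective averaging horizon is roughly `1 / (1 - beta2)` steps. The helper name `ema_horizon` is hypothetical.

```python
def ema_horizon(beta2):
    """Approximate number of recent steps an EMA with decay `beta2` averages over."""
    return 1.0 / (1.0 - beta2)


if __name__ == "__main__":
    for run_id, beta2 in [("current", 0.95), ("MB-1", 0.9), ("MB-2", 0.99), ("MB-3", 0.999)]:
        print(f"{run_id}: beta2={beta2} -> ~{round(ema_horizon(beta2))} steps")
```

Under that reading, the sweep spans horizons from about 10 steps (MB-1) to about 1000 steps (MB-3), against about 20 steps for the current setting.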

Pure AdamW Ablation (Diagnostic)

| ID | Change | Priority |
|---|---|---|
| ABL-1 | Replace all Muon with AdamW (keep MATRIX_LR=0.04) | LOW |
| ABL-2 | Pure AdamW with standard LR=0.001 for matrices | LOW |

Purpose: Quantify the gap between the hybrid Muon setup and pure AdamW. If pure AdamW is close, Muon's complexity may not be justified at this scale, and removing it would be a simplification win per the simplicity criterion.
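
The two ablations differ only in how matrix parameters are routed. The sketch below shows that routing under stated assumptions: the function `route_parameters`, the parameter names, and the explicit `matrix_params` set are all hypothetical, and the learning rate for non-matrix parameters is left as `None` because the issue does not state it.

```python
MATRIX_LR = 0.04         # from the issue
STANDARD_ADAMW_LR = 0.001  # from the issue (ABL-2)


def route_parameters(param_names, matrix_params, ablation=None):
    """Map each parameter name to an (optimizer, lr) pair.

    ablation=None    -> hybrid baseline: Muon for matrix params, AdamW otherwise
    ablation="ABL-1" -> AdamW everywhere, matrices keep MATRIX_LR
    ablation="ABL-2" -> AdamW everywhere, matrices use the standard AdamW LR
    """
    if ablation not in (None, "ABL-1", "ABL-2"):
        raise ValueError(f"unknown ablation: {ablation!r}")
    routing = {}
    for name in param_names:
        if name not in matrix_params:
            routing[name] = ("adamw", None)  # LR not specified in the issue
        elif ablation is None:
            routing[name] = ("muon", MATRIX_LR)
        elif ablation == "ABL-1":
            routing[name] = ("adamw", MATRIX_LR)
        else:
            routing[name] = ("adamw", STANDARD_ADAMW_LR)
    return routing


if __name__ == "__main__":
    names = ["attn.qkv.weight", "mlp.fc.weight", "embed.weight", "norm.scale"]
    matrices = {"attn.qkv.weight", "mlp.fc.weight"}
    for ablation in (None, "ABL-1", "ABL-2"):
        print(ablation, route_parameters(names, matrices, ablation))
```

Routing through a single function keeps the ablation to a one-argument change, so the baseline and both AdamW runs share everything else.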

Execution Order

  1. NS-1 (potential speed win)
  2. MM-5 (simplification test)
  3. MM-1, MM-3 (schedule variants)
  4. MB-2 (beta2 check)
  5. ABL-1 (diagnostic, run last)

🤖 Generated with Claude Code


Labels: experiment (hyperparameter or architecture experiment), priority: medium (medium impact), size: M (medium, 5-15 experiments or 1-3 hours)
