forked from karpathy/autoresearch
Experiment: Muon optimizer tuning — momentum, Newton-Schulz, ablation #7
Open
Labels: experiment (hyperparameter or architecture experiment), priority: medium (medium impact), size: M (medium: 5-15 experiments or 1-3 hours)
Description
Objective
Tune Muon-specific parameters and ablate the hybrid optimizer design.
Current Muon Config
| Parameter | Value | Location / Role |
|---|---|---|
| Momentum | 0.85→0.95 over 300 steps | get_muon_momentum() |
| ns_steps | 5 | Newton-Schulz iterations |
| beta2 | 0.95 | NorMuon variance reduction |
Momentum Schedule Experiments
| ID | Change | Priority |
|---|---|---|
| MM-1 | Start 0.9 (narrower ramp 0.9→0.95) | MEDIUM |
| MM-2 | End 0.99 (wider ramp 0.85→0.99) | MEDIUM |
| MM-3 | Faster ramp (150 steps instead of 300) | MEDIUM |
| MM-4 | Slower ramp (600 steps) | LOW |
| MM-5 | Constant 0.95 (no ramp) | MEDIUM |
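The schedule variants above can be expressed as one parameterized function. This is a minimal sketch, assuming the ramp is linear in the step count; the issue only states "0.85→0.95 over 300 steps", so the ramp shape, the function name `muon_momentum`, and the `VARIANTS` table are illustrative assumptions rather than the repository's actual `get_muon_momentum()`.

```python
def muon_momentum(step, start=0.85, end=0.95, ramp_steps=300):
    """Ramp momentum from `start` to `end` over `ramp_steps`, then hold.

    Assumption: the ramp is linear. The real get_muon_momentum() may differ.
    """
    if ramp_steps <= 0:
        return end
    frac = min(step / ramp_steps, 1.0)
    return (1.0 - frac) * start + frac * end


# Hypothetical encoding of the experiments in the table above.
VARIANTS = {
    "baseline": dict(start=0.85, end=0.95, ramp_steps=300),
    "MM-1": dict(start=0.90, end=0.95, ramp_steps=300),  # narrower ramp
    "MM-2": dict(start=0.85, end=0.99, ramp_steps=300),  # wider ramp
    "MM-3": dict(start=0.85, end=0.95, ramp_steps=150),  # faster ramp
    "MM-4": dict(start=0.85, end=0.95, ramp_steps=600),  # slower ramp
    "MM-5": dict(start=0.95, end=0.95, ramp_steps=300),  # constant, no ramp
}

if __name__ == "__main__":
    for name, kwargs in VARIANTS.items():
        samples = [round(muon_momentum(s, **kwargs), 4) for s in (0, 150, 300, 600)]
        print(name, samples)
```

Writing MM-5 as `start == end` keeps every variant on the same code path, so a run differs from the baseline only in its three schedule parameters.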
Newton-Schulz Iterations
| ID | ns_steps | Priority | Rationale |
|---|---|---|---|
| NS-1 | 3 | MEDIUM | Faster per step → more training steps in 5 min |
| NS-2 | 7 | LOW | Better orthogonalization but slower |
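To illustrate what `ns_steps` trades off, here is a small pure-Python sketch of a Newton-Schulz orthogonalization. It deliberately uses the textbook cubic iteration X ← 1.5·X − 0.5·(X Xᵀ)·X, not the tuned polynomial that Muon implementations typically use, so the coefficients, helper names, and example matrix `G` are assumptions for illustration only. It shows that the orthogonality error shrinks as the iteration count grows.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz(g, ns_steps):
    """Approximately orthogonalize g (m x n, m <= n, full row rank).

    Assumption: textbook cubic iteration; the repository's Muon may use
    different (tuned) coefficients.
    """
    # Frobenius normalization puts all singular values in (0, 1],
    # which is inside the convergence region of the cubic iteration.
    norm = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / norm for v in row] for row in g]
    for _ in range(ns_steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * xv - 0.5 * av for xv, av in zip(xr, ar)]
             for xr, ar in zip(x, xxt_x)]
    return x


def orthogonality_error(x):
    """Frobenius norm of (X X^T - I); zero for exactly orthonormal rows."""
    gram = matmul(x, transpose(x))
    m = len(gram)
    return math.sqrt(sum((gram[i][j] - (1.0 if i == j else 0.0)) ** 2
                         for i in range(m) for j in range(m)))


# Hypothetical 2 x 3 "gradient" used only for demonstration.
G = [[3.0, 1.0, 0.0], [1.0, 2.0, 1.0]]

if __name__ == "__main__":
    for steps in (0, 3, 5, 7):
        print(steps, orthogonality_error(newton_schulz(G, steps)))
```

Because the per-iteration cost is a fixed number of matrix multiplications, the accuracy gained by extra iterations has to be weighed against the training steps lost to them.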
Key insight: ns_steps=3 might win on wall-clock time. Each optimizer step is cheaper, so more steps fit into the fixed 5-minute budget, even if each step's orthogonalization is slightly less accurate.
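A rough way to reason about the potential speed win is the back-of-the-envelope model below. It assumes Newton-Schulz cost scales linearly with `ns_steps` and that it accounts for a fraction `ns_fraction` of the baseline step time; both the linear-cost assumption and the `ns_fraction` values are hypothetical and would need to be measured by profiling.

```python
def relative_step_count(ns_steps, ns_fraction, baseline_ns_steps=5):
    """Training steps that fit in a fixed time budget, relative to baseline.

    ns_fraction: assumed share of the baseline step time spent in
    Newton-Schulz (hypothetical; measure it before relying on this).
    """
    if not 0.0 <= ns_fraction <= 1.0:
        raise ValueError("ns_fraction must be in [0, 1]")
    # The non-NS part of the step is unchanged; the NS part scales linearly.
    relative_step_time = (1.0 - ns_fraction) + ns_fraction * ns_steps / baseline_ns_steps
    return 1.0 / relative_step_time


if __name__ == "__main__":
    for frac in (0.05, 0.2, 0.5):
        print(frac, relative_step_count(3, frac), relative_step_count(7, frac))
```

Under this model, if Newton-Schulz took 20% of the step time, NS-1 would fit about 1.087× as many steps in the same budget. If its share is only a few percent, the gain is small and the accuracy loss is more likely to dominate.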
Muon Beta2
| ID | beta2 | Priority |
|---|---|---|
| MB-1 | 0.9 | LOW |
| MB-2 | 0.99 | LOW |
| MB-3 | 0.999 | LOW |
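To put the beta2 candidates on a common scale, the sketch below converts each value into an approximate averaging horizon. It assumes beta2 is the decay of an exponential moving average over a second-moment estimate, as in Adam; the issue only describes it as "NorMuon variance reduction", so that reading is an assumption, and the function names are illustrative.

```python
def ema_horizon(beta2):
    """Approximate number of recent steps an EMA with decay beta2 averages over."""
    if not 0.0 <= beta2 < 1.0:
        raise ValueError("beta2 must be in [0, 1)")
    return 1.0 / (1.0 - beta2)


def weight_after(beta2, steps):
    """Relative weight an observation still carries `steps` updates later."""
    return beta2 ** steps


if __name__ == "__main__":
    for name, beta2 in [("MB-1", 0.9), ("current", 0.95), ("MB-2", 0.99), ("MB-3", 0.999)]:
        print(name, beta2, round(ema_horizon(beta2)), weight_after(beta2, 300))
```

The horizons come out at roughly 10, 20, 100, and 1000 steps. In a short run, a 1000-step horizon (MB-3) may cover a large share of all training steps, which is worth keeping in mind when interpreting that experiment.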
Pure AdamW Ablation (Diagnostic)
| ID | Change | Priority |
|---|---|---|
| ABL-1 | Replace all Muon with AdamW (keep MATRIX_LR=0.04) | LOW |
| ABL-2 | Pure AdamW with standard LR=0.001 for matrices | LOW |
Purpose: Quantify the gap between the hybrid Muon setup and pure AdamW. If pure AdamW is close, the Muon complexity may not be justified at this scale, and removing it is a simplification win per the simplicity criterion.
Execution Order
1. NS-1 (potential speed win)
2. MM-5 (simplification test)
3. MM-1, MM-3 (schedule variants)
4. MB-2 (beta2 check)
5. ABL-1 (diagnostic, run last)
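The execution order could be encoded as a small run queue, where each experiment is a set of overrides on top of the current configuration. This is only a sketch: the config key names (`momentum_start`, `matrix_optimizer`, and so on), the `Experiment` class, and `resolve()` are hypothetical and do not correspond to identifiers confirmed in the repository; the values come from the tables in this issue.

```python
from dataclasses import dataclass, field

# Baseline values taken from the "Current Muon Config" table and ABL-1;
# the key names themselves are hypothetical.
BASELINE = {
    "ns_steps": 5,
    "momentum_start": 0.85,
    "momentum_end": 0.95,
    "momentum_ramp_steps": 300,
    "beta2": 0.95,
    "matrix_optimizer": "muon",
    "matrix_lr": 0.04,
}


@dataclass
class Experiment:
    exp_id: str
    overrides: dict = field(default_factory=dict)
    note: str = ""


# Same order as the list above.
QUEUE = [
    Experiment("NS-1", {"ns_steps": 3}, "potential speed win"),
    Experiment("MM-5", {"momentum_start": 0.95, "momentum_end": 0.95}, "simplification test"),
    Experiment("MM-1", {"momentum_start": 0.90}, "schedule variant"),
    Experiment("MM-3", {"momentum_ramp_steps": 150}, "schedule variant"),
    Experiment("MB-2", {"beta2": 0.99}, "beta2 check"),
    Experiment("ABL-1", {"matrix_optimizer": "adamw"}, "diagnostic, run last"),
]


def resolve(exp):
    """Return the full config for one experiment: baseline plus overrides."""
    unknown = set(exp.overrides) - set(BASELINE)
    if unknown:
        raise KeyError(f"unknown config keys: {sorted(unknown)}")
    return {**BASELINE, **exp.overrides}


if __name__ == "__main__":
    for exp in QUEUE:
        print(exp.exp_id, resolve(exp))
```

Keeping each experiment as a single-purpose override makes it easy to confirm that a run differs from the baseline in exactly one respect.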
🤖 Generated with Claude Code