Experiment: Optimizer — LR sweeps, warmup, warmdown, betas #6
Description
Objective
Systematically tune the hybrid MuonAdamW optimizer's learning rates, schedules, and hyperparameters.
Current Baseline
| Parameter | Value | Notes |
|---|---|---|
| `EMBEDDING_LR` | 0.6 | Token embeddings (AdamW) |
| `UNEMBEDDING_LR` | 0.004 | lm_head (AdamW) |
| `MATRIX_LR` | 0.04 | Transformer matrices (Muon) |
| `SCALAR_LR` | 0.5 | Per-layer scalars (AdamW) |
| `ADAM_BETAS` | (0.8, 0.95) | Low beta1 is unusual |
| `WEIGHT_DECAY` | 0.2 | Cautious, Muon only |
| `WARMUP_RATIO` | 0.0 | No warmup! |
| `WARMDOWN_RATIO` | 0.5 | 50% of training is cooldown |
| `FINAL_LR_FRAC` | 0.0 | Decays to zero |
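To make the three schedule knobs concrete, here is a minimal sketch of how `WARMUP_RATIO`, `WARMDOWN_RATIO`, and `FINAL_LR_FRAC` could combine into a per-step LR multiplier. It assumes a trapezoidal schedule (linear warmup, flat plateau, linear warmdown); the function name `lr_multiplier` and the linear shapes are assumptions for illustration, not necessarily what the training script does.

```python
def lr_multiplier(progress, warmup_ratio=0.0, warmdown_ratio=0.5, final_lr_frac=0.0):
    """Return the factor applied to each base LR at a given training progress.

    progress is the fraction of training completed, in [0, 1].
    Assumed shape: linear warmup, constant plateau, linear warmdown.
    """
    # Linear ramp from 0 to 1 over the first warmup_ratio of training.
    if warmup_ratio > 0 and progress < warmup_ratio:
        return progress / warmup_ratio
    # Plateau until the warmdown window begins (or forever if there is none).
    if warmdown_ratio <= 0 or progress < 1.0 - warmdown_ratio:
        return 1.0
    # Linear interpolation from 1.0 down to final_lr_frac.
    remaining = (1.0 - progress) / warmdown_ratio
    return remaining * 1.0 + (1.0 - remaining) * final_lr_frac
```

Under these assumptions, the baseline values give a multiplier of 1.0 for the first half of training and a straight line down to zero over the second half.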
Phase 1: Warmup Introduction (HIGH priority) ⭐
Zero warmup is the most surprising config choice. Most transformers benefit from warmup.
| ID | WARMUP_RATIO | Priority |
|---|---|---|
| W-1 | 0.01 | HIGH |
| W-2 | 0.05 | HIGH (try first) |
| W-3 | 0.1 | HIGH |
Run W-2 first, then bracket with W-1 and W-3.
Phase 2: Matrix LR Sweep (HIGH priority)
Muon's matrix LR governs the bulk of the parameters, so it is likely the highest-leverage single knob.
| ID | MATRIX_LR | Priority |
|---|---|---|
| MLR-1 | 0.02 | HIGH |
| MLR-2 | 0.03 | HIGH |
| MLR-3 | 0.05 | HIGH (try first) |
| MLR-4 | 0.06 | MEDIUM |
| MLR-5 | 0.08 | MEDIUM |
Short runs may prefer slightly higher LR. Start with MLR-3 and MLR-1 to bracket.
Phase 3: Embedding LR Sweep (HIGH priority)
0.6 is very high for embeddings. May be over-aggressive.
| ID | EMBEDDING_LR | Priority |
|---|---|---|
| ELR-1 | 0.3 | HIGH (try first) |
| ELR-2 | 0.45 | MEDIUM |
| ELR-3 | 0.8 | MEDIUM |
| ELR-4 | 0.15 | MEDIUM |
Phase 4: Warmdown + Betas (MEDIUM priority)
50% warmdown is aggressive — half the training is in cooldown.
| ID | Change | Priority |
|---|---|---|
| WDN-1 | `WARMDOWN_RATIO = 0.3` | MEDIUM (try first) |
| WDN-2 | `WARMDOWN_RATIO = 0.7` | MEDIUM |
| WDN-3 | `WARMDOWN_RATIO = 0.15` | MEDIUM |
| AB-1 | `ADAM_BETAS = (0.9, 0.95)` | MEDIUM |
| AB-2 | `ADAM_BETAS = (0.9, 0.999)` | MEDIUM |
Phase 5: Weight Decay (MEDIUM priority)
| ID | WEIGHT_DECAY | Priority |
|---|---|---|
| WD-1 | 0.1 | MEDIUM |
| WD-2 | 0.3 | MEDIUM |
| WD-3 | 0.0 (no WD) | MEDIUM |
Also test weight-decay schedule variants: constant WD (remove the `(1 - progress)` decay) and cosine decay.
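The weight-decay schedule variants can be sketched as a single multiplier on `WEIGHT_DECAY`. The `linear` branch mirrors the `(1 - progress)` decay mentioned above; the half-cosine form of the `cosine` branch and the name `wd_multiplier` are assumptions for illustration.

```python
import math


def wd_multiplier(progress, schedule="linear"):
    """Return the factor applied to WEIGHT_DECAY at a given training progress.

    progress is the fraction of training completed, in [0, 1].
    """
    if schedule == "linear":
        # Current behavior per the issue: decay proportional to (1 - progress).
        return 1.0 - progress
    if schedule == "constant":
        # Variant: no decay of the weight-decay coefficient.
        return 1.0
    if schedule == "cosine":
        # Variant (assumed form): half-cosine from 1.0 at start to 0.0 at end.
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    raise ValueError(f"unknown schedule: {schedule!r}")
```

The cosine variant keeps weight decay near its full value for longer early in training than the linear one, then drops faster through the middle.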
Phase 6: Low Priority
| ID | Change | Priority |
|---|---|---|
| ULR-1 | `UNEMBEDDING_LR = 0.008` | LOW |
| SLR-1 | `SCALAR_LR = 0.3` | LOW |
| FLR-1 | `FINAL_LR_FRAC = 0.05` | LOW |
Execution Strategy
~30 experiments total at 5 min each = ~2.5 hours. Run phase by phase — each phase's winner becomes the new baseline for subsequent phases.
After all phases, combine ALL winners into a single run to measure cumulative improvement.
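The phase-by-phase strategy amounts to a greedy, one-phase-at-a-time search. A minimal sketch follows, assuming a hypothetical `evaluate(config)` callable that launches one 5-minute run and returns a lower-is-better metric such as validation loss. Neither `evaluate` nor `run_phases` exists in the repo; both are placeholders.

```python
def run_phases(baseline, phases, evaluate):
    """Greedy sweep: each phase's winner becomes the baseline for the next.

    baseline: dict of hyperparameter name -> value.
    phases:   list of (phase_name, [override_dict, ...]) in execution order.
    evaluate: hypothetical callable, config dict -> metric (lower is better).
    Returns the final config and its metric.
    """
    best_cfg = dict(baseline)
    best_score = evaluate(best_cfg)
    for _phase_name, candidates in phases:
        phase_cfg, phase_score = best_cfg, best_score
        for overrides in candidates:
            # Each candidate changes only its own knobs on top of the
            # current baseline.
            cfg = {**best_cfg, **overrides}
            score = evaluate(cfg)
            if score < phase_score:
                phase_cfg, phase_score = cfg, score
        # Keep the baseline unless a candidate strictly improved on it.
        best_cfg, best_score = phase_cfg, phase_score
    return best_cfg, best_score
```

Because each winner is folded into the baseline as the sweep proceeds, the returned config is already the combination of all winners. Comparing its score against the original baseline gives the cumulative improvement, subject to run-to-run noise.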
Interaction Effects to Watch
- Warmup × Matrix LR: Higher LR may be viable with warmup
- Warmdown × Final LR: If warmdown changes, final LR fraction may matter more
- Muon momentum × Warmup: Both stabilize early training — may be redundant
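A greedy sweep cannot detect interactions like these on its own, since it varies one knob at a time. One way to probe a pair is a small joint grid over both. The helper below is a sketch; `interaction_grid` is a hypothetical name, and the example values in the comment are taken from the Phase 1 and Phase 2 tables.

```python
from itertools import product


def interaction_grid(axis_a, axis_b):
    """Build override dicts for every combination of two hyperparameters.

    Each axis is a (name, values) pair, e.g.
    ("WARMUP_RATIO", [0.0, 0.05]) and ("MATRIX_LR", [0.04, 0.06]).
    """
    (name_a, values_a), (name_b, values_b) = axis_a, axis_b
    return [{name_a: a, name_b: b} for a, b in product(values_a, values_b)]
```

A 2×2 grid costs four runs. If the best cell differs from what the two single-knob sweeps would pick independently, that points to an interaction.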
🤖 Generated with Claude Code