Experiment: Optimizer — LR sweeps, warmup, warmdown, betas #6

@Jason-Adam

Objective

Systematically tune the hybrid MuonAdamW optimizer's learning rates, schedules, and hyperparameters.

Current Baseline

| Parameter | Value | Notes |
| --- | --- | --- |
| EMBEDDING_LR | 0.6 | Token embeddings (AdamW) |
| UNEMBEDDING_LR | 0.004 | lm_head (AdamW) |
| MATRIX_LR | 0.04 | Transformer matrices (Muon) |
| SCALAR_LR | 0.5 | Per-layer scalars (AdamW) |
| ADAM_BETAS | (0.8, 0.95) | Low beta1 is unusual |
| WEIGHT_DECAY | 0.2 | Cautious, Muon only |
| WARMUP_RATIO | 0.0 | No warmup! |
| WARMDOWN_RATIO | 0.5 | 50% of training is cooldown |
| FINAL_LR_FRAC | 0.0 | Decays to zero |
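
One way to keep every experiment a small, explicit diff against this baseline is to hold the values in an immutable config object. The sketch below is illustrative only: the `OptimizerConfig` class and its field names are hypothetical and are not taken from the training script; only the numeric values come from the table above.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class OptimizerConfig:
    """Hypothetical container for the baseline values listed in the table."""
    embedding_lr: float = 0.6
    unembedding_lr: float = 0.004
    matrix_lr: float = 0.04
    scalar_lr: float = 0.5
    adam_betas: Tuple[float, float] = (0.8, 0.95)
    weight_decay: float = 0.2
    warmup_ratio: float = 0.0
    warmdown_ratio: float = 0.5
    final_lr_frac: float = 0.0


BASELINE = OptimizerConfig()

# Each experiment overrides exactly one field; the baseline itself is never mutated.
w2 = replace(BASELINE, warmup_ratio=0.05)   # experiment W-2
mlr3 = replace(BASELINE, matrix_lr=0.05)    # experiment MLR-3
```

Because the dataclass is frozen, an accidental in-place edit of the baseline raises an error instead of silently contaminating later runs.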

Phase 1: Warmup Introduction (HIGH priority) ⭐

Zero warmup is the most surprising config choice. Most transformers benefit from warmup.

| ID | WARMUP_RATIO | Priority |
| --- | --- | --- |
| W-1 | 0.01 | HIGH |
| W-2 | 0.05 | HIGH (try first) |
| W-3 | 0.1 | HIGH |

Run W-2 first, then bracket with W-1 and W-3.
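
To make the three schedule knobs concrete, here is a minimal sketch of an LR multiplier with linear warmup, a constant plateau, and a linear warmdown to `FINAL_LR_FRAC`. It assumes the ratios are fractions of total training steps and that both ramps are linear; the actual training script may define them differently, and `lr_multiplier` is a hypothetical name.

```python
def lr_multiplier(step, total_steps, warmup_ratio=0.0,
                  warmdown_ratio=0.5, final_lr_frac=0.0):
    """Scale factor applied to each group's base LR at a given step.

    Assumed shape: linear warmup -> constant -> linear warmdown.
    """
    warmup_steps = round(warmup_ratio * total_steps)
    warmdown_steps = round(warmdown_ratio * total_steps)
    if step < warmup_steps:
        # Linear ramp that reaches 1.0 on the last warmup step.
        return (step + 1) / warmup_steps
    if step <= total_steps - warmdown_steps:
        # Plateau at the full base LR.
        return 1.0
    # Linear interpolation from 1.0 down to final_lr_frac.
    progress = (total_steps - step) / warmdown_steps
    return progress + (1.0 - progress) * final_lr_frac


# W-2 (WARMUP_RATIO = 0.05) over a hypothetical 1000-step run:
for s in (0, 49, 500, 750, 1000):
    print(s, lr_multiplier(s, 1000, warmup_ratio=0.05))
```

With the baseline `WARMUP_RATIO = 0.0` the first branch is never taken, so training starts at the full base LR on step 0.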

Phase 2: Matrix LR Sweep (HIGH priority)

Muon's matrix LR governs the bulk of parameters. Highest-leverage single knob.

| ID | MATRIX_LR | Priority |
| --- | --- | --- |
| MLR-1 | 0.02 | HIGH |
| MLR-2 | 0.03 | HIGH |
| MLR-3 | 0.05 | HIGH (try first) |
| MLR-4 | 0.06 | MEDIUM |
| MLR-5 | 0.08 | MEDIUM |

Short runs may prefer slightly higher LR. Start with MLR-3 and MLR-1 to bracket.

Phase 3: Embedding LR Sweep (HIGH priority)

0.6 is very high for embeddings. May be over-aggressive.

| ID | EMBEDDING_LR | Priority |
| --- | --- | --- |
| ELR-1 | 0.3 | HIGH (try first) |
| ELR-2 | 0.45 | MEDIUM |
| ELR-3 | 0.8 | MEDIUM |
| ELR-4 | 0.15 | MEDIUM |

Phase 4: Warmdown + Betas (MEDIUM priority)

50% warmdown is aggressive — half the training is in cooldown.

| ID | Change | Priority |
| --- | --- | --- |
| WDN-1 | WARMDOWN_RATIO = 0.3 | MEDIUM (try first) |
| WDN-2 | WARMDOWN_RATIO = 0.7 | MEDIUM |
| WDN-3 | WARMDOWN_RATIO = 0.15 | MEDIUM |
| AB-1 | ADAM_BETAS = (0.9, 0.95) | MEDIUM |
| AB-2 | ADAM_BETAS = (0.9, 0.999) | MEDIUM |

Phase 5: Weight Decay (MEDIUM priority)

| ID | WEIGHT_DECAY | Priority |
| --- | --- | --- |
| WD-1 | 0.1 | MEDIUM |
| WD-2 | 0.3 | MEDIUM |
| WD-3 | 0.0 (no WD) | MEDIUM |

Also test schedule variants: constant WD (remove (1 - progress) decay), cosine decay.
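
The three weight-decay schedules can be compared side by side. This is a sketch under assumptions: `progress` is taken to be `step / total_steps` in `[0, 1]`, the current behavior is assumed to be `base_wd * (1 - progress)` as the note above implies, and the cosine variant is assumed to be a half-cosine from `base_wd` down to zero. `weight_decay_at` is a hypothetical helper, not a function from the training script.

```python
import math


def weight_decay_at(progress, base_wd=0.2, schedule="linear"):
    """Effective weight decay at a training progress in [0, 1]."""
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be in [0, 1]")
    if schedule == "linear":
        # Assumed current behavior: decays linearly to zero.
        return base_wd * (1.0 - progress)
    if schedule == "constant":
        # Variant: drop the (1 - progress) factor.
        return base_wd
    if schedule == "cosine":
        # Variant: half-cosine from base_wd to zero.
        return base_wd * 0.5 * (1.0 + math.cos(math.pi * progress))
    raise ValueError(f"unknown schedule: {schedule!r}")


for name in ("linear", "constant", "cosine"):
    print(name, [round(weight_decay_at(p, schedule=name), 4)
                 for p in (0.0, 0.25, 0.5, 0.75, 1.0)])
```

Relative to the linear schedule, the cosine variant keeps decay higher during the first half of training and lower during the second, while both end at zero.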

Phase 6: Low Priority

| ID | Change | Priority |
| --- | --- | --- |
| ULR-1 | UNEMBEDDING_LR = 0.008 | LOW |
| SLR-1 | SCALAR_LR = 0.3 | LOW |
| FLR-1 | FINAL_LR_FRAC = 0.05 | LOW |

Execution Strategy

~30 experiments total at 5 min each = ~2.5 hours. Run phase by phase — each phase's winner becomes the new baseline for subsequent phases.

After all phases, run the accumulated best configuration against the original baseline to measure the cumulative improvement.
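
The phase-by-phase procedure amounts to a greedy coordinate search, which could be sketched as follows. The `evaluate` callback is hypothetical: it stands in for launching one 5-minute run and returning a lower-is-better metric (assumed here to be a validation loss, which this issue does not specify).

```python
from typing import Callable, Dict, List, Tuple

Config = Dict[str, object]
Phase = Tuple[str, List[Config]]


def greedy_phase_search(baseline: Config,
                        phases: List[Phase],
                        evaluate: Callable[[Config], float]) -> Tuple[Config, float]:
    """Run phases in order; each phase's winner becomes the next baseline."""
    best_cfg = dict(baseline)          # copy so the caller's dict is untouched
    best_loss = evaluate(best_cfg)     # score the starting baseline once
    for _phase_name, candidates in phases:
        phase_cfg, phase_loss = best_cfg, best_loss
        for override in candidates:
            cfg = {**best_cfg, **override}   # one-field diff against the baseline
            loss = evaluate(cfg)
            if loss < phase_loss:            # keep only strict improvements
                phase_cfg, phase_loss = cfg, loss
        best_cfg, best_loss = phase_cfg, phase_loss
    return best_cfg, best_loss
```

If no candidate in a phase beats the incumbent, the baseline carries forward unchanged. With short, noisy runs it may be worth requiring a minimum margin before promoting a winner; that threshold is left out of this sketch.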

Interaction Effects to Watch

  • Warmup × Matrix LR: Higher LR may be viable with warmup
  • Warmdown × Final LR: If warmdown changes, final LR fraction may matter more
  • Muon momentum × Warmup: Both stabilize early training — may be redundant
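
Changing one knob at a time cannot reveal these interactions, so a small full-factorial grid over a suspected pair is one way to check them. The sketch below only generates the configurations; the helper name and the specific values chosen for the Warmup × Matrix LR pair are illustrative assumptions drawn from the sweep tables above.

```python
from itertools import product


def interaction_grid(name_a, values_a, name_b, values_b):
    """All combinations of two hyperparameters, as override dicts."""
    return [{name_a: a, name_b: b} for a, b in product(values_a, values_b)]


# Does warmup make a higher matrix LR viable? 2 x 3 = 6 runs.
grid = interaction_grid("warmup_ratio", [0.0, 0.05],
                        "matrix_lr", [0.04, 0.06, 0.08])
for override in grid:
    print(override)
```

An interaction would show up as the best `matrix_lr` differing between the `warmup_ratio = 0.0` row and the `warmup_ratio = 0.05` row.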

🤖 Generated with Claude Code

Labels: experiment (Hyperparameter or architecture experiment), priority: high (High impact, run first), size: L (Large: 15+ experiments or 3+ hours)