Experiment: Optimizer — LR sweeps, warmup, warmdown, betas

## Objective

Systematically tune the hybrid MuonAdamW optimizer's learning rates, schedules, and hyperparameters.

## Current Baseline

| Parameter | Value | Notes |
|-----------|-------|-------|
| `EMBEDDING_LR` | 0.6 | Token embeddings (AdamW) |
| `UNEMBEDDING_LR` | 0.004 | lm_head (AdamW) |
| `MATRIX_LR` | 0.04 | Transformer matrices (Muon) |
| `SCALAR_LR` | 0.5 | Per-layer scalars (AdamW) |
| `ADAM_BETAS` | (0.8, 0.95) | Low beta1 is unusual |
| `WEIGHT_DECAY` | 0.2 | Cautious weight decay, applied to Muon parameters only |
| `WARMUP_RATIO` | 0.0 | **No warmup!** |
| `WARMDOWN_RATIO` | 0.5 | 50% of training is cooldown |
| `FINAL_LR_FRAC` | 0.0 | Decays to zero |

## Phase 1: Warmup Introduction (HIGH priority) ⭐

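Zero warmup is the most surprising choice in the baseline config. Most transformer training recipes ramp the LR up over the first few percent of steps, so a short warmup is the first thing to test.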

| ID | WARMUP_RATIO | Priority |
|----|-------------|----------|
| W-1 | 0.01 | HIGH |
| W-2 | 0.05 | HIGH (try first) |
| W-3 | 0.1 | HIGH |

**Run W-2 first**, then bracket with W-1 and W-3.
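
To make the three schedule knobs concrete, here is a minimal sketch of a trapezoidal LR multiplier driven by `WARMUP_RATIO`, `WARMDOWN_RATIO`, and `FINAL_LR_FRAC`. The function name `lr_multiplier` and the convention that `progress` is the fraction of training completed (0 to 1) are assumptions for illustration; the schedule in the actual training script may differ in detail.

```python
def lr_multiplier(progress, warmup_ratio=0.05, warmdown_ratio=0.5, final_lr_frac=0.0):
    """Multiplier applied to every base LR; `progress` is in [0, 1] (assumed convention)."""
    # Linear ramp from 0 to 1 over the warmup fraction (skipped when warmup_ratio == 0).
    if warmup_ratio > 0 and progress < warmup_ratio:
        return progress / warmup_ratio
    # Flat plateau until the warmdown begins.
    if progress < 1.0 - warmdown_ratio:
        return 1.0
    # Linear decay from 1 down to final_lr_frac over the warmdown fraction.
    remaining = (1.0 - progress) / warmdown_ratio if warmdown_ratio > 0 else 0.0
    return remaining + (1.0 - remaining) * final_lr_frac


if __name__ == "__main__":
    # W-2 (5% warmup) with the baseline 50% warmdown, sampled at a few points.
    for p in (0.0, 0.025, 0.05, 0.5, 0.75, 1.0):
        print(f"progress={p:.3f} -> multiplier={lr_multiplier(p):.3f}")
```

With `warmup_ratio=0.0` the multiplier starts at 1.0 on the very first step, which is the baseline behaviour that W-1 through W-3 are meant to challenge.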

## Phase 2: Matrix LR Sweep (HIGH priority)

`MATRIX_LR` sets Muon's step size for the transformer matrices, which hold the bulk of the parameters, so it is likely the highest-leverage single knob.

| ID | MATRIX_LR | Priority |
|----|-----------|----------|
| MLR-1 | 0.02 | HIGH |
| MLR-2 | 0.03 | HIGH |
| MLR-3 | 0.05 | HIGH (try first) |
| MLR-4 | 0.06 | MEDIUM |
| MLR-5 | 0.08 | MEDIUM |

Short runs may prefer a slightly higher LR. Start with MLR-3 (0.05) and MLR-1 (0.02) to bracket the baseline of 0.04.

## Phase 3: Embedding LR Sweep (HIGH priority)

An `EMBEDDING_LR` of 0.6 is very high for embeddings and may be over-aggressive.

| ID | EMBEDDING_LR | Priority |
|----|-------------|----------|
| ELR-1 | 0.3 | HIGH (try first) |
| ELR-2 | 0.45 | MEDIUM |
| ELR-3 | 0.8 | MEDIUM |
| ELR-4 | 0.15 | MEDIUM |

## Phase 4: Warmdown + Betas (MEDIUM priority)

A 50% warmdown is aggressive: half of training is spent in cooldown.

| ID | Change | Priority |
|----|--------|----------|
| WDN-1 | `WARMDOWN_RATIO = 0.3` | MEDIUM (try first) |
| WDN-2 | `WARMDOWN_RATIO = 0.7` | MEDIUM |
| WDN-3 | `WARMDOWN_RATIO = 0.15` | MEDIUM |
| AB-1 | `ADAM_BETAS = (0.9, 0.95)` | MEDIUM |
| AB-2 | `ADAM_BETAS = (0.9, 0.999)` | MEDIUM |
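
One way to read the beta variants is through the common rule of thumb that an exponential moving average with decay `beta` averages over roughly `1 / (1 - beta)` recent steps. The sketch below applies that approximation to the baseline and the two AB settings; the helper name `ema_horizon` is made up for this example.

```python
def ema_horizon(beta):
    """Approximate number of recent steps an EMA with decay `beta` averages over."""
    return 1.0 / (1.0 - beta)


# Baseline plus the two Phase 4 beta variants from the table above.
SETTINGS = {
    "baseline": (0.8, 0.95),
    "AB-1": (0.9, 0.95),
    "AB-2": (0.9, 0.999),
}

if __name__ == "__main__":
    for name, (beta1, beta2) in SETTINGS.items():
        print(f"{name}: beta1 ~{ema_horizon(beta1):.0f} steps, "
              f"beta2 ~{ema_horizon(beta2):.0f} steps")
```

Under this approximation the baseline's first-moment average spans about 5 steps versus about 10 for AB-1 and AB-2, and AB-2 stretches the second-moment average from about 20 steps to about 1000. Whether that longer horizon helps or hurts will depend on how many optimizer steps fit into a 5-minute run.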

## Phase 5: Weight Decay (MEDIUM priority)

| ID | WEIGHT_DECAY | Priority |
|----|-------------|----------|
| WD-1 | 0.1 | MEDIUM |
| WD-2 | 0.3 | MEDIUM |
| WD-3 | 0.0 (no WD) | MEDIUM |

Also test weight-decay schedule variants: constant WD (remove the `(1 - progress)` decay) and cosine decay.
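
The three weight-decay schedules can be compared side by side with a small helper. This is a sketch only: `wd_multiplier` and its `schedule` argument are hypothetical names, and it assumes the current code scales `WEIGHT_DECAY` by `(1 - progress)` with `progress` running from 0 to 1.

```python
import math


def wd_multiplier(progress, schedule="linear"):
    """Factor applied to WEIGHT_DECAY at a given training progress in [0, 1]."""
    if schedule == "linear":
        # Assumed current behaviour: decays linearly to zero.
        return 1.0 - progress
    if schedule == "constant":
        # Variant: drop the (1 - progress) factor entirely.
        return 1.0
    if schedule == "cosine":
        # Variant: half-cosine from 1 at the start to 0 at the end.
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    raise ValueError(f"unknown schedule: {schedule!r}")


if __name__ == "__main__":
    for p in (0.0, 0.25, 0.5, 0.75, 1.0):
        row = ", ".join(f"{s}={wd_multiplier(p, s):.3f}"
                        for s in ("linear", "constant", "cosine"))
        print(f"progress={p:.2f}: {row}")
```

Compared with the linear schedule, the cosine variant keeps weight decay higher during the first half of training and lower during the second half, while both end at zero.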

## Phase 6: Low Priority

| ID | Change | Priority |
|----|--------|----------|
| ULR-1 | `UNEMBEDDING_LR = 0.008` | LOW |
| SLR-1 | `SCALAR_LR = 0.3` | LOW |
| FLR-1 | `FINAL_LR_FRAC = 0.05` | LOW |

## Execution Strategy

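The tables above list 23 runs; adding the two weight-decay schedule variants and the final combined run gives 26. Budgeting ~30 experiments at 5 min each comes to ~2.5 hours. Run phase by phase: each phase's winner becomes the new baseline for subsequent phases.

After all phases, combine all winners into a single run and compare it against the original baseline to measure cumulative improvement.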
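
The phase-by-phase procedure amounts to a greedy search over config overrides. The sketch below is one way to organise it. `run_experiment` is a hypothetical callable that trains with a config dict and returns a metric where lower is better (for example a validation loss). Only the first two phases are spelled out; the rest would follow the same pattern.

```python
BASELINE = {
    "EMBEDDING_LR": 0.6, "UNEMBEDDING_LR": 0.004, "MATRIX_LR": 0.04,
    "SCALAR_LR": 0.5, "ADAM_BETAS": (0.8, 0.95), "WEIGHT_DECAY": 0.2,
    "WARMUP_RATIO": 0.0, "WARMDOWN_RATIO": 0.5, "FINAL_LR_FRAC": 0.0,
}

# Each phase is (name, list of overrides), ordered so "try first" runs come first.
PHASES = [
    ("warmup", [{"WARMUP_RATIO": 0.05}, {"WARMUP_RATIO": 0.01}, {"WARMUP_RATIO": 0.1}]),
    ("matrix_lr", [{"MATRIX_LR": lr} for lr in (0.05, 0.02, 0.03, 0.06, 0.08)]),
]


def run_phases(baseline, phases, run_experiment):
    """Greedy sweep: the best config of each phase seeds the next phase."""
    best_cfg = dict(baseline)
    best_metric = run_experiment(best_cfg)
    for _name, candidates in phases:
        phase_base = dict(best_cfg)  # every candidate in a phase starts from the same base
        for override in candidates:
            cfg = {**phase_base, **override}
            metric = run_experiment(cfg)
            if metric < best_metric:  # assumption: lower is better
                best_cfg, best_metric = cfg, metric
    return best_cfg, best_metric
```

Because each phase starts from the previous winner, the config returned at the end already stacks every accepted change. Comparing its metric with the original baseline's gives the cumulative improvement.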

## Interaction Effects to Watch

- **Warmup × Matrix LR**: Higher LR may be viable with warmup
- **Warmdown × Final LR**: If warmdown changes, final LR fraction may matter more
- **Muon momentum × Warmup**: Both stabilize early training — may be redundant
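
An interaction such as Warmup × Matrix LR cannot be seen from one-knob-at-a-time sweeps, so a small 2×2 grid is one way to check it. The sketch below builds such a grid and computes a simple interaction term. The helper names are hypothetical, and the metric is again assumed to be lower-is-better.

```python
from itertools import product


def factorial_grid(warmup_ratios, matrix_lrs):
    """All combinations of the two knobs, as config overrides."""
    return [{"WARMUP_RATIO": w, "MATRIX_LR": lr}
            for w, lr in product(warmup_ratios, matrix_lrs)]


def interaction_effect(metric):
    """Interaction term for a 2x2 grid.

    `metric` maps (warmup_ratio, matrix_lr) -> measured value.
    Returns (effect of raising LR with warmup) - (effect of raising LR without).
    """
    w0, w1 = sorted({key[0] for key in metric})
    lr0, lr1 = sorted({key[1] for key in metric})
    effect_with_warmup = metric[(w1, lr1)] - metric[(w1, lr0)]
    effect_without_warmup = metric[(w0, lr1)] - metric[(w0, lr0)]
    return effect_with_warmup - effect_without_warmup


if __name__ == "__main__":
    # Baseline values crossed with W-2 and MLR-4 from the tables above.
    for cfg in factorial_grid([0.0, 0.05], [0.04, 0.06]):
        print(cfg)
```

A clearly negative interaction term would suggest that a higher `MATRIX_LR` is more viable with warmup than without it; a value near zero would suggest the two knobs can be tuned independently.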

---

🤖 Generated with [Claude Code](https://claude.com/claude-code)
