Experiment: Muon optimizer tuning — momentum, Newton-Schulz, ablation

## Objective

Tune Muon-specific parameters and ablate the hybrid optimizer design.

## Current Muon Config

| Parameter | Value | Location / notes |
|-----------|-------|------------------|
| Momentum | 0.85→0.95 over 300 steps | Set by `get_muon_momentum()` |
| ns_steps | 5 | Number of Newton-Schulz iterations |
| beta2 | 0.95 | NorMuon variance reduction |

## Momentum Schedule Experiments

| ID | Change | Priority |
|----|--------|----------|
| MM-1 | Start 0.9 (narrower ramp 0.9→0.95) | MEDIUM |
| MM-2 | End 0.99 (wider ramp 0.85→0.99) | MEDIUM |
| MM-3 | Faster ramp (150 steps instead of 300) | MEDIUM |
| MM-4 | Slower ramp (600 steps) | LOW |
| MM-5 | Constant 0.95 (no ramp) | MEDIUM |
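
The variants above differ only in the ramp's start value, end value, and length, so they can be expressed as overrides on one schedule function. The sketch below assumes a linear ramp that holds at the end value; the issue does not state the ramp shape, and `muon_momentum` and `VARIANTS` are hypothetical names, not the actual `get_muon_momentum()` implementation.

```python
def muon_momentum(step, start=0.85, end=0.95, ramp_steps=300):
    """Ramp momentum linearly from `start` to `end`, then hold at `end`.

    Assumption: the ramp is linear. Defaults mirror the current config.
    """
    if ramp_steps <= 0:
        return end
    frac = min(step / ramp_steps, 1.0)  # fraction of the ramp completed
    return (1.0 - frac) * start + frac * end


# Hypothetical override table: one entry per schedule experiment.
VARIANTS = {
    "baseline": {},
    "MM-1": {"start": 0.90},       # narrower ramp 0.9 -> 0.95
    "MM-2": {"end": 0.99},         # wider ramp 0.85 -> 0.99
    "MM-3": {"ramp_steps": 150},   # faster ramp
    "MM-4": {"ramp_steps": 600},   # slower ramp
    "MM-5": {"start": 0.95},       # constant 0.95 (start == end)
}

for name, overrides in VARIANTS.items():
    samples = [round(muon_momentum(s, **overrides), 4) for s in (0, 150, 300, 600)]
    print(name, samples)
```

Keeping every variant as a small override on the baseline makes it easy to confirm that each run changes exactly one schedule parameter.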

## Newton-Schulz Iterations

| ID | ns_steps | Priority | Rationale |
|----|----------|----------|-----------|
| NS-1 | 3 | MEDIUM | Faster per step → more training steps in 5 min |
| NS-2 | 7 | LOW | Better orthogonalization but slower |

**Hypothesis**: `ns_steps=3` may win on wall-clock. Each optimizer step orthogonalizes the update less accurately, but cheaper steps mean more of them fit into the 5-minute run.
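
To make the accuracy side of this trade-off concrete, the sketch below measures how far the result is from orthogonal after 3, 5, and 7 iterations. It deliberately uses the classical cubic Newton-Schulz update, `X <- 1.5 X - 0.5 (X X^T) X`, rather than the tuned polynomial that Muon implementations typically use, so the absolute numbers will not match the real optimizer. The helper names and the random 4x8 "gradient" are illustrative assumptions.

```python
import math
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(row) for row in zip(*a)]


def newton_schulz(g, ns_steps):
    """Approximately orthogonalize g with the classical cubic iteration."""
    # Frobenius-normalize so every singular value is at most 1,
    # which keeps the iteration inside its convergence region.
    norm = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / norm for v in row] for row in g]
    for _ in range(ns_steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(rx, rb)] for rx, rb in zip(x, xxt_x)]
    return x


def orthogonality_error(x):
    """Frobenius norm of X X^T - I (zero when the rows are orthonormal)."""
    gram = matmul(x, transpose(x))
    n = len(gram)
    return math.sqrt(sum((gram[i][j] - (1.0 if i == j else 0.0)) ** 2
                         for i in range(n) for j in range(n)))


random.seed(0)
g = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]  # stand-in gradient

errors = {k: orthogonality_error(newton_schulz(g, k)) for k in (3, 5, 7)}
for k, err in errors.items():
    print(f"ns_steps={k}: ||X X^T - I||_F = {err:.6f}")
```

The error shrinks with every added iteration, so NS-1 and NS-2 come down to whether that extra accuracy is worth the extra matrix multiplications per step. Only timed training runs can answer that.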

## Muon Beta2

| ID | beta2 | Priority |
|----|-------|----------|
| MB-1 | 0.9 | LOW |
| MB-2 | 0.99 | LOW |
| MB-3 | 0.999 | LOW |
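
One way to read these values, assuming `beta2` is the decay rate of an exponential moving average over squared updates (as in Adam-style second-moment estimates), is through the approximate averaging horizon `1 / (1 - beta2)`. The helper below is a hypothetical illustration of that arithmetic, not part of the optimizer.

```python
def ema_horizon(beta2):
    """Approximate number of recent steps an EMA with decay `beta2` averages over."""
    return 1.0 / (1.0 - beta2)


for name, beta2 in [("MB-1", 0.9), ("baseline", 0.95), ("MB-2", 0.99), ("MB-3", 0.999)]:
    print(f"{name}: beta2={beta2} -> roughly {round(ema_horizon(beta2))} steps")
```

Under that assumption, MB-3 averages over a horizon about 50 times longer than the baseline's, which matters if the run is only a few thousand steps long.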

## Pure AdamW Ablation (Diagnostic)

| ID | Change | Priority |
|----|--------|----------|
| ABL-1 | Replace all Muon with AdamW (keep MATRIX_LR=0.04) | LOW |
| ABL-2 | Pure AdamW with standard LR=0.001 for matrices | LOW |

**Purpose**: Quantify the gap between the hybrid Muon setup and pure AdamW. If pure AdamW comes close, the added complexity of Muon may not be justified at this scale, and removing it would be a simplification win under the simplicity criterion.

## Execution Order

1. NS-1 (potential speed win)
2. MM-5 (simplification test)
3. MM-1, MM-3 (schedule variants)
4. MB-2 (beta2 check)
5. ABL-1 (diagnostic, run last)
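
The queue above could be encoded as single-change overrides on the baseline config, which keeps each run attributable to one variable. This is a minimal sketch: the key names, `BASELINE`, `OVERRIDES`, and `build_config` are hypothetical and do not refer to the project's actual config system. The values come from the tables in this issue.

```python
# Baseline values taken from the "Current Muon Config" table and ABL-1.
BASELINE = {
    "optimizer": "muon",
    "momentum_start": 0.85,
    "momentum_end": 0.95,
    "ramp_steps": 300,
    "ns_steps": 5,
    "beta2": 0.95,
    "matrix_lr": 0.04,
}

# One override per queued experiment (hypothetical key names).
OVERRIDES = {
    "NS-1": {"ns_steps": 3},
    "MM-5": {"momentum_start": 0.95},   # constant 0.95, no ramp
    "MM-1": {"momentum_start": 0.90},
    "MM-3": {"ramp_steps": 150},
    "MB-2": {"beta2": 0.99},
    "ABL-1": {"optimizer": "adamw"},    # matrix_lr stays at 0.04
}

EXECUTION_ORDER = ["NS-1", "MM-5", "MM-1", "MM-3", "MB-2", "ABL-1"]


def build_config(exp_id):
    """Return a fresh config: the baseline with one experiment's overrides applied."""
    return {**BASELINE, **OVERRIDES[exp_id]}


for exp_id in EXECUTION_ORDER:
    cfg = build_config(exp_id)
    changed = {k: v for k, v in cfg.items() if BASELINE[k] != v}
    print(exp_id, changed)
```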

---

🤖 Generated with [Claude Code](https://claude.com/claude-code)
