forked from karpathy/autoresearch
Experiment: MLP architecture — SwiGLU and activations #4
Open
Labels
experiment (Hyperparameter or architecture experiment)
priority: high (High impact, run first)
size: S (Small: 1-5 experiments or <1 hour work)
Description
Objective
Test alternative MLP architectures, primarily SwiGLU, which consistently outperforms ReLU² in the literature.
Background
Current MLP uses 4x expansion with ReLU² activation (F.relu(x).square()). SwiGLU (used in LLaMA, Mistral) uses a gated linear unit with ~8/3x expansion to match parameter count.
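The 8/3 figure follows from matching weight counts: a two-matrix MLP with 4x expansion has 2 · d · 4d = 8d² weights, while a three-matrix gated MLP with hidden width h has 3 · d · h, so h = 8d/3. The sketch below checks that arithmetic. The helper names and the width of 768 are illustrative, not taken from train.py, and the baseline is assumed to be bias-free.

```python
def relu2_mlp_params(n_embd: int, expansion: int = 4) -> int:
    """Weights in a two-matrix MLP (up + down projection, no biases assumed)."""
    hidden = expansion * n_embd
    return 2 * n_embd * hidden


def swiglu_hidden_dim(n_embd: int, multiple_of: int = 64) -> int:
    """8/3 expansion, rounded up to a multiple of `multiple_of`."""
    hidden = int(n_embd * 8 / 3)
    return ((hidden + multiple_of - 1) // multiple_of) * multiple_of


def swiglu_mlp_params(n_embd: int, multiple_of: int = 64) -> int:
    """Weights in a three-matrix SwiGLU MLP (gate + up + down, no biases)."""
    return 3 * n_embd * swiglu_hidden_dim(n_embd, multiple_of)


n_embd = 768  # illustrative width, not necessarily the repo's config
baseline = relu2_mlp_params(n_embd)
gated = swiglu_mlp_params(n_embd)
print(f"ReLU² MLP: {baseline:,}  SwiGLU MLP: {gated:,}  ratio: {gated / baseline:.4f}")
```

When 8d/3 is not already a multiple of 64, rounding up makes the SwiGLU MLP slightly larger than the baseline, so the match is approximate at those widths.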
Experiments
3A — SwiGLU (HIGH priority) ⭐ Highest expected value single change
Structural change to MLP class in train.py:
    class MLP(nn.Module):
        def __init__(self, config):
            super().__init__()
            hidden_dim = int(config.n_embd * 8 / 3)
            hidden_dim = ((hidden_dim + 63) // 64) * 64  # round up to a multiple of 64
            self.c_gate = nn.Linear(config.n_embd, hidden_dim, bias=False)
            self.c_up = nn.Linear(config.n_embd, hidden_dim, bias=False)
            self.c_proj = nn.Linear(hidden_dim, config.n_embd, bias=False)

        def forward(self, x):
            return self.c_proj(F.silu(self.c_gate(x)) * self.c_up(x))

Note: Weight initialization in init_weights() must be updated to cover c_gate and c_up. The c_proj zero-init and the uniform init scale should be preserved.
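One way the init update might look is sketched below. It is not the actual init_weights() from train.py: `SwiGLUMLP` is a stand-in for the class above that takes `n_embd` directly, `init_swiglu_mlp` is a hypothetical helper, and the uniform bound is an assumed placeholder to be replaced with whatever scale the existing code applies to the current input projection.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLUMLP(nn.Module):
    """Stand-in for the SwiGLU MLP above, taking n_embd directly."""

    def __init__(self, n_embd: int):
        super().__init__()
        hidden_dim = ((int(n_embd * 8 / 3) + 63) // 64) * 64
        self.c_gate = nn.Linear(n_embd, hidden_dim, bias=False)
        self.c_up = nn.Linear(n_embd, hidden_dim, bias=False)
        self.c_proj = nn.Linear(hidden_dim, n_embd, bias=False)

    def forward(self, x):
        return self.c_proj(F.silu(self.c_gate(x)) * self.c_up(x))


def init_swiglu_mlp(mlp: SwiGLUMLP, n_embd: int) -> float:
    """Hypothetical helper: same uniform init for both input projections, zeros for c_proj."""
    # ASSUMPTION: this bound gives weights with std 1/sqrt(n_embd);
    # substitute the scale that init_weights() in train.py already uses.
    bound = (3.0 / n_embd) ** 0.5
    nn.init.uniform_(mlp.c_gate.weight, -bound, bound)
    nn.init.uniform_(mlp.c_up.weight, -bound, bound)
    nn.init.zeros_(mlp.c_proj.weight)  # MLP output is exactly zero at init
    return bound


mlp = SwiGLUMLP(n_embd=128)
bound = init_swiglu_mlp(mlp, n_embd=128)
out = mlp(torch.randn(2, 8, 128))
print(out.shape, out.abs().max().item())
```

With c_proj zeroed, the MLP contributes nothing at step 0 regardless of how c_gate and c_up are initialized, which is the property the note asks to preserve.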
3B — GELU (LOW priority)
    # Replace F.relu(x).square() with F.gelu(x) in MLP.forward
3C — SiLU (LOW priority)
    # Replace F.relu(x).square() with F.silu(x)
Expected Impact
- SwiGLU: Most likely improvement; consistently outperforms ReLU² in the literature at matched param count
- GELU/SiLU: Neutral to slightly negative vs ReLU². Only run if SwiGLU fails
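To make the 3B and 3C swaps concrete, the three pointwise activations can be compared on a few scalar inputs. This is a plain-Python sketch using the standard definitions: the GELU here is the exact erf form, which is assumed to match the default behaviour of F.gelu, and the function names are illustrative.

```python
import math


def relu2(x: float) -> float:
    """ReLU squared, the scalar form of F.relu(x).square()."""
    return max(x, 0.0) ** 2


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def silu(x: float) -> float:
    """SiLU (swish): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  relu2={relu2(x):.4f}  gelu={gelu(x):+.4f}  silu={silu(x):+.4f}")
```

ReLU² is exactly zero for negative inputs and grows quadratically for positive ones, whereas GELU and SiLU pass small negative values and grow roughly linearly.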
Execution
Run 3A first. If it wins, it becomes the default MLP for all subsequent experiments.
🤖 Generated with Claude Code