
Experiment: MLP architecture — SwiGLU and activations #4

@Jason-Adam

Objective

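Test alternative MLP architectures, primarily SwiGLU, which the literature generally reports as outperforming ungated MLPs at matched parameter count. The comparison against the current ReLU² baseline is what this experiment measures.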

Background

The current MLP uses a 4x expansion with ReLU² activation (F.relu(x).square()). SwiGLU (used in LLaMA and Mistral) is a gated linear unit with three weight matrices instead of two, so it uses a ~8/3x expansion to match the parameter count.
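The 8/3 factor follows from counting weight matrices: the current MLP has two matrices of size n_embd × 4·n_embd, while SwiGLU has three of size n_embd × hidden_dim, so hidden_dim = 8/3 · n_embd gives the same total. A minimal sketch of that arithmetic, assuming bias-free linear layers and the round-up-to-64 rule from 3A. The function names and the example widths are illustrative and not taken from train.py:

```python
def relu2_mlp_params(n_embd: int, expansion: int = 4) -> int:
    """Weights in the baseline MLP: up-projection plus down-projection, no biases."""
    hidden_dim = expansion * n_embd
    return 2 * n_embd * hidden_dim


def swiglu_hidden_dim(n_embd: int, multiple: int = 64) -> int:
    """8/3 expansion, rounded up to the next multiple of `multiple`."""
    hidden_dim = int(n_embd * 8 / 3)
    return ((hidden_dim + multiple - 1) // multiple) * multiple


def swiglu_mlp_params(n_embd: int) -> int:
    """Weights in SwiGLU: gate, up, and down projections, no biases."""
    return 3 * n_embd * swiglu_hidden_dim(n_embd)


if __name__ == "__main__":
    for n_embd in (512, 768, 1024):  # example widths only, not from train.py
        base = relu2_mlp_params(n_embd)
        gated = swiglu_mlp_params(n_embd)
        print(f"n_embd={n_embd}: relu2={base}, swiglu={gated}, ratio={gated / base:.4f}")
```

When n_embd * 8 / 3 is not already a multiple of 64, rounding up leaves SwiGLU slightly larger than the baseline, so the match is approximate and not exact.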

Experiments

3A — SwiGLU (HIGH priority) ⭐ Highest expected value single change

Structural change to MLP class in train.py:

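import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        # 8/3 expansion keeps the three-matrix SwiGLU near the size of the 4x two-matrix MLP
        hidden_dim = int(config.n_embd * 8 / 3)
        hidden_dim = ((hidden_dim + 63) // 64) * 64  # round up to a multiple of 64
        self.c_gate = nn.Linear(config.n_embd, hidden_dim, bias=False)  # gate branch
        self.c_up = nn.Linear(config.n_embd, hidden_dim, bias=False)  # value branch
        self.c_proj = nn.Linear(hidden_dim, config.n_embd, bias=False)  # back to n_embd

    def forward(self, x):
        # SwiGLU: the SiLU-activated gate multiplies the up projection elementwise
        return self.c_proj(F.silu(self.c_gate(x)) * self.c_up(x))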

Note: Weight initialization in init_weights() must be updated to cover c_gate and c_up. The c_proj zero-init and the uniform init scale should be preserved.
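One way the note above could translate into code is sketched below. The body of init_weights() is not shown in this issue, so the helper name init_swiglu_mlp and its bound argument are hypothetical; the caller is assumed to pass the same uniform bound that init_weights() already uses for the existing MLP input projection.

```python
import torch
import torch.nn as nn


def init_swiglu_mlp(mlp: nn.Module, bound: float) -> None:
    """Hypothetical helper: initialize an MLP that has c_gate, c_up, and c_proj.

    `bound` is assumed to be the uniform bound init_weights() already applies
    to the existing MLP input projection; it is not defined in this issue.
    """
    # Both input-side projections get the same uniform init.
    nn.init.uniform_(mlp.c_gate.weight, -bound, bound)
    nn.init.uniform_(mlp.c_up.weight, -bound, bound)
    # The output projection stays zero-initialized, so the MLP initially outputs zeros.
    nn.init.zeros_(mlp.c_proj.weight)
```

Giving c_gate and c_up the same bound is a choice made for this sketch, so that neither branch dominates the product at the start of training. Whether that is the right scale for the gated product is something the experiment itself would have to confirm.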

3B — GELU (LOW priority)

# Replace F.relu(x).square() with F.gelu(x) in MLP.forward

3C — SiLU (LOW priority)

# Replace F.relu(x).square() with F.silu(x)
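To run 3B and 3C without hand-editing MLP.forward each time, the activation could be selected by name. This is a sketch only: the ACTIVATIONS table and the apply_activation function are hypothetical and do not exist in train.py.

```python
import torch
import torch.nn.functional as F

# Hypothetical lookup table; the keys are illustrative names for each variant.
ACTIVATIONS = {
    "relu2": lambda x: F.relu(x).square(),  # current baseline
    "gelu": F.gelu,  # experiment 3B
    "silu": F.silu,  # experiment 3C
}


def apply_activation(name: str, x: torch.Tensor) -> torch.Tensor:
    """Apply the named activation elementwise to x."""
    return ACTIVATIONS[name](x)
```

Inside the existing MLP.forward, the call F.relu(x).square() would then become apply_activation(name, x), with name coming from wherever the experiment is configured.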

Expected Impact

  • SwiGLU: The most likely improvement; gated MLPs are generally reported to outperform ungated ones at matched parameter count
  • GELU/SiLU: Neutral to slightly negative vs ReLU². Only run if SwiGLU fails

Execution

Run 3A first. If it wins, it becomes the default MLP for all subsequent experiments.


🤖 Generated with Claude Code

Metadata

    Labels

    experiment (Hyperparameter or architecture experiment)
    priority: high (High impact, run first)
    size: S (Small, 1-5 experiments or <1 hour work)
