Experiment: MLP architecture — SwiGLU and activations

## Objective

Test alternative MLP architectures, primarily SwiGLU which consistently outperforms ReLU² in the literature.

## Background

Current MLP uses 4x expansion with ReLU² activation (`F.relu(x).square()`). SwiGLU (used in LLaMA, Mistral) uses a gated linear unit with ~8/3x expansion to match parameter count.

## Experiments

### 3A — SwiGLU (HIGH priority) ⭐ Highest expected value single change

Structural change to MLP class in `train.py`:
```python
class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        hidden_dim = int(config.n_embd * 8 / 3)
        hidden_dim = ((hidden_dim + 63) // 64) * 64  # round to multiple of 64
        self.c_gate = nn.Linear(config.n_embd, hidden_dim, bias=False)
        self.c_up = nn.Linear(config.n_embd, hidden_dim, bias=False)
        self.c_proj = nn.Linear(hidden_dim, config.n_embd, bias=False)

    def forward(self, x):
        return self.c_proj(F.silu(self.c_gate(x)) * self.c_up(x))
```

**Note**: Weight initialization in `init_weights()` must be updated to cover `c_gate` and `c_up`. The `c_proj` zero-init and the uniform init scale should be preserved.

### 3B — GELU (LOW priority)
```python
# Replace F.relu(x).square() with F.gelu(x) in MLP.forward
```

### 3C — SiLU (LOW priority)
```python
# Replace F.relu(x).square() with F.silu(x)
```

## Expected Impact

- **SwiGLU**: Most likely improvement — consistently outperforms in the literature at matched param count
- **GELU/SiLU**: Neutral to slightly negative vs ReLU². Only run if SwiGLU fails

## Execution

Run 3A first. If it wins, it becomes the default MLP for all subsequent experiments.

---

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Experiment: MLP architecture — SwiGLU and activations #4

Objective

Background

Experiments

3A — SwiGLU (HIGH priority) ⭐ Highest expected value single change

3B — GELU (LOW priority)

3C — SiLU (LOW priority)

Expected Impact

Execution

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Experiment: MLP architecture — SwiGLU and activations #4

Description

Objective

Background

Experiments

3A — SwiGLU (HIGH priority) ⭐ Highest expected value single change

3B — GELU (LOW priority)

3C — SiLU (LOW priority)

Expected Impact

Execution

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions