Experiment: Depth vs width architecture search

## Objective

Find the optimal depth/width tradeoff for the 5-minute training budget. Currently `DEPTH=8, ASPECT_RATIO=64` gives `model_dim=512` with 4 attention heads.

## Background

At small scale, width tends to matter more than depth, but there's a sweet spot. `model_dim = DEPTH * ASPECT_RATIO` and must be divisible by `HEAD_DIM` (128).

## Experiments

| ID | DEPTH | ASPECT_RATIO | model_dim | n_head | Priority | Rationale |
|----|-------|-------------|-----------|--------|----------|-----------|
| 1A | 6 | 128 | 768 | 6 | HIGH | Wider, shallower — tests if width-starved |
| 1B | 12 | 32 | 384 | 3 | MEDIUM | Deeper, narrower — tests depth compensation |
| 1C | 10 | 64 | 640 | 5 | HIGH | Balanced deeper — moderate param increase |
| 1D | 8 | 96 | 768 | 6 | HIGH | Same depth, 50% wider |

## Code Changes

All changes are in `train.py` lines 596-614:
```python
DEPTH = <value>
ASPECT_RATIO = <value>
```

## Expected Outcomes

- If 1D (wider) wins: model is width-starved, bias toward wider shapes
- If 1C (deeper) wins: depth matters, GQA reinvestment becomes attractive
- 1B likely negative (3 heads is very few) but useful diagnostic

## Execution

Run 1D and 1C first (highest expected value), then 1A, then 1B. Each run is 5 minutes.

## Decision Gate

Winner becomes the new baseline shape for subsequent architecture experiments (attention patterns, GQA, head dim).

---

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Experiment: Depth vs width architecture search #3

Objective

Background

Experiments

Code Changes

Expected Outcomes

Execution

Decision Gate

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

ID	DEPTH	ASPECT_RATIO	model_dim	n_head	Priority	Rationale
1A	6	128	768	6	HIGH	Wider, shallower — tests if width-starved
1B	12	32	384	3	MEDIUM	Deeper, narrower — tests depth compensation
1C	10	64	640	5	HIGH	Balanced deeper — moderate param increase
1D	8	96	768	6	HIGH	Same depth, 50% wider

Experiment: Depth vs width architecture search #3

Description

Objective

Background

Experiments

Code Changes

Expected Outcomes

Execution

Decision Gate

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions