forked from karpathy/autoresearch
-
Notifications
You must be signed in to change notification settings - Fork 0
Experiment: Depth vs width architecture search #3
Copy link
Copy link
Open
Labels
experimentHyperparameter or architecture experimentHyperparameter or architecture experimentpriority: highHigh impact, run firstHigh impact, run firstsize: SSmall — 1-5 experiments or <1 hour workSmall — 1-5 experiments or <1 hour work
Description
Objective
Find the optimal depth/width tradeoff for the 5-minute training budget. Currently DEPTH=8, ASPECT_RATIO=64 gives model_dim=512 with 4 attention heads.
Background
At small scale, width tends to matter more than depth, but there's a sweet spot. model_dim = DEPTH * ASPECT_RATIO and must be divisible by HEAD_DIM (128).
Experiments
| ID | DEPTH | ASPECT_RATIO | model_dim | n_head | Priority | Rationale |
|---|---|---|---|---|---|---|
| 1A | 6 | 128 | 768 | 6 | HIGH | Wider, shallower — tests if width-starved |
| 1B | 12 | 32 | 384 | 3 | MEDIUM | Deeper, narrower — tests depth compensation |
| 1C | 10 | 64 | 640 | 5 | HIGH | Balanced deeper — moderate param increase |
| 1D | 8 | 96 | 768 | 6 | HIGH | Same depth, 50% wider |
Code Changes
All changes are in train.py lines 596-614:
DEPTH = <value>
ASPECT_RATIO = <value>Expected Outcomes
- If 1D (wider) wins: model is width-starved, bias toward wider shapes
- If 1C (deeper) wins: depth matters, GQA reinvestment becomes attractive
- 1B likely negative (3 heads is very few) but useful diagnostic
Execution
Run 1D and 1C first (highest expected value), then 1A, then 1B. Each run is 5 minutes.
Decision Gate
Winner becomes the new baseline shape for subsequent architecture experiments (attention patterns, GQA, head dim).
🤖 Generated with Claude Code
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
experimentHyperparameter or architecture experimentHyperparameter or architecture experimentpriority: highHigh impact, run firstHigh impact, run firstsize: SSmall — 1-5 experiments or <1 hour workSmall — 1-5 experiments or <1 hour work