Experiment: Depth vs width architecture search #3

@Jason-Adam

Objective

Find the optimal depth/width tradeoff for the 5-minute training budget. Currently DEPTH=8, ASPECT_RATIO=64 gives model_dim=512 with 4 attention heads.

Background

At small scale, width tends to matter more than depth, but there is a sweet spot between the two. model_dim = DEPTH * ASPECT_RATIO, and it must be divisible by HEAD_DIM (128), so n_head = model_dim / HEAD_DIM.
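
The shape constraint above can be checked before launching a run. This is a minimal sketch: `shape_for` is a hypothetical helper, not a function from train.py, and it assumes the head count is simply model_dim divided by HEAD_DIM, which matches the baseline (512 / 128 = 4).

```python
HEAD_DIM = 128  # per-head width stated in this issue


def shape_for(depth: int, aspect_ratio: int, head_dim: int = HEAD_DIM) -> tuple[int, int]:
    """Return (model_dim, n_head) for a depth / aspect-ratio pair.

    Hypothetical helper for illustration; train.py may derive these differently.
    """
    model_dim = depth * aspect_ratio
    if model_dim % head_dim != 0:
        raise ValueError(
            f"model_dim={model_dim} is not divisible by head_dim={head_dim}"
        )
    return model_dim, model_dim // head_dim


print(shape_for(8, 64))  # current baseline shape
```

A pair such as DEPTH=7, ASPECT_RATIO=64 would raise here, since 448 is not a multiple of 128.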

Experiments

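| ID | DEPTH | ASPECT_RATIO | model_dim | n_head | Priority | Rationale |
|----|-------|--------------|-----------|--------|----------|-----------|
| 1A | 6     | 128          | 768       | 6      | HIGH     | Wider, shallower: tests if width-starved |
| 1B | 12    | 32           | 384       | 3      | MEDIUM   | Deeper, narrower: tests depth compensation |
| 1C | 10    | 64           | 640       | 5      | HIGH     | Balanced deeper: moderate param increase |
| 1D | 8     | 96           | 768       | 6      | HIGH     | Same depth, 50% wider |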
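
To compare the candidates at a glance, the sketch below recomputes model_dim and n_head for each row and adds a rough size estimate. The estimate is an assumption, not something taken from train.py: it uses the common approximation of about 12 * model_dim^2 weights per transformer layer (attention plus a 4x MLP) and ignores embeddings, so the real counts may differ.

```python
HEAD_DIM = 128

# (DEPTH, ASPECT_RATIO) pairs from the table, plus the current baseline.
CONFIGS = {
    "baseline": (8, 64),
    "1A": (6, 128),
    "1B": (12, 32),
    "1C": (10, 64),
    "1D": (8, 96),
}


def approx_block_params(depth: int, model_dim: int) -> int:
    # Assumption: ~4*d^2 attention weights + ~8*d^2 MLP weights per layer.
    return 12 * depth * model_dim ** 2


def summarize(configs):
    """Map each ID to (model_dim, n_head, approx_params, ratio_to_baseline)."""
    base_depth, base_ar = configs["baseline"]
    base = approx_block_params(base_depth, base_depth * base_ar)
    rows = {}
    for name, (depth, aspect_ratio) in configs.items():
        model_dim = depth * aspect_ratio
        if model_dim % HEAD_DIM != 0:
            raise ValueError(f"{name}: model_dim={model_dim} not divisible by {HEAD_DIM}")
        params = approx_block_params(depth, model_dim)
        rows[name] = (model_dim, model_dim // HEAD_DIM, params, params / base)
    return rows


for name, (dim, heads, params, ratio) in summarize(CONFIGS).items():
    print(f"{name:>8}: model_dim={dim} n_head={heads} "
          f"~{params / 1e6:.1f}M block params ({ratio:.2f}x baseline)")
```

Because every run has the same 5-minute budget, a larger shape generally completes fewer optimizer steps on the same hardware, so the size ratio is worth reading alongside the final metric.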

Code Changes

All changes are in train.py lines 596-614:

DEPTH = <value>
ASPECT_RATIO = <value>

Expected Outcomes

  • If 1D (wider) wins: model is width-starved, bias toward wider shapes
  • If 1C (deeper) wins: depth matters, GQA reinvestment becomes attractive
  • 1B is likely a negative result (3 heads is very few) but is still a useful diagnostic

Execution

Run 1D and 1C first (highest expected value), then 1A, then 1B. Each run is 5 minutes.
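
The run order above can be written down as a small plan that lists the two constants to set for each run and the cumulative training time. This is only a sketch: `run_plan` is a hypothetical helper, and the elapsed figures count the 5-minute training budget alone, not startup or compilation overhead.

```python
RUN_MINUTES = 5  # training budget per run, as stated in this issue
RUN_ORDER = ["1D", "1C", "1A", "1B"]  # order given in the Execution section

# (DEPTH, ASPECT_RATIO) per experiment ID, copied from the table.
SHAPES = {"1A": (6, 128), "1B": (12, 32), "1C": (10, 64), "1D": (8, 96)}


def run_plan(order, shapes, minutes_per_run=RUN_MINUTES):
    """Return (id, depth, aspect_ratio, cumulative_minutes) for each run in order."""
    plan = []
    for position, exp_id in enumerate(order, start=1):
        depth, aspect_ratio = shapes[exp_id]
        plan.append((exp_id, depth, aspect_ratio, position * minutes_per_run))
    return plan


for exp_id, depth, aspect_ratio, elapsed in run_plan(RUN_ORDER, SHAPES):
    print(f"{exp_id}: DEPTH = {depth}, ASPECT_RATIO = {aspect_ratio} "
          f"(finished after ~{elapsed} min of training time)")
```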

Decision Gate

Winner becomes the new baseline shape for subsequent architecture experiments (attention patterns, GQA, head dim).


🤖 Generated with Claude Code

Metadata

    Labels

    experiment (Hyperparameter or architecture experiment), priority: high (High impact, run first), size: S (Small: 1-5 experiments or <1 hour work)
