Record: Depth Recurrence (layers 4 and 5 repeated): val_bpb 1.1182 by msisovic · Pull Request #686 · openai/parameter-golf

msisovic · 2026-03-25T05:58:51Z

Summary

Building on PR #549, I explored two directions for improving val_bpb: width scaling (MODEL_DIM=576) and depth scaling (adding layers). Width scaling to dim=576 regressed performance. Depth scaling to 12 independent layers at dim=512 reached 1.1126 post-TTT, clearly better than the 11-layer baseline's 1.1194, so I decided to go in that direction.

This led me to depth recurrence: re-executing mid-network layers with independent learnable block scalars, getting the depth benefit without the parameter/size cost. Layers 4 and 5 are each executed twice in sequence (pattern: 0,1,2,3,4,5,4,5,6,7,8,9,10), producing 13 virtual layers from 11 physical. Only ~2K block scalar params are added. Dual recurrence recovers ~70% of the independent 12-layer gain while keeping the artifact well under budget at ~15.9MB.
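The execution pattern described above can be sketched in plain Python. This is only an illustration, not the code in `train_gpt.py`: the helper names `recurrent_schedule` and `forward`, the toy layers, and the choice of one scalar per virtual slot applied on a residual branch are all assumptions made for the example.

```python
def recurrent_schedule(num_physical, repeat_start, repeat_end, repeats=2):
    """Return the order in which physical layer indices are executed.

    Layers repeat_start..repeat_end (inclusive) are run `repeats` times
    as a block; all other layers run once.
    """
    prefix = list(range(0, repeat_start))
    block = list(range(repeat_start, repeat_end + 1))
    suffix = list(range(repeat_end + 1, num_physical))
    return prefix + block * repeats + suffix


def forward(x, layers, schedule, scales):
    """Toy forward pass: weights are shared per physical layer,
    but each virtual slot has its own scalar (hypothetical form)."""
    for slot, layer_idx in enumerate(schedule):
        x = x + scales[slot] * layers[layer_idx](x)
    return x


# 11 physical layers, layers 4 and 5 repeated once more as a block.
schedule = recurrent_schedule(num_physical=11, repeat_start=4, repeat_end=5)
print(schedule)

# Toy stand-ins for transformer blocks: each one just returns 1.0.
toy_layers = [lambda x: 1.0 for _ in range(11)]
toy_scales = [1.0] * len(schedule)  # one scalar per virtual layer
toy_output = forward(0.0, toy_layers, schedule, toy_scales)
```

The point of the sketch is that the schedule has 13 entries but only 11 distinct layer indices, so the extra depth adds no new weight matrices, only the per-slot scalars.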

I also confirmed that tied TTT (no weight untying for recurrent layers) performs equivalently to untied, and that the TTT gain (~0.0025 BPB) is consistent regardless of recurrence config. Everything else (TTT, int6 quantization, SWA, bigram embeddings, value embeddings, Muon optimizer) is inherited from #549.

Config	Params	Artifact	Post-TTT val_bpb
PR #549 baseline (11L)	~24M	~19.5MB	1.1194
Full 12L (over budget)	~29M	~17.3MB	1.1126
Recur L5 (11→12 virtual)	~27M	~15.9MB	1.1180
Recur L4,5 (11→13 virtual)	~27M	~15.9MB	1.1182

Reproducibility

Seed	val_bpb
1337	1.11788
2025	1.11906
2024	1.11763
Mean	1.11819
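As a quick sanity check, the reported mean can be recomputed from the three per-seed values in the table. The variable names below are illustrative only.

```python
from statistics import mean

# Per-seed post-TTT val_bpb values copied from the table above.
seed_val_bpb = {1337: 1.11788, 2025: 1.11906, 2024: 1.11763}

mean_val_bpb = mean(seed_val_bpb.values())
spread = max(seed_val_bpb.values()) - min(seed_val_bpb.values())

print(f"mean val_bpb = {mean_val_bpb:.5f}")
print(f"max - min across seeds = {spread:.5f}")
```

The mean rounds to 1.1182, which is the headline figure in the PR title.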


msisovic and others added 6 commits March 25, 2026 01:43

Getting some good results

e9477e4

Add submission: Depth Recurrence (layers 4,5) + TTT

51984ca

Seed 1337 complete (val_bpb=1.1179). Seeds 42 and 2024 need rerun after GPU restart (stale CUDA contexts blocking clean runs). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Finalize 3-seed results: mean val_bpb=1.1182 (seeds 1337,2025,2024)

9c2cad6

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Remove experiment_results.md from submission

bf96887

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Restore train_gpt.py to match main branch

a9bb2ab

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Clean up submission README: first person, simplify approach section

174142c

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

msisovic changed the title ~~Submission/2026 03 25 recur layers ttt~~ Record: Depth Recurrence (layers 4 and 5 repeated): val_bpb 1.1182 Mar 25, 2026

msisovic wants to merge 6 commits into openai:main from msisovic:submission/2026-03-25_RecurLayers_TTT
