## records/track_10min_16mb/2026-03-25_RecurLayers/README.md
# Depth Recurrence (layers 4,5)

## Score: mean val_bpb = 1.1184 (3 seeds: 1.1179, 1.1191, 1.1183)

Trained on 8xH100 SXM in ~600 seconds. ~15.9MB artifact (int6+lzma).

## Motivation

I explored both width scaling (MODEL_DIM=576) and depth scaling (adding layers) and found that depth consistently wins over width in this regime. A full independent 12-layer model at dim=512 outperformed a wider 11-layer model at dim=576, despite the wider model having more parameters. However, adding independent layers pushes the model over the 16MB artifact budget. Depth recurrence solves this: by re-executing mid-network layers with independent block scalars, I get the depth benefit without the parameter/size cost. Dual recurrence on layers 4 and 5 gives 13 virtual layers from 11 physical, staying well under budget at ~15.9MB.

## Approach

Depth recurrence applied to layers 4 and 5, creating 13 virtual layers from 11 physical layers while keeping parameter count at ~27M. Combined with test-time training (TTT) for additional evaluation-time adaptation.

### Dual Depth Recurrence (layers 4,5)
Layers 4 and 5 are executed twice in sequence (pattern: 0,1,2,3,4,5,4,5,6,7,8,9,10), producing 13 virtual layers from 11 physical layers. Each recurrent pass uses independent learnable block scalars, so the model can modulate how the repeated layers behave on their second pass. This adds depth at negligible cost to model size and artifact bytes: only the small block scalar parameters are added (~2K params).
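The execution-order logic can be sketched as follows (the helper name and signature are assumptions for illustration, not the actual `train_gpt.py` API):

```python
def virtual_layer_order(num_layers: int, recur_layers: list[int], repeats: int = 2) -> list[int]:
    """Hypothetical helper: run the contiguous span of recurrent layers
    `repeats` times as a group, leaving all other layers untouched."""
    lo, hi = min(recur_layers), max(recur_layers)
    order = list(range(lo))                      # layers before the recurrent span
    order += list(range(lo, hi + 1)) * repeats   # recurrent span, repeated as a group
    order += list(range(hi + 1, num_layers))     # layers after the span
    return order

# 11 physical layers with recurrence on layers 4 and 5:
print(virtual_layer_order(11, [4, 5]))
# [0, 1, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 10]
```

In the forward pass, each (layer, pass) pair would then index its own learnable block scalar, which is where the ~2K extra parameters come from.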

Everything else (TTT, int6 quantization, SWA, bigram embeddings, value embeddings, Muon optimizer, etc.) is inherited from [PR #549](https://github.com/openai/parameter-golf/pull/549).

## Hyperparameters

| Parameter | Value |
|-----------|-------|
| num_layers | 11 (physical) / 13 (virtual) |
| model_dim | 512 |
| mlp_mult | 3.0 (hidden=1536) |
| recur_layers | 4, 5 |
| train_seq_len | 2048 |
| train_batch_tokens | 786,432 |
| warmdown_iters | 3500 |
| matrix_lr | 0.025 |
| scalar_lr | 0.025 |
| tied_embed_lr | 0.035 |
| muon_momentum | 0.99 (warmup from 0.92 over 1500 steps) |
| muon_weight_decay | 0.04 |
| adam_weight_decay | 0.04 |
| grad_clip_norm | 0.3 |
| eval_stride | 64 |
| swa_every | 50 |
| ttt_lr | 0.002 |
| ttt_epochs | 3 |
| ttt_chunk_tokens | 32768 |
| ttt_freeze_blocks | 2 |
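As a minimal sketch of how `ttt_chunk_tokens` might partition the evaluation stream (a hypothetical helper; the real TTT loop is inherited from PR #549 and may differ):

```python
def ttt_chunks(num_tokens: int, chunk_tokens: int = 32768) -> list[tuple[int, int]]:
    # Hypothetical: (start, end) token spans that the test-time training
    # loop would adapt on, each for ttt_epochs passes at ttt_lr, with the
    # first ttt_freeze_blocks transformer blocks held frozen.
    return [(s, min(s + chunk_tokens, num_tokens))
            for s in range(0, num_tokens, chunk_tokens)]

print(ttt_chunks(100_000))
# [(0, 32768), (32768, 65536), (65536, 98304), (98304, 100000)]
```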

## Key Metrics

- **Mean val_bpb: 1.11840** (std: 0.00049)
- Training: ~6,100 steps in ~600s
- Model params: ~27M
- Mean total submission size: 15,931,152 bytes (~15.9MB, int6+lzma)

## Reproducibility

Three independent training runs with different random seeds:

| Seed | val_loss | val_bpb | total_bytes |
|------|----------|---------|-------------|
| 1337 | 1.88749538 | 1.11788404 | 15,928,948 |
| 2025 | 1.88948575 | 1.11906285 | 15,934,932 |
| 2024 | 1.88811812 | 1.11825287 | 15,929,576 |
| **Mean** | **1.88836642** | **1.11839992** | **15,931,152** |
| **Std** | **0.00083132** | **0.00049235** | |
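The reported means and (population) standard deviations follow directly from the per-seed values in the table:

```python
import statistics

bpb = [1.11788404, 1.11906285, 1.11825287]       # seeds 1337, 2025, 2024
loss = [1.88749538, 1.88948575, 1.88811812]
size = [15_928_948, 15_934_932, 15_929_576]      # total_bytes per seed

print(round(statistics.mean(bpb), 8))    # 1.11839992
print(round(statistics.pstdev(bpb), 8))  # 0.00049235
print(round(statistics.mean(loss), 8))   # 1.88836642
print(round(statistics.pstdev(loss), 8)) # 0.00083132
print(sum(size) // 3)                    # 15931152
```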

## Run Commands

```bash
# Seed 1337 (default)
ITERATIONS=9000 RECUR_LAYERS=4,5 TTT_ENABLED=1 TTT_UNTIE=0 \
torchrun --nproc_per_node=8 train_gpt.py

# Seed 2025
ITERATIONS=9000 RECUR_LAYERS=4,5 TTT_ENABLED=1 TTT_UNTIE=0 SEED=2025 \
torchrun --nproc_per_node=8 train_gpt.py

# Seed 2024
ITERATIONS=9000 RECUR_LAYERS=4,5 TTT_ENABLED=1 TTT_UNTIE=0 SEED=2024 \
torchrun --nproc_per_node=8 train_gpt.py
```
## records/track_10min_16mb/2026-03-25_RecurLayers/submission.json
{
"author": "Marko Sisovic",
"github_id": "msisovic",
"name": "Depth Recurrence (layers 4,5) + TTT",
"blurb": "Dual depth recurrence on layers 4 and 5 (11 physical -> 13 virtual layers) with tied test-time training. Reuses layer weights to add depth without increasing model size, keeping the artifact under 16MB with int6+lzma compression. Combined with TTT, SWA, bigram embeddings, value embeddings, and Muon optimizer with weight decay.",
"date": "2026-03-25T00:00:00Z",
"val_loss": 1.88836642,
"val_bpb": 1.11839992,
"val_loss_std": 0.00083132,
"val_bpb_std": 0.00049235,
"seeds": [1337, 2025, 2024],
"seed_results": {
"1337": {"val_loss": 1.88749538, "val_bpb": 1.11788404},
"2025": {"val_loss": 1.88948575, "val_bpb": 1.11906285},
"2024": {"val_loss": 1.88811812, "val_bpb": 1.11825287}
},
"step_stop": 6100,
"wallclock_seconds": 600.0,
"eval_time_seconds": 475,
"bytes_total": 15928948,
"bytes_code": 95036
}