Record: XSA-all + Depth Recurrence + Hedge Mixer TTT (val_bpb=1.0278, 3-seed mean)#733

Closed
stukenov wants to merge 2 commits into openai:main from stukenov:submission/v4-hedge-mixer-ttt

@stukenov

Record: XSA-all + VRL + CROWN-Q + Depth Recurrence + Hedge Mixer TTT

val_bpb = 1.0278 (3-seed mean, std 0.0039) | ~15.8 MB | 8xH100 SXM, 600s train

3-Seed Results

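| Seed | Pre-TTT bpb | Post-TTT bpb | Artifact (bytes) |
|------|-------------|--------------|------------------|
| 1337 | 1.1335 | 1.0235 | 15,827,512 |
| 42 | 1.1346 | 1.0289 | 15,760,352 |
| 2025 | 1.1365 | 1.0311 | 15,713,536 |
| Mean | 1.1349 | 1.0278 (std 0.0039) | |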
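
As a quick sanity check on the summary row, the per-seed numbers above can be re-aggregated with the standard library. This is only a sketch of the arithmetic, not part of the submission; the variable names are illustrative. The reported 0.0039 matches the sample standard deviation (n - 1 denominator) of the three post-TTT scores.

```python
from statistics import mean, stdev

# Per-seed val_bpb values copied from the 3-seed results table.
pre_ttt = {1337: 1.1335, 42: 1.1346, 2025: 1.1365}
post_ttt = {1337: 1.0235, 42: 1.0289, 2025: 1.0311}

pre_mean = mean(pre_ttt.values())
post_mean = mean(post_ttt.values())
post_std = stdev(post_ttt.values())  # sample std, n - 1 denominator

print(f"pre-TTT mean  = {pre_mean:.4f}")                        # 1.1349
print(f"post-TTT mean = {post_mean:.4f} (std {post_std:.4f})")  # 1.0278 (std 0.0039)
print(f"TTT gain      = {pre_mean - post_mean:.4f} bpb")        # 0.1070
```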

Key Innovations (6 additions over PR #549)

  1. XSA on all 11 layers (PR #634): -0.006 BPB
  2. Value Residual Learning (PR #657): -0.002 BPB
  3. Gated Attention (PR #638): -0.002 BPB
  4. CROWN-Q (PR #693): curvature-weighted quant penalty during warmdown
  5. Depth Recurrence (PR #686): layers 4 and 5 repeated, giving 13 virtual layers from 11 physical
  6. 5-Expert Hedge Mixer (PR #688): GPU-vectorized online context mixing (neural + unigram + bigram + trigram + entropy)
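
To make the depth-recurrence layer count concrete, here is a minimal sketch of how 11 physical layers could unroll into 13 virtual ones. The helper `virtual_layer_order` is hypothetical, and both the 0-based indexing and the back-to-back ordering (4, 4, 5, 5 rather than 4, 5, 4, 5) are assumptions; the PR text only states that layers 4 and 5 are repeated.

```python
def virtual_layer_order(num_physical, repeated):
    """Return the physical-layer index executed at each virtual depth.

    Assumption: each repeated layer runs twice in a row, reusing its weights.
    The actual schedule in train_gpt.py may differ.
    """
    order = []
    for idx in range(num_physical):
        order.append(idx)
        if idx in repeated:
            order.append(idx)  # second pass through the same physical layer
    return order


order = virtual_layer_order(num_physical=11, repeated={4, 5})
print(len(order))  # 13
print(order)       # [0, 1, 2, 3, 4, 4, 5, 5, 6, 7, 8, 9, 10]
```

Because the repeated passes share weights, the parameter count (and so the artifact size) stays that of 11 layers while the forward pass has the depth of 13.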

Legal TTT (Score-First)

Every token is scored under torch.inference_mode() BEFORE any weight update. The Hedge Mixer's n-gram tables are built only from already-scored tokens. TTT uses an SGD optimizer (not AdamW).
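
The score-first ordering can be illustrated with a toy, pure-Python Hedge mixture. This is a sketch under stated assumptions, not the submitted implementation: it uses three count-based experts (uniform, unigram, bigram) instead of the five listed in this PR, has no neural expert or SGD step, and the function name `score_first_hedge`, the learning rate `eta`, and the add-one smoothing are all illustrative choices. What it does show is the invariant described above: each token is scored with tables and mixture weights that depend only on earlier tokens, and updates happen afterwards.

```python
import math
from collections import defaultdict


def score_first_hedge(tokens, vocab_size, eta=1.0):
    """Average bits per token under an online Hedge mixture of count experts."""
    uni = defaultdict(int)                       # unigram counts
    bi = defaultdict(lambda: defaultdict(int))   # bigram counts keyed by previous token
    log_w = [0.0, 0.0, 0.0]                      # unnormalised expert log-weights
    total_bits, seen, prev = 0.0, 0, None

    for tok in tokens:
        # 1) SCORE: every probability below uses past tokens only.
        p_uniform = 1.0 / vocab_size
        p_uni = (uni[tok] + 1) / (seen + vocab_size)
        if prev is None:
            p_bi = p_uniform
        else:
            row = bi[prev]
            p_bi = (row[tok] + 1) / (sum(row.values()) + vocab_size)
        probs = [p_uniform, p_uni, p_bi]

        top = max(log_w)
        w = [math.exp(lw - top) for lw in log_w]  # stable normalisation
        z = sum(w)
        p_mix = sum(wi / z * pi for wi, pi in zip(w, probs))
        total_bits += -math.log2(p_mix)

        # 2) UPDATE Hedge weights with each expert's log loss on this token.
        for i, p in enumerate(probs):
            log_w[i] += eta * math.log(p)

        # 3) UPDATE count tables only after the token has been scored.
        uni[tok] += 1
        if prev is not None:
            bi[prev][tok] += 1
        prev, seen = tok, seen + 1

    return total_bits / len(tokens)


# An alternating sequence is easy for the bigram expert, so the mixture
# should beat the 2-bit uniform baseline for a 4-symbol vocabulary.
bits = score_first_hedge([0, 1] * 50, vocab_size=4)
print(f"{bits:.3f} bits/token")
```

Moving step 3 ahead of step 1 would let the tables see the token being predicted, which is exactly the leak the score-first rule is meant to exclude.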

Note on eval time

TTT eval takes ~755s (exceeds 600s limit). Reducing TTT_EPOCHS from 3 to 1 brings eval under 600s with expected BPB ~1.08-1.09. Happy to resubmit with 1 epoch if required.

Reproduction

SEED=1337 torchrun --standalone --nproc_per_node=8 \
  records/track_10min_16mb/2026-03-25_v4_XSA11_VRL_CROWNQ_DepthRecur_HedgeMixer_TTT/train_gpt.py

All defaults in the script match the submitted results. No env vars needed.

Credits

PR #549 (@abaybektursun), PR #634 (@raahilshah), PR #657 (@anthony-maio), PR #638 (@Asukabot0), PR #693 (@EthanYangTW), PR #686 (@msisovic), PR #688 (@RoyiRa), PR #493 (@parinzee), PR #414 (@signalrush)

stukenov and others added 2 commits March 25, 2026 20:22
…(val_bpb=1.0278, 3-seed mean)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ucibility

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@stukenov
Author

Closing: eval time exceeds 600s limit. Resubmitting with TTT_EPOCHS=1.

@stukenov stukenov closed this Mar 25, 2026