
Record: XSA-all + Depth Recurrence + Hedge Mixer TTT (val_bpb=1.0222, 3-seed mean)#745

Open
stukenov wants to merge 1 commit into openai:main from stukenov:submission/v4-final-1epoch

Conversation

@stukenov

Record: XSA-all + VRL + CROWN-Q + Depth Recurrence + Hedge Mixer TTT

val_bpb = 1.0222 (3-seed mean, std 0.0067) | <16 MB | 8xH100 SXM | 600s train, 507s eval

3-Seed Results

Seed   Pre-TTT bpb   Post-TTT bpb          TTT time   Artifact bytes
1337   1.1336        1.0201                507s       15,857,972
42     1.1339        1.0165                508s       15,846,228
2025   1.1369        1.0299                507s       15,669,888
Mean   1.1348        1.0222 (std 0.0067)   507s

Compliance

  • Training: 600s on 8xH100 SXM
  • Eval (TTT + sliding): 507s on 8xH100 SXM (under 600s limit)
  • All artifacts under 16,000,000 bytes
  • Score-first TTT: every token scored under torch.inference_mode() before any weight update
  • N-gram tables built from already-scored tokens only
  • No training data access during evaluation
  • GPTQ-lite: no calibration data needed
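
The score-first contract in the list above can be illustrated with a toy unigram scorer (a pure-Python stand-in; the real run scores under torch.inference_mode() and uses richer n-gram tables — all names below are illustrative, not the PR's code):

```python
from collections import Counter

def score_first_ttt(tokens, chunk=4):
    """Score-first TTT over a token stream: each chunk is scored with the
    current (frozen) unigram table, and the table is updated only afterwards,
    so no token ever influences its own score."""
    counts = Counter()  # n-gram table built from already-scored tokens only
    total = 0
    scores = []
    for i in range(0, len(tokens), chunk):
        block = tokens[i:i + chunk]
        # 1) score under the frozen table (stands in for inference_mode)
        for t in block:
            # add-one smoothing over an assumed 256-symbol alphabet
            scores.append((counts[t] + 1) / (total + 256))
        # 2) only now fold the scored tokens into the table
        counts.update(block)
        total += len(block)
    return scores
```

The first chunk is always scored against an empty table, which is exactly the property the compliance checker cares about.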

6 Additions Over PR #549

  1. XSA on all layers (PR #634: 11L XSA-all + Full GPTQ + Parallel Muon + Selective Pruning, val_bpb 1.1171) — -0.006 bpb
  2. Value Residual Learning (PR #657: 11L LeakyReLU² + VRL + lzma, val_bpb 1.1229, 3-seed mean) — layer 0 V blended via sigmoid gates
  3. Gated Attention (PR #638: 11L XSA-all + LeakyReLU(0.5)² + VR + GA, val_bpb 1.1164, pending 3-seed) — per-head sigmoid gates
  4. CROWN-Q (PR #693: CROWN-Q + Full GPTQ + SWA/EMA Blend, val_bpb 1.1186, 3-seed mean) — curvature-weighted quantization penalty during warmdown
  5. Depth Recurrence (PR #686: layers 4 and 5 repeated, val_bpb 1.1182) — layers 4 and 5 run twice, giving 13 virtual layers from 11 physical
  6. 5-Expert Hedge Mixer (PR #688: 5-expert Hedge Mixer + TTT, val_bpb 1.0745, 3-seed mean) — online mixing of neural, unigram, bigram, trigram, and entropy experts via the Hedge algorithm
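
The Hedge mixing in addition 6 keeps one weight per expert and discounts each expert by its log-loss after every token; a minimal sketch (two experts instead of five, eta and the clipping floor are illustrative assumptions):

```python
import math

def hedge_mix(expert_probs, eta=0.5):
    """Online Hedge mixture of per-token expert probabilities.
    expert_probs: one list per timestep, holding each expert's probability
    for the token that actually occurred. Returns the mixed probability
    the ensemble assigned at each step."""
    k = len(expert_probs[0])
    w = [1.0 / k] * k  # uniform prior over experts
    mixed = []
    for probs in expert_probs:
        total = sum(w)
        mixed.append(sum(wi * pi for wi, pi in zip(w, probs)) / total)
        # Hedge update with log-loss: w_i *= exp(-eta * -log p_i) == p_i ** eta
        w = [wi * max(pi, 1e-12) ** eta for wi, pi in zip(w, probs)]
    return mixed
```

An expert that keeps assigning high probability to the observed tokens accumulates weight, so the mixture's probability drifts toward that expert's.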

Reproduction

SEED=1337 torchrun --standalone --nproc_per_node=8 train_gpt.py

All defaults in the script match the submitted results. No env vars needed.
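
Depth recurrence (addition 5) amounts to replaying two physical blocks during the forward pass; a minimal sketch, where the exact repeat positions in the schedule are an assumption, not the PR's code:

```python
def run_depth_recurrence(x, layers):
    """Run 11 physical layers with layers 4 and 5 applied a second time,
    yielding a 13-deep virtual stack from 11 sets of weights."""
    assert len(layers) == 11
    # hypothetical schedule: repeat blocks 4 and 5 right after first use
    schedule = list(range(6)) + [4, 5] + list(range(6, 11))  # 13 entries
    for i in schedule:
        x = layers[i](x)
    return x
```

Because the repeated blocks share weights, the artifact size stays that of 11 layers while the effective depth is 13.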

Credits

PR #549 (@abaybektursun), #634 (@raahilshah), #657 (@anthony-maio), #638 (@Asukabot0), #693 (@EthanYangTW), #686 (@msisovic), #688 (@RoyiRa), #493 (@parinzee), #414 (@signalrush)


Training: 600s, Eval: 507s — both within limits.
3 seeds: 1.0201, 1.0165, 1.0299 (mean 1.0222, std 0.0067)