
11L EMA + GPTQ-lite + Legal Score-First TTT (1.1408 BPB) #691

Open
xexyz wants to merge 3 commits into openai:main from xexyz:xexyz/pr414-cosine-ttt-30ep

Conversation


@xexyz xexyz commented Mar 25, 2026

Summary

  • val_bpb: 1.1408 (legal score-first TTT, stride-64)
  • Artifact size: 15,758,590 bytes (under 16MB)
  • 8xH100 SXM, seed=1337

Technique

Legal chunk-based, score-first TTT on the PR #414 consensus stack. Each validation chunk is scored first (under inference_mode), and only then does the model train on the already-scored tokens; the model never trains on tokens it has not yet scored.

Legal TTT Protocol

  1. Split validation data into 32K-token chunks (1893 chunks)
  2. For each chunk: score with sliding-window eval, then train for 3 epochs
  3. SGD optimizer, base LR=0.002, momentum=0.9
  4. Cosine LR decay across chunks
  5. DDP gradient sync (all_reduce AVG), grad clip 1.0
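The protocol above can be sketched roughly as follows. This is an illustrative stand-in, not the PR's actual code: `ttt_eval`, the `model(tokens, targets=...)` loss interface, and the chunk list are hypothetical; only the hyperparameters (32K chunks, 3 epochs, SGD lr=0.002/momentum=0.9, cosine decay, grad clip 1.0) come from the list above.

```python
# Hedged sketch of score-first TTT: score each chunk before training on it.
import math
import torch

CHUNK_TOKENS = 32_768   # 32K-token validation chunks
EPOCHS_PER_CHUNK = 3
BASE_LR = 2e-3

def ttt_eval(model, chunks):
    """chunks: list of 1-D LongTensors; model(tokens, targets=...) returns a scalar loss."""
    opt = torch.optim.SGD(model.parameters(), lr=BASE_LR, momentum=0.9)
    total_loss, total_tokens = 0.0, 0
    for i, tokens in enumerate(chunks):
        # 1) Score first, under inference_mode, before any training on this chunk.
        with torch.inference_mode():
            loss = model(tokens[:-1], targets=tokens[1:]).item()
        total_loss += loss * (len(tokens) - 1)
        total_tokens += len(tokens) - 1
        # 2) Cosine LR decay across chunks.
        lr = BASE_LR * 0.5 * (1.0 + math.cos(math.pi * i / max(1, len(chunks) - 1)))
        for g in opt.param_groups:
            g["lr"] = lr
        # 3) Train only on the already-scored tokens.
        for _ in range(EPOCHS_PER_CHUNK):
            opt.zero_grad(set_to_none=True)
            model(tokens[:-1], targets=tokens[1:]).backward()
            # (under DDP, gradients are all-reduced with AVG here)
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            opt.step()
    return total_loss / total_tokens  # mean val loss in nats/token
```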

Architecture (PR #414 stack)

  • 11 layers, 512d, 8H, 4KV (GQA), relu²
  • SmearGate + BigramHash, XSA on last 4 layers
  • EMA(0.997) + Tight SWA, GPTQ-lite int6+zstd-22
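For the "GPTQ-lite int6" bullet, a minimal int6 round-trip can be sketched as below. This is an assumption for illustration (per-tensor symmetric quantization); the PR's actual GPTQ-lite scheme is more involved and is not shown here.

```python
# Hedged sketch: symmetric int6 quantize/dequantize round-trip.
import numpy as np

def int6_roundtrip(w: np.ndarray) -> np.ndarray:
    # int6 symmetric range is [-31, 31]; one scale per tensor (assumption).
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q * scale  # dequantized weights, as seen by the eval
```

The "Post-int6 roundtrip" row in the results below measures exactly this kind of quantize-then-dequantize degradation (1.1509 → 1.1590 BPB).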

Results

| Stage | val_loss | val_bpb |
| --- | --- | --- |
| Post-EMA (float) | 1.9433 | 1.1509 |
| Post-int6 roundtrip | 1.9570 | 1.1590 |
| Legal TTT (score-first) | 1.9262 | 1.1408 |
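The loss and bpb columns are consistent with a single conversion factor. The tokens-per-byte ratio (~0.4105) below is inferred from the table itself, not stated anywhere in the PR:

```python
# Sanity check of the implied nats/token -> bits/byte conversion.
import math

TOKENS_PER_BYTE = 0.4105  # inferred: val_bpb / (val_loss / ln 2), constant across all three rows

def nats_to_bpb(val_loss: float) -> float:
    return val_loss / math.log(2) * TOKENS_PER_BYTE
```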

Credits

xexyz added 3 commits March 25, 2026 01:05
30-epoch cosine pre-eval Test-Time Training on PR openai#414 consensus stack.
Adapts quantized model on validation data before sliding-window eval.

- Pre-TTT post-quant: 1.1594 BPB
- Post-TTT sliding (stride=64): 1.0988 BPB
- Total artifact: 15,900,191 bytes (under 16MB)
- 5434 training steps + 30ep TTT + sliding eval on 8xH100

Built on PR openai#414 by @signalrush. TTT recipe from PR openai#518/@sofiabod, PR openai#672/@andrewbaggio1.
Updated bytes_code from 71379 to 71596 to match train.log and actual wc -c.
Replaced pre-eval TTT with legal chunk-based score-first protocol:
- Score each 32K-token chunk first, then train on scored tokens
- SGD lr=0.002, momentum=0.9, 3 epochs/chunk, cosine LR
- Never trains on tokens before scoring them
- Added FA3 fallback (flash_attn_interface -> flash_attn -> SDPA)
- Fixed RoPE cache inference tensor issue between score/train phases

Legal TTT val_bpb: 1.1408, eval time: 617s on 8xH100 (SDPA).
Artifact size: 15,758,590 bytes (under 16MB).
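The FA3 fallback chain mentioned in the commit message (flash_attn_interface -> flash_attn -> SDPA) is commonly implemented as a try/except import ladder; a sketch of that pattern, not the PR's actual code:

```python
# Hedged sketch: prefer FlashAttention-3, then FlashAttention-2, then PyTorch SDPA.
try:
    from flash_attn_interface import flash_attn_func  # FlashAttention-3 (Hopper)
    ATTN_BACKEND = "fa3"
except ImportError:
    try:
        from flash_attn import flash_attn_func  # FlashAttention-2
        ATTN_BACKEND = "fa2"
    except ImportError:
        flash_attn_func = None
        ATTN_BACKEND = "sdpa"

def attention(q, k, v):
    """q, k, v: (batch, seq, heads, head_dim)."""
    if flash_attn_func is not None:
        return flash_attn_func(q, k, v, causal=True)
    import torch.nn.functional as F
    # SDPA expects (batch, heads, seq, head_dim)
    out = F.scaled_dot_product_attention(
        q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2), is_causal=True
    )
    return out.transpose(1, 2)
```

The reported 617s eval time was measured on the SDPA path, i.e. the last rung of this ladder.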
@xexyz xexyz changed the title PR #414 + 30-Epoch Cosine TTT (1.0988 BPB) PR #414 + Legal Score-First TTT (1.1408 BPB) Mar 25, 2026
@xexyz xexyz changed the title PR #414 + Legal Score-First TTT (1.1408 BPB) 11L EMA + GPTQ-lite + Legal Score-First TTT (1.1408 BPB) Mar 25, 2026
