
11L EMA + GPTQ-lite + Legal Score-First TTT (1.1408 BPB) #691

Open
xexyz wants to merge 3 commits into openai:main from xexyz:xexyz/pr414-cosine-ttt-30ep

Conversation


@xexyz xexyz commented Mar 25, 2026

Summary

  • val_bpb: 1.1408 (legal score-first TTT, stride-64)
  • Artifact size: 15,758,590 bytes (under 16MB)
  • 8xH100 SXM, seed=1337

Technique

Legal chunk-based, score-first TTT on the PR #414 consensus stack. Each validation chunk is scored first (under inference_mode), and only then does the model train on the already-scored tokens; the model never trains on tokens it has not yet scored.

Legal TTT Protocol

  1. Split validation data into 32K-token chunks (1893 chunks)
  2. For each chunk: score with sliding-window eval, then train for 3 epochs
  3. SGD optimizer, base LR=0.002, momentum=0.9
  4. Cosine LR decay across chunks
  5. DDP gradient sync (all_reduce AVG), grad clip 1.0
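The protocol above can be sketched roughly as follows. This is an illustrative stand-in, not the PR's actual code: `ttt_eval`, the `model(tokens, targets=...)` loss interface, and the chunk list are hypothetical; only the hyperparameters (32K chunks, 3 epochs, SGD lr=0.002/momentum=0.9, cosine decay, grad clip 1.0) come from the list above.

```python
# Hedged sketch of score-first TTT: score each chunk before training on it.
import math
import torch

CHUNK_TOKENS = 32_768   # 32K-token validation chunks
EPOCHS_PER_CHUNK = 3
BASE_LR = 2e-3

def ttt_eval(model, chunks):
    """chunks: list of 1-D LongTensors; model(tokens, targets=...) returns a scalar loss."""
    opt = torch.optim.SGD(model.parameters(), lr=BASE_LR, momentum=0.9)
    total_loss, total_tokens = 0.0, 0
    for i, tokens in enumerate(chunks):
        # 1) Score first, under inference_mode, before any training on this chunk.
        with torch.inference_mode():
            loss = model(tokens[:-1], targets=tokens[1:]).item()
        total_loss += loss * (len(tokens) - 1)
        total_tokens += len(tokens) - 1
        # 2) Cosine LR decay across chunks.
        lr = BASE_LR * 0.5 * (1.0 + math.cos(math.pi * i / max(1, len(chunks) - 1)))
        for g in opt.param_groups:
            g["lr"] = lr
        # 3) Train only on the already-scored tokens.
        for _ in range(EPOCHS_PER_CHUNK):
            opt.zero_grad(set_to_none=True)
            model(tokens[:-1], targets=tokens[1:]).backward()
            # (under DDP, gradients are all-reduced with AVG here)
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            opt.step()
    return total_loss / total_tokens  # mean val loss in nats/token
```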

Architecture (PR #414 stack)

  • 11 layers, 512d, 8H, 4KV (GQA), relu²
  • SmearGate + BigramHash, XSA on last 4 layers
  • EMA(0.997) + Tight SWA, GPTQ-lite int6+zstd-22
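For the "GPTQ-lite int6" bullet, a minimal int6 round-trip can be sketched as below. This is an assumption for illustration (per-tensor symmetric quantization); the PR's actual GPTQ-lite scheme is more involved and is not shown here.

```python
# Hedged sketch: symmetric int6 quantize/dequantize round-trip.
import numpy as np

def int6_roundtrip(w: np.ndarray) -> np.ndarray:
    # int6 symmetric range is [-31, 31]; one scale per tensor (assumption).
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q * scale  # dequantized weights, as seen by the eval
```

The "Post-int6 roundtrip" row in the results below measures exactly this kind of quantize-then-dequantize degradation (1.1509 → 1.1590 BPB).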

Results

| Stage | val_loss | val_bpb |
| --- | --- | --- |
| Post-EMA (float) | 1.9433 | 1.1509 |
| Post-int6 roundtrip | 1.9570 | 1.1590 |
| Legal TTT (score-first) | 1.9262 | 1.1408 |
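The loss and bpb columns are consistent with a single conversion factor. The tokens-per-byte ratio (~0.4105) below is inferred from the table itself, not stated anywhere in the PR:

```python
# Sanity check of the implied nats/token -> bits/byte conversion.
import math

TOKENS_PER_BYTE = 0.4105  # inferred: val_bpb / (val_loss / ln 2), constant across all three rows

def nats_to_bpb(val_loss: float) -> float:
    return val_loss / math.log(2) * TOKENS_PER_BYTE
```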

Credits

xexyz added 3 commits March 25, 2026 01:05
30-epoch cosine pre-eval Test-Time Training on PR openai#414 consensus stack.
Adapts quantized model on validation data before sliding-window eval.

- Pre-TTT post-quant: 1.1594 BPB
- Post-TTT sliding (stride=64): 1.0988 BPB
- Total artifact: 15,900,191 bytes (under 16MB)
- 5434 training steps + 30ep TTT + sliding eval on 8xH100

Built on PR openai#414 by @signalrush. TTT recipe from PR openai#518/@sofiabod, PR openai#672/@andrewbaggio1.
Updated bytes_code from 71379 to 71596 to match train.log and actual wc -c.
Replaced pre-eval TTT with legal chunk-based score-first protocol:
- Score each 32K-token chunk first, then train on scored tokens
- SGD lr=0.002, momentum=0.9, 3 epochs/chunk, cosine LR
- Never trains on tokens before scoring them
- Added FA3 fallback (flash_attn_interface -> flash_attn -> SDPA)
- Fixed RoPE cache inference tensor issue between score/train phases

Legal TTT val_bpb: 1.1408, eval time: 617s on 8xH100 (SDPA).
Artifact size: 15,758,590 bytes (under 16MB).
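The FA3 fallback chain mentioned in the commit message (flash_attn_interface -> flash_attn -> SDPA) is commonly implemented as a try/except import ladder; a sketch of that pattern, not the PR's actual code:

```python
# Hedged sketch: prefer FlashAttention-3, then FlashAttention-2, then PyTorch SDPA.
try:
    from flash_attn_interface import flash_attn_func  # FlashAttention-3 (Hopper)
    ATTN_BACKEND = "fa3"
except ImportError:
    try:
        from flash_attn import flash_attn_func  # FlashAttention-2
        ATTN_BACKEND = "fa2"
    except ImportError:
        flash_attn_func = None
        ATTN_BACKEND = "sdpa"

def attention(q, k, v):
    """q, k, v: (batch, seq, heads, head_dim)."""
    if flash_attn_func is not None:
        return flash_attn_func(q, k, v, causal=True)
    import torch.nn.functional as F
    # SDPA expects (batch, heads, seq, head_dim)
    out = F.scaled_dot_product_attention(
        q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2), is_causal=True
    )
    return out.transpose(1, 2)
```

The reported 617s eval time was measured on the SDPA path, i.e. the last rung of this ladder.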
@xexyz xexyz changed the title PR #414 + 30-Epoch Cosine TTT (1.0988 BPB) PR #414 + Legal Score-First TTT (1.1408 BPB) Mar 25, 2026
@xexyz xexyz changed the title PR #414 + Legal Score-First TTT (1.1408 BPB) 11L EMA + GPTQ-lite + Legal Score-First TTT (1.1408 BPB) Mar 25, 2026
