
Record: Order-Adaptive 9-gram Backoff + Distributed Prefill — val_bpb 0.4405 (3-seed mean) #890

Open

sofiabod wants to merge 25 commits into openai:main from sofiabod:autoresearch/mar22

Conversation

@sofiabod

Record: Order-Adaptive 9-gram Backoff + Distributed Prefill — val_bpb 0.4405 (3-seed mean)

Results

Seed    val_bpb   Artifact            Eval time
42      0.4429    14,899,126 bytes    ~586s
1337    0.4381    14,740,261 bytes    ~588s
2024    0.4405    15,101,371 bytes    ~502s
Mean    0.4405
Std     0.0024
  • Artifact: < 16,000,000 bytes (all seeds)
  • Train: 600s on 8xH100 SXM
  • Eval: < 600s (all seeds)

Method

11-layer transformer (512d, 8/8 full MHA, XSA-all, LeakyReLU(0.5)², 3.5x MLP).
Order-adaptive entropy-gated 9-gram backoff cache with per-order entropy thresholds
and distributed cache prefill. Score-first, backward-looking, deterministic.

Architecture

  • 11L, 512d, full MHA 8/8, MLP 3.5x (1792), LeakyReLU(0.5)²
  • XSA on all 11 layers, partial RoPE 16/64
  • BigramHash(4096, 128d), SmearGate, VE128 on layers 9-10
  • Tied embeddings, logit softcap 30
  • EMA(0.997) + Tight SWA, Parallel Muon optimizer
  • int5 per-row quantization + zstd-22 compression
  • Early QAT (threshold 0.5)
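
As a rough illustration of the export path, a minimal per-row int5 quantize/dequantize sketch (symmetric scales; the 5-bit packing and zstd-22 stage are omitted, and all names here are illustrative rather than the PR's actual code):

```python
import torch

def quantize_int5_rows(w: torch.Tensor):
    # symmetric per-row int5: one fp scale per row maps the row's max
    # magnitude onto 15; quantized values are clamped to [-15, 15]
    scale = w.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 15.0
    q = torch.round(w / scale).clamp_(-15, 15).to(torch.int8)
    return q, scale

def dequantize_int5_rows(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale
```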

Eval-time N-gram Cache

  • Multi-order backoff, orders 2-9, 4M hash buckets per order
  • Dual hash tables per order: context counts + full (context+target) counts
  • Per-order entropy thresholds: {9: 2.6, 8: 2.8, 7: 3.0, 6: 3.2, 5: 3.5, 4: 3.8, 3: 4.2, 2: 4.5}
  • Entropy-adaptive alpha: 0.05 + 0.55 * sigmoid(2.0 * (H - threshold))
  • Alpha range [0.05, 0.60]: low entropy = trust neural, high entropy = trust n-gram
  • min_count=2, score-first (lookup then update per window)
  • Distributed prefill: each rank pre-warms cache with all preceding token positions
  • Sliding window eval with stride=32
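
A minimal sketch of the gating rule described above, assuming a hypothetical `cache.lookup(order, ctx)` that returns an n-gram distribution over the vocab plus the context count:

```python
import math

# per-order entropy thresholds, as listed above
THRESH = {9: 2.6, 8: 2.8, 7: 3.0, 6: 3.2, 5: 3.5, 4: 3.8, 3: 4.2, 2: 4.5}
MIN_COUNT = 2

def adaptive_alpha(entropy, order):
    # alpha in [0.05, 0.60]: low neural entropy -> trust the model,
    # high entropy -> lean harder on the n-gram cache
    sig = 1.0 / (1.0 + math.exp(-2.0 * (entropy - THRESH[order])))
    return 0.05 + 0.55 * sig

def blended_probs(neural_probs, cache, context):
    H = -sum(p * math.log(p) for p in neural_probs if p > 0.0)
    # backoff: try the longest context first, fall through on thin support
    for order in range(9, 1, -1):
        ctx = tuple(context[-(order - 1):])
        ngram_probs, ctx_count = cache.lookup(order, ctx)  # hypothetical API
        if ctx_count >= MIN_COUNT:
            a = adaptive_alpha(H, order)
            return [(1 - a) * pn + a * pg
                    for pn, pg in zip(neural_probs, ngram_probs)]
    return neural_probs  # no usable n-gram evidence: pure neural
```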

Key Insight

Distributed cache prefill is critical: without it, ranks 1-7 start with cold caches,
losing ~60% of the n-gram benefit. Prefill makes distributed eval equivalent to
single-GPU sequential eval. Combined with 9-gram orders (capturing longer repeated
phrases) and per-order entropy gating (trusting higher orders at lower uncertainty),
this yields a 0.69 BPB improvement over neural-only sliding-window eval.
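
A sketch of the prefill step under assumed rank/shard conventions (`cache.update` and the even sharding are illustrative, not the PR's actual code):

```python
def prefill_cache(cache, tokens, rank, world_size):
    # Each rank evaluates a contiguous shard of the validation stream. Without
    # prefill, rank r's cache has seen none of tokens[:start], so its early
    # windows behave as if the stream began cold. Replaying those positions
    # into the cache first makes every rank match sequential single-GPU eval.
    shard = len(tokens) // world_size
    start = rank * shard
    for pos in range(start):
        for order in range(2, 10):
            if pos >= order - 1:
                ctx = tuple(tokens[pos - order + 1 : pos])
                cache.update(order, ctx, tokens[pos])  # count (context -> target)
    return start
```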

Legality

  • Score-first n-gram cache: Each window batch: (1) lookup cache for predictions,
    (2) compute blended loss, (3) update cache with window tokens. Cache only uses
    backward-looking tokens that have already been scored. No future data access.
  • Alpha depends on model entropy only: The mixing weight uses the neural model's
    output entropy, not the target token. No oracle/hindsight selection.
  • No TTT: Test-time training is disabled (TTT_EPOCHS=0).
  • No GPTQ at eval time: Quantization completes within the training budget.
  • No reordering: Evaluation set processed in original sequential order.
  • Deterministic: Given the same seed, produces identical results.
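
Tying these rules together, an illustrative score-first eval loop (reusing `blended_probs` from the sketch above; `model` and `cache.update` are stand-ins, and first-window bookkeeping is elided):

```python
import math

STRIDE, SEQ_LEN = 32, 2048

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    z = sum(exps)
    return [e / z for e in exps]

def eval_shard(model, cache, tokens):
    nats, scored = 0.0, 0
    for end in range(SEQ_LEN, len(tokens) + 1, STRIDE):
        window = tokens[end - SEQ_LEN : end]
        logits = model(window)  # one row of next-token logits per position
        # (1) lookup + (2) blended loss, for the trailing STRIDE tokens only
        for i in range(SEQ_LEN - STRIDE, SEQ_LEN):
            probs = blended_probs(softmax(logits[i - 1]), cache, window[:i])
            nats -= math.log(max(probs[window[i]], 1e-12))
            scored += 1
        # (3) cache update strictly after scoring: backward-looking only
        for i in range(SEQ_LEN - STRIDE, SEQ_LEN):
            for order in range(2, 10):
                cache.update(order, tuple(window[i - order + 1 : i]), window[i])
    return nats / max(scored, 1) / math.log(2)  # bits per scored token
```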

Acknowledgments

Huge thanks to the incredible community.

sofiabod added 25 commits March 18, 2026 14:34
- add BigramHash(2048,128) with zero-init and learnable scale
- add SmearGate: per-dim gate blending with prev token
- weight decay 0.04 on Muon (leaderboard standard)
- muon_momentum 0.99 (from 0.95, leaderboard standard)
- best config baked in: 7L mlp_mult=3 seq_len=4096 etc
- bigram/smear params explicitly added to optimizer groups
- add forward_logits() method to GPT for eval without loss computation
- add eval_val_sliding() with configurable stride (default 64)
- each scored token gets ~4032 tokens of context instead of ~2048 average
- eval-only change: no training modifications, no artifact size change
- expected ~0.03 BPB improvement in reported score
- init attn_scale and mlp_scale to 1/sqrt(layer_idx+1) instead of 1.0
- deeper layers get smaller residual contributions, stabilizes training
- zero extra params, zero compute overhead
- used by all top submissions per vault research
- apply rotary embeddings to first 16 dims of 64 head_dim (25%)
- remaining 48 dims are position-free, improving generalization
- zero extra params, used by all top submissions per vault research
- configurable via ROPE_DIMS env var (0=all, default=16)
- TTT: 5 epochs at lr=0.0005 (matching SOTA PR openai#442)
- use DDP model for TTT forward pass to sync gradients across GPUs
- shard validation tokens across ranks for proper distributed TTT
- batch size 4 seqs/GPU, modal timeout 1800s
- legal score-first TTT: score chunk, then adapt on scored tokens (1 seq to avoid OOM)
- SGD+momentum, freeze early 2 blocks, 3 epochs, lr=0.005, adapt every 4 batches
- GPTQ-lite: test 5 clip percentiles per row, pick best MSE
- Tight SWA: collect 12 checkpoints when lr_scale<0.2, average before export
- int8 with SWA+GPTQ: 1.1787 (improved from 1.1802)
- 11 layers, XSA on last 4, int6 quantization + zstd-22
- EMA(0.997), GPTQ-lite, Tight SWA, Late QAT@0.15
- Partial RoPE 16/64, LN Scale 1/sqrt(layer+1)
- SmearGate + BigramHash(2048,128), VE128 on layers 9,10
- Muon WD=0.04, momentum=0.99, matrix_lr=0.025
- SDPA fallback (no FA3), batch 786K, seq 2048
- add zstandard to Modal image
- flash-attn requires GPU for compilation, Modal builds without GPU
- keeping SDPA fallback, ~101ms/step
- still have FA3 import attempt in code for when it becomes available
- attempt flash-attn pip install at runtime with 120s timeout
- still falls back to SDPA if install fails
- 101ms/step with SDPA, ~84ms with FA3
- replace relu(x)^2 with leaky_relu(x, 0.5)^2
- PR openai#493 reaches 1.1309 with partial stack using this activation
- untried on full openai#414 stack — could give -0.002 to -0.005 BPB
- zero param cost, zero speed overhead
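
For reference, the activation swap is a one-liner in PyTorch:

```python
import torch
import torch.nn.functional as F

def relu2(x: torch.Tensor) -> torch.Tensor:       # baseline: relu(x)^2
    return F.relu(x).square()

def leaky_relu2(x: torch.Tensor) -> torch.Tensor: # this PR: leaky_relu(x, 0.5)^2
    # negative inputs now contribute 0.25 * x^2 instead of exactly zero,
    # keeping gradient signal alive on the negative side
    return F.leaky_relu(x, negative_slope=0.5).square()
```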
…enai#486)

- 30 epochs AdamW(lr=0.0005) on val tokens with cosine LR decay
- per-layer LR: 3x for mlp.proj (high quant error), 0.5x for mlp.fc
- DDP gradient sync via all_reduce(AVG) + grad clip 1.0
- keep LeakyReLU(0.5)^2 from exp48
- expected: ~0.06 BPB gain (1.127 → ~1.07)
- modal timeout 3600s for 30-epoch TTT
- TTT_MODE=preeval (default): bulk train then score (max BPB, may be invalid)
- TTT_MODE=legal: score chunk first, then train on scored tokens (valid for records)
- legal TTT unfreezes last 2 blocks + norms + scales + embeddings
- 1528 lines (over 1500 baseline limit but OK for records folder)
…ual hash tables, per-window score-first, entropy-adaptive alpha, tc>0 check)
…gramHash 6144, int5, stride=32) + 9-gram prefill
