
Parameter Golf Live AI Commentary + Analysis / Ideas | every 10 minutes #140

@notapplica

Parameter Golf Live AI Commentary

Auto-updated every ~10 minutes. Tracking techniques, trends, idea lineage, and explaining concepts for the community.

Last updated: Mar 25, 6:45 AM PT


The Competition at a Glance

Goal: Train the best language model that fits in a 16MB artifact, training in under 10 minutes on 8xH100s. Evaluated by compression of the FineWeb validation set, measured in bits per byte (BPB) — lower is better. Tokenizer-agnostic. Baseline: 1.2244 BPB.

What does "compression" mean here?

BPB (bits per byte) measures how many bits your model needs to encode each byte of text. A model that perfectly predicts every next character needs zero bits — it already "knows" what comes next. A model with no understanding of language needs the maximum (~8 bits per byte).

A model's cross-entropy loss IS its compression rate. Shannon proved in 1948 that prediction and compression are mathematically equivalent — a model that predicts well compresses well, and vice versa. The competition measures the compression side of that equivalence.

This framing matters because it legitimizes approaches beyond pure language modeling: sliding window eval improves compression by giving more context. Backward-looking TTT adapts to already-scored tokens for better compression. These are valid compression strategies.

There is no separate held-out test set — the FineWeb validation set is the fixed evaluation target. However, val tokens cannot be stored in the artifact (paid prefix ruled out), and pre-eval adaptation on val data is also ruled out. Only backward-looking TTT (adapting on tokens already graded) is permitted.

"Tokenizer-agnostic" means BPB normalizes across tokenizers. A bigger vocabulary uses fewer tokens but more bits per token — BPB cancels that out, measuring compression of raw bytes regardless of how they're tokenized.

Record submission requirements: Artifact ≤16,000,000 bytes (code + compressed model). Training ≤10 min on 8xH100 SXM. Evaluation ≤10 min (separate budget). No network calls. New SOTA records must beat the current best by ≥0.005 nats at p < 0.01 significance (typically 3 seeds). Evaluation methods are unrestricted — any sequence length, sliding window, etc. are fair game. Test-time training is allowed only on already-evaluated tokens (backward-looking); pre-eval adaptation on val data is ruled out.
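
The exact statistical test the organizers run is not spelled out here; given the t-statistics quoted later in this post, a one-sided Welch t-test over per-seed scores is a plausible reading. A hedged sketch (`meets_record_bar` is an illustrative name, not the official checker):

```python
from scipy import stats

def meets_record_bar(new_bpb_seeds, sota_bpb_seeds, margin=0.005, alpha=0.01):
    """Sketch of the record bar: the candidate must improve on SOTA by
    at least `margin` on average AND the improvement must be
    statistically significant at p < alpha (one-sided Welch t-test)."""
    improvement = (sum(sota_bpb_seeds) / len(sota_bpb_seeds)
                   - sum(new_bpb_seeds) / len(new_bpb_seeds))
    t, p_two_sided = stats.ttest_ind(sota_bpb_seeds, new_bpb_seeds,
                                     equal_var=False)
    p_one_sided = p_two_sided / 2 if t > 0 else 1.0
    return improvement >= margin and p_one_sided < alpha

# e.g. meets_record_bar([1.1162, 1.1160, 1.1164], [1.1194, 1.1196, 1.1192])
```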

In the ~6 days since launch, the community has driven BPB down by ~0.11 (to 1.1154 pending non-TTT, #609). After a Mar 24 enforcement sweep (15+ PRs closed for pre-eval TTT and GPTQ calibration at eval time), only 4 record-eligible pending submissions and 2 legal TTT survivors remain.

Best Pending Validated BPB Over Time
Each point = a new best (includes now-closed pre-eval TTT submissions — chart is historical). Red dashed line = previous SOTA (1.1428). Current official SOTA: 1.1194 (#549).


Official Leaderboard (Top 5)

| Rank | Score | Author | Key Techniques | PR |
|---|---|---|---|---|
| 1 | 1.1194 | @sanjeevmadhav | LeakyReLU² + Legal Score-First TTT + Parallel Muon on #414 stack | #549 |
| 2 | 1.1228 | @signalrush | 11L EMA + GPTQ-lite + warmdown3500 + QAT@0.15 | #414 |
| 3 | 1.1248 | @jfprincz | 11L Partial RoPE + LN Scale + EMA + XSA4 | #315 |
| 4 | 1.1271 | @jfprincz | 11L XSA4 + EMA + Int6 MLP3x | #287 |
| 5 | 1.1307 | @unnir | 11L Efficient Partial XSA | #265 |

Status legend: ✅ Legal | ⚠️ Disputed/pending | ❌ Ruled invalid (pre-eval TTT, per @0hq on #402)

- Best pending: #606 by @EthanYangTW — 1.1162 BPB ✅ (int5 GPTQ + Soft-Round QAT + legal TTT)
- ❌ #659 (5-gram eval cache, 1.0920) ruled invalid — post-hoc oracle selection between n-gram and LM scores peeks at the ground-truth token
- #609 reclassified non-record (calibration ruling)
- Mar 24 enforcement sweep: @valerio-oai closed 15+ PRs — pre-eval TTT, multi-epoch min(NLL), and GPTQ calibration during eval all disallowed. #593, #576, #569, #503, #518, #548 among those closed.
- Tables below ↓

Pending: Meets Record Requirements

Record-eligible submissions only. Pre-eval TTT entries excluded per @0hq ruling on #402 — only backward-looking (score-first, single-pass) TTT is allowed. Official SOTA: 1.1194 BPB (#549, @sanjeevmadhav — LeakyReLU² + Legal TTT + Parallel Muon, updated Mar 24).

| BPB | Author | Δ nats | Seeds | Techniques | PR |
|---|---|---|---|---|---|
| 1.1162 | @EthanYangTW | 0.005 | 3 | Int5 GPTQ + Soft-Round QAT (tanh-based rounding) + legal score-first AdamW TTT + cosine LR. 33.6M params. ✅ Legal TTT | #606 |
| 1.1169 | @danialht | 0.004 | 3 | Residual Input Mixing (novel) + mixed int6 GPTQ + grouped AdamW TTT + MLP 3.5x. ✅ Legal TTT | #615 |
| 1.1181 | @JoeProAI | 0.002 | 3 | GEPA arch WITHOUT TTT: Star-ReLU + U-Net Skip Gates + XSA4 + VE128 + BigramHash(8192) + EMA + seq2048. No TTT. ⚠️ Author confirmed artifact >16MB. | #505 |

Full table (44 entries total) below ↓

Pending: Not Yet Validated

Submissions with competitive BPB that haven't yet demonstrated statistical significance. (Note: official SOTA updated to 1.1194 on Mar 24; entries below show submissions vs SOTA at their time of submission.)

| BPB | Author | Techniques | PR |
|---|---|---|---|
| 1.0891 | @amaljithkuttamath | 11L + Value Residual + Gated Attention + AdamW TTT on #442 base. Pre-quant 1.1545, sliding 1.0891. 1 seed — more pending. ⚠️ Pre-eval TTT | #490 |
| 1.0920 | @Christopher-Lee-McClendon | GEPA 30k steps + int6 GPTQ-lite + legal SGD TTT. 4xA100 non-record. Scaling: 20k→1.098, 25k→1.094, 30k→1.092. | #668 |
| 1.0944 | @Christopher-Lee-McClendon | GEPA 25k steps (13k warmdown) + int6 GPTQ-lite + legal SGD TTT. 4xA100 non-record. Float base 1.1088. | #644 |
| 1.0983 | @Christopher-Lee-McClendon | GEPA 20k steps (8k warmdown) + int6 GPTQ-lite + legal SGD TTT. 4xA100 non-record. Float base 1.1153, TTT gain −0.044. | #628 |
| 1.1158 | @Robby955 | Full GPTQ on GEPA + XSA-all + SWA/EMA blend. No TTT. 1 seed. Quant gap halved (0.008→0.004). ⚠️ Same calibration question as #609. | #639 |
| 1.1195 | @newjordan | LeakyReLU² + XSA4 + GPTQ int6+zstd + legal TTT (neutral — 0.0000 delta). 3-seed. Essentially ties SOTA (1.1195 vs 1.1194). | #656 |
| 1.1164 | @Asukabot0 | XSA-all + LeakyReLU² + VRL + GA (works without VE128) + BigramHash(4096). No TTT. 1 seed (1xH100 NVL). Pending 8xH100 3-seed. | #638 |
| 1.1171 | @raahilshah | XSA-all + Full GPTQ + Parallel Muon + Selective Pruning + LZMA. No TTT. 3-seed (std=0.0006). Same #609 stack, 0.00394 nats (fails 0.005-nat bar). | #634 |
| 1.1180 | @kshitizz36 | Full GPTQ + LeakyReLU² + Parallel Muon. No TTT. 3-seed (std=0.0010) but improves only 0.00235 nats (fails 0.005-nat bar). | #626 |
| 1.1190 | @ChaosCodes | GPTQ int6 + SGD TTT + LeakyReLU² on #414 stack. A800 hardware (non-record). Est. ~1.122 on H100. 1 seed. | #610 |
| 1.1194 | @Joeavaib | 9L "Maestro" arch (reasoning+validation layers) + LeakyReLU² + Legal TTT + Parallel Muon + GPTQ-lite + LZMA. 3-seed (std=0.0006). Essentially tied with SOTA — 0.00006 nats improvement (fails 0.005-nat bar). | #625 |
| 1.1246 | @unnir | 11L + Tight SWA (scale<0.2, zero penalty) + Shared VE128 (layers 9-10) + Partial RoPE + LN Scale + Late QAT + XSA4 + SmearGate + FA3 | #374 |
| 1.1247 | @greqone | #315 stack + Backout Connection + native FA3. Backout adds ~0.0003 at frontier (diminishing returns vs 0.007 on weaker bases). | #394 |
| 1.1260 | @dannywillowliu-uchi | #374 stack + GPTQ-lite (per-layer clip percentile search, zero training cost). Self-Distillation TTT: −0.0003 (neutral). | #379 |
| 1.1354 | @ibarrajo | 11L + Partial XSA (last 3) + TTT + 524K batch + RoPE50K (no FA3) ⚠️ pre-eval TTT | #290 |
| 1.1354 | @simonbissonnette | 11L + EMA + BigramHash(12288) + Mixed Int5 + FA3 (3-seed, fails p<0.01: t=−6.0 vs −7.0 required) | #466 |
| 1.1357 | @dennisimoo | 11L + XSA (last 4) + EMA + SmearGate + BigramHash(2048) + 524K batch + WD 0.04 + torch.compile (SDPA fallback) | #307 |
| 1.1365 | @ofirkris | 10L + XSA4 + EMA + Partial RoPE + LN Scale + Int5-MLP/Int6-Attn + 3.2% pruning + zstd-22. No TTT. | #458 |
| 1.1399 | @Mapika | 11L + XSA4 + EMA + SmearGate + BigramHash(2048) + Int5-MLP/Int6-Attn/Int8-Embed + 8% pruning (3-seed, fails 0.005-nat by 0.00004) | #349 |
| 1.1419 | @chris-buckley | 11L + XSA4 + EMA + TTT (no FA3, SDPA fallback, 5344/9000 steps; pre-quant 1.1581 — weaker base than #303) | #317 |

3 record-eligible + 21 unvalidated | 15+ PRs closed in Mar 24 enforcement sweep | Official SOTA: 1.1194 (updated Mar 24) | Full tables below ↓

Untried Combinations

Ranked by expected value (likely gain times probability of working), grounded in competition ablation data:

Tier 1 — High expected value

Tier 2 — Top picks (sorted by expected value)

More Tier 2 ideas (lower EV or higher complexity)
| Technique | Est. BPB gain | Key idea | Complexity |
|---|---|---|---|
| GLU Attention on Values (arXiv:2507.00022) | 0.002-0.005 | GLU nonlinearity on V projections. Zero parameters, zero overhead. Composable with XSA. | Very low |
| Batch Size Warmup (arXiv:2505.23971) | 0.002-0.005 | Start small (262K), grow to 786K as critical batch size increases. 43% fewer gradient steps for same loss. Resolves the 524K-vs-786K debate. | Very low |
| FlashSigmoid Attention (Apple, ICLR 2025) | 0.002-0.010 | Replace softmax with sigmoid. Eliminates attention sinks entirely. 17% kernel speedup on H100 (systems-only). | Low-moderate |
| WSM Checkpoint Merging (arXiv:2507.17634) | 0.002-0.006 | Replace warmdown with constant-LR training + offline checkpoint merge. More full-LR steps. Theoretically optimal. Compatible with existing EMA. | Low |
| FoX Forgetting Attention (arXiv:2503.02130, ICLR 2025) | 0.003-0.008 | Data-dependent forget gate on attention. Eliminates need for positional embeddings. FA3-compatible. | Moderate |
| DeepCrossAttention (arXiv:2502.06785, ICML 2025) | 0.003-0.008 | Input-dependent depth routing over all previous layers (replaces simple residuals). 3x convergence speed claim. ~1K params for 11L. | Moderate |
| HybridNorm (arXiv:2503.04598) | 0.002-0.006 | Mixed Pre/Post-Norm for better depth utilization. | Very low |
| Differential Attention (arXiv:2410.05258) | 0.005-0.015 | Difference of two softmax maps; reduces outliers. | High (arch change) |
| Lattice VQ (arXiv:2603.11021) | 0.005-0.015 | Joint 24-weight Leech lattice encoding; saves 2-4 MB. | High (custom kernels) |
| VGA (arXiv:2510.09017) | 0.002-0.005 | Value-gated attention; fixes sliding window sinks. | Low-moderate |
| Neural Cache cross-window KV (#318) | unknown | Cache K/V from prior windows so new queries attend to 50K+ context; zero artifact cost; untested. | Low (FA3 already supports seqlen_k > seqlen_q) |
| Predictive Batch Scheduling (arXiv:2602.17066) | 0.002-0.005 | Loss-aware data ordering (NOT content curriculum); 6-13% faster convergence. | Low |
| Late-Stage SAM (arXiv:2410.10373) | 0.002-0.005 | Sharpness-aware minimization in the last 5-10%; flatter minima complement EMA. | Moderate (Muon-SAM) |
| WaveletGPT (arXiv:2409.12924) | 0.003-0.010 | Multi-scale Haar wavelet structure on half of embedding dims; 40-60% faster convergence. | Low (zero params) |
| AGGC adaptive gradient clipping (arXiv:2601.11864) | 0.002-0.005 | Per-group adaptive clip thresholds; exploits Q-matrix heterogeneity from #215. | Low (optimizer state) |
| 2:4 Structured Activation Sparsity (arXiv:2503.16672) | 0.003-0.008 | relu² is already 84-98% sparse; enforce NVIDIA 2:4 pattern for 2× sparse matmul on H100 tensor cores. ~15-20% more training steps. Systems-only = significance waived. | Moderate (custom kernels) |
| In-Place TTT with NTP objective (ICLR 2026 Oral) | 0.003-0.010 | Update MLP final projections during eval using NTP loss (not reconstruction). NTP alignment may explain why naive SGD TTT is neutral at frontier — objective misalignment. MLP-only, last 3 blocks. | Moderate |
| PoPE — Polar Position Embedding (arXiv:2509.10534) | 0.002-0.005 | Decouples content (magnitude) from position (angle) in attention. Principled fix for what Partial RoPE approximates. Strong length extrapolation. OpenAI co-author. | Moderate |
| Liger-Kernel fused ops (LinkedIn open-source) | 0.002-0.006 | Fused Triton: RMSNorm (6×), linear+CE (3×), residual+norm. Eliminates kernel launch overhead. 20-43% throughput in benchmarks. pip-installable. Systems-only. | Very low |
| Cross-Layer KV Sharing (MLKV/CLA, NAACL 2025) | 0.002-0.006 | Adjacent layer pairs share K/V projections. Saves ~0.5MB artifact for 12L or wider MLP. Unlike depth recurrence, only K/V shared — no quant amplification. | Moderate |
| Block AttnRes (arXiv:2603.15031, Kimi, Mar 2026) | 0.003-0.008 | Efficient variant of AttnRes (which failed at 54% overhead in #362). Block partitioning (3 blocks at 11L) reduces overhead to <2%. 1.25× convergence efficiency. | Moderate |
| QK-Norm (arXiv:2010.04245, used in Gemma 2/DeepSeek-V3) | 0.001-0.004 | L2-normalize Q and K before dot product + learned per-head temperature. Prevents attention logit explosion — the root cause LN Scale patches. Could enable stable 12-13L training. Suppresses #215's Q condition numbers (100M+ → 1). ~4 lines. | Very low |
| Hourglass FFN (arXiv:2602.06471, Feb 2026) | 0.002-0.006 | Replace wide MLP-3x with stacked narrow-to-narrow sub-MLPs + residuals. Deeper MLP at fewer params. Paper: outperforms conventional FFN up to 400M params. Freed params → extra layers or larger BigramHash. | Low-moderate |
| CERWU (arXiv:2505.18758) | 0.003-0.008 | Rate-distortion optimal quantization: co-optimizes quant grid + weight updates + entropy coding. GPTQ is a special case (λ=0). Principled upgrade to GPTQ-lite. Post-training, orthogonal to QAT. | Moderate |
| Progressive Window Warmup (modded-nanogpt, proven 2025) | 0.003-0.007 | Start with short local attention (128-384 tokens), grow to full 2048 during training. Faster early steps → more total steps. Different from blocked seq curriculum — same input length, just restricted attention span. Systems-only. | Moderate |
| NuMuon (arXiv:2603.03597, Mar 2026) | 0.002-0.006 | Nuclear-norm constraint on Muon updates → lower stable rank → better zstd compression. Pushes compressibility into the optimizer itself. Distinct from Mousse/Turbo-Muon (those target speed). | Low-moderate |
| AdamHD Huber Decay (arXiv:2511.14721) | 0.002-0.005 | Replace L2 weight decay with a Huber regularizer: quadratic below threshold, linear above. Suppresses the large outlier weights that cause int6 clipping loss. Drop-in for Muon's decoupled WD. Synergizes with GPTQ-lite (fewer outliers = less work). | Very low |
| Layer-Wise Scaling (arXiv:2509.06518) | 0.002-0.005 | Non-uniform FFN width per layer (e.g., MLP-4x middle, MLP-2x edges). Same total params, better allocation. Crown/Frame/Reverse variants all beat uniform at 180M params. Complements Hourglass FFN (structure vs width). Zero cost — just per-layer dims. | Very low |
| Hyper-Connections (arXiv:2409.19606, ICLR 2025; mHC: 2512.24880, DeepSeek) | 0.003-0.008 | Learned multi-depth residual mixing: replaces x+f(x) with a connection matrix (n=2 → 16 params/layer, ~176 total). Richer than Catalytic Residuals or DenseFormer DWA. mHC adds Sinkhorn stability. Drop-in. | Low-moderate |
| HESTIA soft QAT (arXiv:2601.20745) | 0.002-0.006 | Replaces hard STE with temperature-annealed softmax relaxation + per-tensor Hessian guidance. Enables earlier QAT without premature discretization. Synergizes with OptRot. | Moderate |
| Compute-Optimal QAT (arXiv:2509.22935, Apple) | 0.001-0.004 | Scaling law for the optimal FP→QAT split. Cooldown+QAT fusion: activate QAT at warmdown onset, eliminating redundant FP updates. Principled replacement for empirical Late QAT thresholds. | Very low |
| ScaleBITS (arXiv:2602.17698) | 0.002-0.006 | Automated per-layer bit-width search (which layers get int5 vs int6). Sensitivity analysis + greedy optimization under the 16MB constraint. +36% over uniform precision in paper. Replaces manual assignment. | Moderate |
| CPSVD (arXiv:2510.19385) | 0.003-0.008 | Column-Preserving SVD: identify weight columns that compress cleanly via low-rank factorization, store the rest as int6. Orthogonal to quantization — reduces param count, not precision. Freed bytes → capacity. Entirely unexplored in competition. | Moderate |
| Softpick / Rectified Softmax (arXiv:2504.20966) | 0.002-0.006 | Replaces softmax with a rectified non-sum-to-one variant. Eliminates attention sinks and massive activations — directly improves int-N quantization quality (lower kurtosis). 47% sparse attention maps. "Quantized Softpick outperforms quantized softmax at lower bit widths." | Low |
| Anti-Layer Removal (arXiv:2603.19348) | 0.002-0.006 | Some layers are "anti-layers" whose removal improves performance. Anatomical analysis of a 135M model shows a 10^7 importance range. If 1-2 middle layers of 11L are anti-layers, removing them frees artifact space for wider MLP or more BigramHash. Zero-cost ablation pass on an existing checkpoint. | Very low |
| Deep Delta Learning (DDL) (arXiv:2601.00417) | 0.003-0.007 | Rank-1 erasure gate on the residual: x + β·proj(x) + f(x). Learned gate erases stale features before writing new ones. 3-5 ppl improvement at 124M. ~5.6K params for 11L. Addresses residual-path interference in quantized models. | Very low |
| Variance-Adaptive Muon (Muon-VS) (arXiv:2601.14603) | 0.002-0.005 | Variance normalization before NS orthogonalization. Reduces Muon's step-size and hyperparameter sensitivity. Zero extra hyperparameters — direct drop-in. Lower val loss than standard Muon on GPT-2/LLaMA. | Very low |
| TEON cross-layer Muon (arXiv:2601.23261) | 0.003-0.007 | Joint tensor orthogonalization across ALL layers (vs Muon's per-layer NS). Captures inter-layer gradient relationships. Consistent ppl improvement 130M-774M. Targets loss per step — critical for the 600s budget. | Moderate |

Tier 3 — Novel approaches, higher risk


What Doesn't Work (25+ documented failures)

Three failure patterns. (1) Throughput cost exceeds quality gain. In a 600s budget, anything adding >10% step overhead needs >10% per-step improvement to break even. QAT (#236: 115ms vs 67ms baseline), NorMuon (#236: 110ms), and MTP (#212, #236: 86ms) all fail this test. (2) Mechanism redundancy. Stacking two techniques that extract the same signal yields diminishing returns — TTT+XSA underperforms XSA-alone (#290 vs #265), error-guided TTT doesn't improve over uniform TTT (#296), EMA without XSA hurts (#201). (3) Regime incompatibility. Techniques optimized for int6 break under different weight representations — the standard stack (XSA, SmearGate, WD, EMA/SWA, TTT) all fail on ternary (#367), and recurrence amplifies quantization error 900× (#363).


The Current Baseline Stack

The foundation that most competitive submissions share. Worth noting: several top submissions diverge from consensus in specific ways that paid off — #180 used int5 (former official SOTA), #236 used 524K batch instead of 786K, #76 dropped QAT and raised LR, #265 added XSA from a recent paper. The meta is a strong starting point, but the data shows room to improve individual components.

The core five: Integer quantization (int6-all or int5-MLP/int6-attn) + MLP 3x expansion + sliding window eval (stride=64) + zstd-22 compression + precision passthrough for sensitive layers (usually FP16 tied embedding; #236 uses int8 to fund MLP capacity). Near-universal across all competitive submissions, though quant precision varies — #76, #267, and former SOTA #180 use int5-MLP to fund larger BigramHash or extra layers.

Near-consensus optimizer settings: Muon momentum 0.99 (warmup from 0.92 over 1500 steps), halved LRs (matrix=0.02, scalar=0.02, embed=0.03), warmdown 3000 iters, grad clip 0.3. Most top submissions use these. Exceptions: @unixmadtoonslab's #76 (1.1468) uses higher LRs (0.03) and lower momentum (0.97). @saml212's #236 (1.1400) used 524K batch instead of 786K, gaining 0.017 BPB via more gradient updates. However, #375's systematic study on the #315 frontier base found 786K > 524K by 0.004 BPB (3-seed) — at the frontier, total tokens matter more than gradient steps. The optimal batch size is stack-dependent: 524K helps Tier 2-3 stacks; 786K helps XSA+EMA frontier stacks.

Part of the top stack: SmearGate + BigramHash + OrthoInit — used by most top validated entries. Requires OrthoInit to work (per #212's ablation). 11 layers + WD 0.04 + weight averaging (SWA or EMA). The standard-arch frontier (#414, 1.1228) builds on EMA + XSA4 + GPTQ-lite + Tight SWA + VE128 + Partial RoPE + LN Scale + Late QAT. The overall non-TTT frontier is now #609 (1.1154, XSA-all + Full GPTQ + Selective Pruning + Parallel Muon, @saml212).

Common but not universal: QAT with STE (~half), SWA (~17/49 validated), NorMuon (~3/49), FA3 (~13/49).

The Core Five Explained (for newcomers)

1. Int6 Quantization (instead of Int8)

Standard post-training quantization maps each weight to an 8-bit integer (256 levels). Int6 uses only 6 bits (64 levels, range [-32, 31]) with per-row scale factors, then compresses with zstd (level 22) instead of the baseline's zlib-9. Int6 frees ~25% more artifact space than int8, reinvested in a bigger model. Some submissions keep sensitive layers in fp16 (tied embedding) or int8 (embeddings) to limit compounding precision loss.

Origin: @nanlliu introduced int6 mixed precision in #39.
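
A minimal sketch of the quantize-then-compress path described above, assuming symmetric per-row scales stored in fp16 (real submissions vary in scale dtype and bit-packing layout; this version leaves int6 values in int8 containers and lets zstd recover the slack):

```python
import numpy as np
import zstandard as zstd

def quantize_int6_per_row(w: np.ndarray):
    """Symmetric per-row int6: map each row to integers in [-32, 31]
    with its own scale, so one outlier row can't blow up the whole tensor."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    q = np.clip(np.round(w / scale), -32, 31).astype(np.int8)
    return q, scale.astype(np.float16)

def pack_artifact(q: np.ndarray, scale: np.ndarray) -> bytes:
    # zstd level 22 (vs the baseline's zlib-9) does the heavy lifting;
    # real submissions often bit-pack 6-bit values before compressing.
    blob = q.tobytes() + scale.tobytes()
    return zstd.ZstdCompressor(level=22).compress(blob)

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale.astype(np.float32)
```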

2. MLP 3x Expansion

The baseline uses 2x MLP expansion (hidden dim 1024 for 512-dim model). Top submissions use 3x (1536). Wider MLP = more expressive capacity, funded by int6 artifact savings.

Origin: @jfprincz in #70 (Mar 19 08:57 UTC). @saml212 independently reached the same insight in #61 later that day.

3. Sliding Window Evaluation

Overlapping windows (stride=64, window=2048) give each scored token 1984+ tokens of context vs minimal context with non-overlapping chunks. Purely eval-time. Worth 0.034 BPB per @samacqua's ablation in #77.

Origin: @mattqlf in #50. Stride debate: stride=256 gives marginally better BPB at 4x less eval time (#114). Doc isolation hurts at stride=64 — use flat-stream eval (#199).
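
A sketch of the scheme as described: each forward pass re-reads a full window, but only the final `stride` new positions are scored, so every scored token after the first window sees at least 1984 tokens of context (flat-stream, no doc isolation; function and variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def sliding_window_nll(model, ids, window=2048, stride=64):
    """Total NLL (nats) with overlapping windows. The first window
    scores all its tokens; each later window scores only its final
    `stride` tokens, using the preceding window - stride as context."""
    total_nll, scored_upto = 0.0, 0
    for begin in range(0, len(ids), stride):
        end = min(begin + window, len(ids))
        chunk = ids[begin:end]
        n_new = (end - 1) - scored_upto      # predictions not yet counted
        if n_new <= 0:
            break
        with torch.no_grad():
            logits = model(chunk[:-1].unsqueeze(0)).squeeze(0)
        total_nll += F.cross_entropy(
            logits[-n_new:], chunk[1:][-n_new:], reduction="sum").item()
        scored_upto = end - 1
        if end == len(ids):
            break
    return total_nll
```

Raising `stride` to 256 trades a little context (hence BPB) for roughly 4x fewer forward passes, which is the tradeoff #114 measured.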

4. FP16 Tied Embedding

The tied embedding matrix (input + output) is uniquely sensitive to quantization — errors compound in both directions. Keeping it in fp16 (~1MB) is the single highest-value precision decision.

Origin: @chonchiog in #42.

5. Zstd-22 Compression

Zstandard at level 22 squeezes int6 data significantly tighter than zlib-9 — enough to fit ~1-2M more parameters. Compression happens once after training; decompression is fast. Free lunch.


The Path Down: What Separates Each Tier

The competition spans a 0.11 BPB range from baseline (1.2244) to the best record-eligible pending (1.1154, #609 — XSA-all + Full GPTQ + Selective Pruning). Two-track frontier: non-TTT (#609 at 1.1154) and legal TTT (#606 at 1.1162). Pre-eval TTT entries are excluded from the record track (15+ closed in Mar 24 enforcement sweep). What separates each tier isn't just techniques — it's a fundamentally different approach.

Tier 1: Tweaking the Baseline (1.20–1.22 BPB)

Submissions in this range make one or two changes to the baseline: a longer sequence length, a learning rate sweep, a warmdown adjustment. The approach is "how do I improve this model?" — treating the baseline as mostly correct and looking for low-hanging fruit.

This works for the first 0.02 BPB, but hits a wall fast. The constraint isn't hyperparameters — it's the artifact budget. At int8+zlib, you can't fit enough model capacity to go further. Many submissions in this range are also on non-standard hardware (RTX 4090, Apple Silicon, 1xH100), which limits training tokens and disqualifies them from the record track.

What to do if you're here: Adopt the core five (int6, MLP 3x, sliding window, FP16 embed, zstd-22) as a package. Each technique is well-documented in the deep dives below. Together they're worth ~0.05-0.07 BPB — the single biggest jump available.

Tier 2: Stacking Known Techniques (1.15–1.18 BPB)

These submissions adopted the core five and are assembling additional techniques: SmearGate, BigramHash, SWA, QAT, NorMuon. The approach is "what techniques exist and how do I combine them?" — surveying PRs, identifying high-impact components, and building a combined recipe.

This is effective: the leap from 1.22 to 1.16 is largely a stacking exercise. But submissions in this range often stop at "I added all the techniques" without investigating interactions. Common patterns: using SmearGate without OrthoInit (which hurts — per #212's ablation), running QAT from the start (which hurts — late QAT at 70-85% is better), or using SWA without sufficient weight decay (SWA shows no effect below WD=0.04).

What to do if you're here: Run ablations. Remove one technique at a time and measure the delta. You'll often find that one "improvement" is actually hurting because of interaction effects. Check your hyperparameters against the consensus (LR=0.02, momentum=0.99, warmdown=3000) but also against divergent successes like #76 (LR=0.03, momentum=0.97). Multi-seed validation (3 seeds) is essential — single-seed scores can be off by 0.002+ BPB.

Tier 3: Understanding Interactions (~1.120–1.15 BPB)

These submissions adopted the full technique stack and understood why each technique works. @jfprincz (#198 at 1.1326) is the canonical example: 11 layers + SmearGate + BigramHash + OrthoInit + WD 0.04 + SWA + FA3 assembled into a coherent system where each piece reinforces the others — WD makes weights compressible AND quantization-friendly, SmearGate+OrthoInit inject bigram context the small model can't learn from attention alone, and SWA smooths the weight landscape during warmdown.

The approach is "how do these techniques interact, and what's the optimal system?" Key markers of Tier 3 thinking:

What to do if you're here: Solidify your baseline with multi-seed validation. The primary path to Tier 4 is adopting XSA + Full GPTQ + EMA. Two proven routes: (a) #609's frontier — XSA-all + Full GPTQ + Selective Pruning + Parallel Muon + LeakyReLU² (1.1154, best non-TTT — ⚠️ calibration question pending); (b) #606's TTT path — Int5 GPTQ + Soft-Round QAT + legal cosine TTT (1.1162, best legal TTT). #631 is attempting to combine both. All share XSA + EMA as infrastructure.

Tier 4: Architecture Frontier (<~1.120 BPB)

Two non-TTT paths — standard arch leads! (1) Standard architecture now at #609 (1.1154, @saml212): XSA-all + Full GPTQ + Selective Pruning + Parallel Muon — beats GEPA (#505, 1.1181) by 0.0027. (2) GEPA architecture (#505 at 1.1181, @JoeProAI): Star-ReLU + U-Net Skip Gates + seq2048.

The key insight at Tier 4: EMA (0.997) outperforms standard SWA by 0.003 BPB (#375, 3-seed verified). @unnir's Tight SWA (scale<0.2, #374: 1.1246, 1 seed) may be an exception — "Tight" SWA restricts averaging to the last ~600 steps, which is functionally closer to EMA than standard SWA. Until #374 gets multi-seed validation, EMA is the safer default. #315 demonstrates that the XSA+EMA base still had headroom via careful regularization — Partial RoPE, LN Scale, and Late QAT each target a specific weakness.

What to do if you're here: Two tracks. (a) Non-TTT: #609 is the new baseline — XSA-all + Full GPTQ + Selective Pruning + Parallel Muon = 1.1154 (3-seed, @saml212). Remaining untried: Mousse optimizer, OptRot, systems opts (Liger-Kernel, 2:4 sparsity). Entropy coding dead at frontier (lzma at 99.7% Shannon limit per #609). (b) Legal TTT: Best legal: #606 (1.1162), #615 (1.1169). GEPA + legal TTT at 1.0983 on 4xA100 (#628, 20k steps) — 8xH100 record-eligible version untried (~1.116-1.120 projected at 7k steps). Pre-eval TTT variants closed in Mar 24 sweep.

Technique Interactions Matter More Than Technique Count

A recurring pattern: techniques that work independently can fail in combination. TTT+XSA actively hurts (#303: +0.016 worse), EMA fails without XSA (#201) but succeeds with it (#287), and 12L fails at seq2048 but works at seq1024 (#219 vs #76). #474 confirms this extends to newer techniques: VRL + Gated Attention + Catalytic Residuals stacked on a 12L SWA base (no XSA, no EMA) yielded 1.1690 — worse than the same base without them (1.1466). Frontier techniques are optimized for the frontier base; applying them to weaker bases produces negative or null returns.

The untried combinations above should be evaluated against your specific model's weaknesses, not applied blindly. XSA + EMA appears to be a prerequisite for most newer techniques (VRL, GA, legal TTT). The strongest remaining candidates are systems optimizations (fused kernels, 2:4 sparsity — throughput gains with significance waived) and compression innovations (OptRot, entropy-coded weights — freeing artifact space for capacity).

Val-Data & TTT Rulings (Mar 20-24)

Val data ruled out (Mar 20, @0hq): Val tokens cannot be in the artifact. Paid prefix (#168), error correction (#108), val-only training all banned for record track. Now in README FAQ.

TTT ruling (Mar 20, @0hq on #152): Only backward-looking TTT allowed — adapt on tokens already graded, not future tokens. Pre-eval adaptation invalid. Causal TTT (#267-style) remains allowed. In README FAQ.

Mar 22, @cocohearts on #317: TTT is "not in the spirit of the challenge." Broader organizer signal — even backward-looking TTT may face scrutiny.

Mar 23, @0hq on #402: Explicit TTT clarification — token-stream model is correct. You may use any preceding eval tokens already graded. You may NOT re-order the evaluation set. Invalid TTT PRs (train-on-val-then-measure) will be closed. Auto-review process being built.

Mar 23, @cocohearts: #374 rejected for insufficient statistical significance vs new SOTA. #505 needs packaging fixes.

Mar 24, @valerio-oai — enforcement sweep (15+ PRs closed). Two categories: (1) TTT information leakage: multi-epoch TTT with min-NLL selection, and adapting-then-scoring same tokens, both ruled equivalent to "training on the val set." #593, #576, #573, #568, #596, #605, #614, #620, #518, #548 closed. (2) Training data at eval time: GPTQ calibration using training data during eval budget disallowed. #593, #576, #569 closed for this. Calibration must count within training 600s. #589 ruled valid but closed — fails 0.005-nat threshold vs #549 SOTA. valerio-oai confirmed: "TTT is a valid approach in theory" but "very easy to unintentionally leak val data into."


Technique Deep Dives

The Muon Optimizer Family

Muon (MomentUm Orthogonalized by Newton-Schulz) is the optimizer at the heart of this competition's baseline, created by Keller Jordan for the NanoGPT speedrun. It runs standard SGD with Nesterov momentum, then post-processes each 2D parameter's gradient update by replacing it with the nearest orthogonal matrix via Newton-Schulz iteration. Intuitively: compute the gradient direction, then "clean it up" so the update is maximally informative without redundant directions. It's equivalent to steepest descent under the spectral norm, which improves the conditioning of the optimization landscape. ~35% faster training than AdamW on language models.
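
For intuition, here is the Newton-Schulz step at Muon's core, using the quintic iteration and coefficients from Keller Jordan's modded-nanogpt implementation (a sketch of the post-processing only, not the full optimizer loop):

```python
import torch

@torch.no_grad()
def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximate the orthogonal polar factor of a 2D update G:
    push all singular values toward ~1 while keeping the singular
    vectors, so every direction of the update carries equal weight."""
    a, b, c = 3.4445, -4.7750, 2.0315     # quintic coefficients (modded-nanogpt)
    X = G / (G.norm() + 1e-7)             # Frobenius norm bounds spectral norm <= 1
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T                           # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```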

NorMuon extends Muon by adding per-neuron adaptive learning rates from accumulated second-order statistics. Vanilla Muon can produce updates with highly non-uniform norms across neurons, causing some neurons to dominate training. NorMuon normalizes row-wise after orthogonalization, combining Muon's conditioning benefits with Adam-style balanced per-neuron learning. It also improves distributed scaling by avoiding full momentum gathering across GPUs. Used by @mtybadger (#122), @vmfunc (#89), @abhishekgahlot2 (#137), and others.

Muon Weight Decay — The competition baseline's Muon optimizer has no weight decay. Decoupled weight decay for Muon (p.mul_(1 - wd * lr)) existed in modded-nanogpt since Nov 2025, but wasn't in the baseline. @notapplica was the first to bring it into this competition in #60, improving BPB from 1.2160 to 1.2094. Weights stay smaller and better-distributed, improving both generalization and compressibility.

Quantization-Aware Training (QAT) with STE

Instead of training in full precision and quantizing afterward, QAT simulates quantization during training. In the forward pass, weights are rounded to their quantized values. The problem: rounding is non-differentiable, so gradients can't flow through it.

The Straight-Through Estimator (STE) solves this by pretending the rounding operation is the identity function during the backward pass. It's mathematically "wrong" but works remarkably well — the model learns weight configurations that are robust to precision loss because it's been "seeing" quantized weights throughout training.
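
A minimal sketch of STE fake-quantization, with a late-activation gate of the kind discussed in the next paragraph (the int6 rounding mirrors the quantization section; the `qat_enabled` flag and layer wrapper are illustrative):

```python
import torch

def fake_quant_int6_ste(w: torch.Tensor) -> torch.Tensor:
    """Forward: int6 round-trip. Backward: identity (straight-through)."""
    scale = w.abs().amax(dim=-1, keepdim=True) / 31.0
    w_q = torch.clamp(torch.round(w / scale), -32, 31) * scale
    return w + (w_q - w).detach()    # gradient flows through `w` unchanged

class QATLinear(torch.nn.Linear):
    """Linear layer whose weights 'see' quantization once QAT is enabled,
    e.g. only when lr_scale drops below a threshold (Late QAT)."""
    # NB: #453 found torch.compile can constant-fold a class attribute
    # like this, silently disabling QAT; a registered buffer is safer.
    qat_enabled: bool = False

    def forward(self, x):
        w = fake_quant_int6_ste(self.weight) if self.qat_enabled else self.weight
        return torch.nn.functional.linear(x, w, self.bias)
```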

Late QAT outperforms full-training QAT: The later, the better. @trovatochris (#117) activates at 70%, @mohosy (#130) at 75%, @unixmadtoonslab (#76) at 85%. #76 even dropped QAT entirely at 12L (1.1468), finding WD=0.04 alone sufficient. @jfprincz's #315 pushes this to the extreme: STE activates only in the final 4% of training (lr_scale < 0.1, during low-LR warmdown). This cuts the int6 roundtrip gap to ~0.007 BPB while preserving full-precision convergence. The lesson: QAT activation is a spectrum — later = cleaner convergence, better int6 gap.

Int8 vs int6 QAT tradeoff: @mrdavtan's ablation in #145 shows that int8 QAT is not worth it under the 10-min wallclock cap. The torch.quantile call for exact percentile matching adds ~20% per-step overhead (64ms → 77ms), costing ~2,000 training steps. Result: 1.2052 BPB with QAT vs 1.1925 without — the lost training tokens hurt more than closing the ~0.007 int8 quantization gap. Int6 QAT, however, likely pays off because its larger ~0.01+ BPB gap justifies the overhead — confirmed by #128 and #137.

SmearGate & Bigram Hash Embedding

@unnir introduced SmearGate in #102 and refined it in #135. This appears to be a novel technique for this competition — no published papers found.

SmearGate: A tiny learned gate (~512 params) that blends each token's embedding with the previous token's. This injects bigram (two-token) context directly into the embedding layer before the transformer starts processing. Normally a transformer must discover token pair relationships through self-attention; SmearGate provides this signal for free.

Bigram Hash: A hash table (commonly 2048-10240 buckets, dim=128, projected to 512) that maps token pairs to learned embeddings. Together with SmearGate, this gives the model token-pair awareness at nearly zero parameter cost.
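
A sketch of both components under stated assumptions: the ~512-param gate is read as one sigmoid gate per channel, and the hash is an arbitrary cheap mixing function; the exact formulations in #102/#135 may differ:

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    """Blend each token's embedding with its predecessor's via a
    learned per-channel gate (~dim params, i.e. ~512 here)."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(dim))

    def forward(self, x):                       # x: [B, T, dim]
        prev = torch.roll(x, shifts=1, dims=1)
        prev[:, 0] = 0.0                        # token 0 has no predecessor
        g = torch.sigmoid(self.gate)
        return (1 - g) * x + g * prev

class BigramHash(nn.Module):
    """Hash (prev_token, token) pairs into a small learned table,
    then project up to model width."""
    def __init__(self, buckets: int = 2048, dim: int = 128, model_dim: int = 512):
        super().__init__()
        self.table = nn.Embedding(buckets, dim)
        self.proj = nn.Linear(dim, model_dim, bias=False)
        self.buckets = buckets

    def forward(self, ids):                     # ids: [B, T] token ids
        prev = torch.roll(ids, shifts=1, dims=1)
        prev[:, 0] = 0
        h = (prev * 1000003 + ids) % self.buckets   # cheap mixing hash
        return self.proj(self.table(h))
```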

@unnir's original combination with orthogonal initialization achieved 1.1539 BPB in #135. @jfprincz's #198 (1.1326) extended this with 11L + SWA + FA3 + WD 0.04, and #287 (1.1280) extended further with XSA + EMA.

OrthoInit appears critical for SmearGate. @mrdavtan's ablation in #212 found that adding SmearGate + BigramHash without OrthoInit hurt BPB (1.1739 vs 1.1708 without). Every successful SmearGate submission uses OrthoInit — the two techniques may be co-dependent.

Exclusive Self-Attention (XSA)

XSA (arXiv:2603.09078, Shuangfei Zhai, 2026) removes self-value bias from attention output via orthogonal projection. In standard attention, each token's value vector contributes to its own output — XSA subtracts this self-component, forcing the model to rely on information from other tokens. Applied to the last 3-4 layers only ("Partial XSA"), where self-attention bias is highest.

Zero parameters, minimal overhead. @unnir's #265 GQA-aware implementation reduces XSA overhead from ~7ms/step to ~2ms/step. Near-universal among frontier submissions. Best non-TTT (#609, 1.1154) uses XSA on all 11 layers; official SOTA (#549, 1.1194) uses XSA4.
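
The paper's formulation is an orthogonal projection; as a simplified illustration, the same "no self-value" effect can be sketched by hard-masking the diagonal of the attention map (this approximates XSA's behavior, not its exact method):

```python
import torch

def xsa_attention(q, k, v):
    """Simplified XSA sketch: forbid each token from attending to itself
    so its own value vector is excluded from its output.
    q, k, v: [B, H, T, D]."""
    T = q.size(-2)
    scores = (q @ k.transpose(-2, -1)) / q.size(-1) ** 0.5
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool, device=q.device))
    scores = scores.masked_fill(~causal, float("-inf"))
    self_mask = torch.eye(T, dtype=torch.bool, device=q.device)
    self_mask[0, 0] = False     # token 0 has no other token to attend to
    scores = scores.masked_fill(self_mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```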

XSA coverage depth: 4 layers appears near-optimal. @gowtham0992's #478 tested XSA on ALL 11 layers: 1.1268 (3-seed) vs XSA-4 at 1.1327 on the same base (−0.006 from XSA-all). But #414 (XSA-4 + VE128 + Partial RoPE + LN Scale) reaches 1.1228 — better than #478's XSA-all(11) at 1.1268. XSA-all adds ~3ms/step overhead (−230 steps), and removing self-value from ALL layers may degrade the model's own-representation capacity. The progression: 3 layers (#265: 1.1307) → 4 layers (#414: 1.1228) → 11 layers (#478: 1.1268) suggests 4-6 layers is the sweet spot for non-TTT. However, #609 (1.1154, best non-TTT) uses XSA-all(11) and #606 (1.1162, best legal TTT) also uses XSA-all — at the current frontier, XSA-all with Full GPTQ overcomes the overhead penalty.

Test-Time Training (TTT)

@samacqua introduced a creative approach in #77: adapting the model during evaluation.

For each validation document, rank-8 LoRA (Low-Rank Adaptation) adapters are trained on the document's own text using only backward-looking context (no data leakage). The model essentially "studies" each document briefly before being scored on it. LoRA makes this practical by only training tiny low-rank matrices (~1.5% of params) rather than the full model, enabling batched per-document adaptation within the eval time budget.

The original #77 ablation showed TTT itself contributed only ~0.003 BPB on early baselines (most of the gain came from doc isolation + sliding window). Full-model SGD TTT (#152) was ruled invalid by @0hq — only backward-looking (score-first) TTT is legal. Modern legal TTT gains are much larger: #606 reaches 1.1162, #615 reaches 1.1169.
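
A minimal sketch of the score-first discipline that keeps TTT legal: each chunk is scored with the current weights before any gradient step touches it, so no update ever sees an unscored token (chunking and the AdamW settings are illustrative; #606/#615 use more elaborate recipes):

```python
import torch
import torch.nn.functional as F

def score_first_ttt(model, chunks, lr=1e-4):
    """Legal single-pass TTT: score chunk i, THEN adapt on it before
    scoring chunk i+1. Gradients never touch tokens not yet graded."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    total_nll = 0.0
    for ids in chunks:                           # each ids: [T] token ids
        with torch.no_grad():                    # 1) score with current weights
            logits = model(ids[:-1].unsqueeze(0)).squeeze(0)
            total_nll += F.cross_entropy(logits, ids[1:], reduction="sum").item()
        logits = model(ids[:-1].unsqueeze(0)).squeeze(0)   # 2) adapt on scored tokens
        loss = F.cross_entropy(logits, ids[1:])
        loss.backward()
        opt.step()
        opt.zero_grad()
    return total_nll
```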

TTT on XSA+EMA is a spectrum, not a binary. On SmearGate bases: #254 shows 0.014 BPB gain. Three XSA+EMA data points, sorted by base strength: (1) #317 (weak base, pre-quant 1.1581, no FA3): TTT gains 0.024 BPB. (2) #338 (@alertcat, #315 base — frontier at 1.1250, Partial RoPE + LN Scale + Late QAT): TTT neutral ±0.001 (3 seeds). (3) #303 (@sseanliu, #287 base — 1.1280, without #315's additional regularization): TTT +0.016 BPB worse. The pattern suggests TTT interacts with how tightly converged the base model is: under-trained bases benefit from local adaptation; over-regularized frontier bases are disrupted; the current frontier (#315) sits in a neutral zone. #338's neutral result is informative — it means TTT is not a meaningful lever at the frontier.

Reptile meta-TTT: gains on SmearGate, fails at frontier. @sseanliu's #296 shows 0.011 BPB on SmearGate models vs 0.001 naive. But #375 tested Reptile on #315's XSA+EMA base: +0.0076 worse, consuming 20% of training budget. The SmearGate gain does not transfer to the frontier. All three TTT variants (naive, MLP-only, Reptile) are now confirmed dead ends at ~1.125. Error-guided TTT is also negative — hardest tokens are genuinely unpredictable.

TTT optimizer recipe matters. @Christopher-Lee-McClendon's #461 (non-record, 4xA100) found that SGD+momentum(0.9), 3 epochs per 32K chunk, freezing first 2 blocks gets −0.0165 BPB TTT gain — 2.4× better than AdamW 1-epoch over all params (−0.0068 in their prior #456). Pre-TTT baselines nearly identical, so the entire improvement comes from the TTT recipe. This partially contradicts the #442 narrative (AdamW >> SGD) — the comparison is more nuanced: selective freezing + multi-epoch SGD with momentum can outperform single-epoch full-network AdamW.

Legal TTT works — two validated survivors after enforcement sweep. #606 (1.1162) and #615 (1.1169) remain open. #576 (1.1164) closed in Mar 24 sweep (GPTQ calibration at eval time). #573 (Multi-Pass min(NLL)) ruled invalid. Only single-pass score-first TTT is legal. TTT optimizer matters for GPTQ: SGD TTT hurts Full GPTQ models (+0.030, #601), but AdamW with cosine LR works (#606, #615). The adaptive LR + Soft-Round QAT make weights robust to TTT gradients.

Cosine TTT scheduling is a 3× multiplier. @mrdavtan's #481 (3-seed, 1.0970) introduced two TTT innovations on top of AdamW TTT: (1) cosine LR decay over 30 epochs — high LR early to repair quant damage, low LR late to refine; (2) per-layer LR groups based on measured quantization error — 3× base LR for MLP output projections (3.4× higher quant error), 0.5× for input projections. Result: TTT gain of −0.061 BPB vs #442's −0.019 with flat LR — a 3× improvement from scheduling alone. Pre-TTT ~1.158 (weaker base, FA2 not FA3). Also tested: focal loss and KL-divergence from pre-quant model — both failed to improve over CE. ⚠️ Pre-eval TTT.

#315's Techniques: Partial RoPE, LN Scale (Late QAT was inactive)

@jfprincz's #315 (1.1250) adds two effective zero-parameter techniques on top of #287's XSA+EMA base, gaining 0.0023 BPB. Note: Late QAT was also included in the code, but torch.compile constant-folded the _qat_enabled flag, making the STE branch dead code — Late QAT never activated (discovered by @152334H, confirmed in #453). The 0.0023 gain comes entirely from Partial RoPE + LN Scale.

Partial RoPE (16 of 64 head dimensions). Rotary Position Embedding (RoPE) injects position information by rotating query/key vectors. Standard RoPE applies to all head dimensions. Partial RoPE applies to only 25% (16 of 64 dims) — the remaining 48 dims attend without position encoding. Why this helps: the position-free dims learn semantic similarity independent of token distance, improving generalization across different position ranges. The model can learn both "what things are" (position-free) and "where things are" (position-encoded) using different parts of the same head. Zero new parameters.
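
A sketch of Partial RoPE with the 16-of-64 split described above (the rotary helper is the standard half-rotation formulation; #315's exact dim pairing may differ):

```python
import torch

def partial_rope(q, k, pos, rot_dims=16, base=10000.0):
    """Rotate only the first `rot_dims` of each 64-dim head; the
    remaining dims attend position-free. q, k: [B, H, T, 64]; pos: [T]."""
    half = rot_dims // 2
    freqs = base ** (-torch.arange(half, device=q.device).float() / half)
    angles = pos.float()[:, None] * freqs[None, :]       # [T, half]
    cos, sin = angles.cos(), angles.sin()

    def rotate(x):
        xr, x_pass = x[..., :rot_dims], x[..., rot_dims:]
        x1, x2 = xr[..., :half], xr[..., half:]
        xr = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
        return torch.cat([xr, x_pass], dim=-1)           # rotated + pass-through

    return rotate(q), rotate(k)
```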

LN Scale (output scaled by 1/√(layer_idx+1)). After each RMSNorm, the output is multiplied by a layer-dependent scale factor that shrinks with depth. Layer 0: ×1.0; Layer 5: ×0.408; Layer 10: ×0.302. This damps the contribution of deeper layers to the residual stream, preventing later layers from "overwriting" early representations. Training is more stable — the model can use depth incrementally rather than being forced to route everything through deep layers. The 1/√(layer+1) schedule is related to the "depth scaling" used in some architecture papers. Zero new parameters.
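
As a one-line illustration inside a pre-norm residual block (a sketch; the block wiring is assumed, the scale formula is as stated above):

```python
def block_forward(x, attn, mlp, norm1, norm2, layer_idx):
    """LN Scale: multiply each RMSNorm output by 1/sqrt(layer_idx + 1)
    (layer 0 -> x1.0, layer 5 -> x0.408, layer 10 -> x0.302) so deeper
    layers write more gently into the residual stream."""
    s = (layer_idx + 1) ** -0.5
    x = x + attn(norm1(x) * s)
    x = x + mlp(norm2(x) * s)
    return x
```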

Late QAT (STE enabled only when lr_scale < 0.1) — ⚠️ was dead code in #315. torch.compile constant-folded the _qat_enabled class attribute, so the STE branch never activated (discovered by @152334H, confirmed in #453). The concept is sound — late activation avoids corrupting Muon's momentum — but #315's actual gains came from Partial RoPE + LN Scale alone. Working Late QAT: @unnir (#374, scale<0.1), @signalrush (#414, threshold 0.15), @fbedev (#417). Downstream submissions copying #315's code may also have inactive Late QAT.

The two active techniques (Partial RoPE + LN Scale) gain 0.0023 BPB vs #287 — statistically clear (3-seed variance 0.0005 BPB, t-stat -101.9 vs SOTA, p << 0.01).


Notable Non-Record Submissions
| Author | PR | Highlight |
|---|---|---|
| @mohosy | #130 | 7 toggleable improvements; QAT + Muon momentum analysis |
| @MatoTeziTanka | #95 | PROTEUS EMA — reduces int8 quant loss 0.0072→0.0048 |
| @nglain | #141 | 33-experiment sweep; found int6 STE + Muon conflict (+0.007) |
| @kellyvv | #108/#232 | Error Correction Table — stores model's worst predictions, ~1.05 est. on 8xH100 |
| @mrdavtan | #145 | Int8 QAT ablation — overhead exceeds recovery |
| @timothywangdev | #220 | [WIP] First SSM (Linear Recurrent Unit) — non-transformer architecture |
| @mkenney2 | #599 | Hymba: Hybrid Attention + Mamba SSM (first competitive non-transformer). 7L parallel attn+SSM branches with learned mixing. 1.1828 BPB, 3 seeds, 8xH100. Key: shallow models win (SSM makes each layer more powerful → 7L beats deeper pure transformers at same step budget). |
| @alons23 | #216 | Ternary Universal Transformer — 68M params, 4×6 depth recurrence |
| @Cwarren15-A | #283 | PPM-C context mixer — classical compression blended with neural (0.015 BPB on baseline) |
| @sseanliu | #296 | Reptile meta-TTT — 0.011 BPB gain on SmearGate models (10x naive TTT). Error-guided TTT negative. |
| @integrate-your-mind | #289 | 11L seq1024 + U-Net skips (1.1518). TTT LoRA worse than sliding window alone on this base. |
| @gowtham0992 | #295 | Backout (learned residual subtraction) + mixed int5/int6 QAT + U-Net skips (1.1477, 1 seed) |
| @JackYoung27 | #302 | Online causal TTT + decay prior (p += λ(p₀-p)) + Reptile (last 10%) + XSA3 + Pre-Q/K RMSNorm. TTT gain: -0.014 BPB (1.1660→1.1520). Adapts MLP only in last 3 blocks. Int5-MLP/int6-attn + BigramHash(10240). 1 seed. |
| @xuafeng | #306 | QAT Int5/Int6 on #180 base: post-training quant outperforms QAT by ~0.002 BPB — quant noise acts as beneficial regularization that QAT removes (1.14476, 1 seed) |
| @NewyorkDev | #309 | CLASE-Quant adaptive per-layer quantization: int8 for boundary layers, int6 for middle — saves ~15% vs uniform int8 (1.1914, 3 seeds) |
| @chanwoo-park-official | #312 | Canon ACD layers (Allen-Zhu 2025) on 9L stack — learnable 1D conv (k=3) placed before attention, before MLP, and in MLP hidden stream (avoids QKV=B for cost). 1.1668, 1 seed. Novel architecture technique; interesting if it scales to 11L. |
| @SkywardSyntax | #316 | 12L Low-Rank Q (r=128) + QAT int7 on 1xH100 (pre-quant 1.2035, awaiting 8xH100). Key negative result: FTLE per-row precision is a dead end — uniform int-N beats mixed-row at every bit width due to higher entropy defeating zstd. Layer sharing also abandoned at 512d (costs 0.09 BPB, no space benefit). |
| @aravhawk | #314 | 11L Int4 MLP QAT on #180 base — int4 MLP saves ~2MB to fund 11th layer vs #180's 10L int5. Awaiting 8xH100 results. Record track aspirant. |
| @Rhodrium | #331 | 10L MLP3x + BigramHash(2048) + SmearGate + OrthoInit + mixed int5/int6 + SWA + stride=32 eval. 1.1487 BPB, 3 seeds. Solid consensus stack; above SOTA but clean stride-32 reference on H100s (94/91ms/step). |
| @sheeki03 | #339 | Backout ablation: -0.0071 BPB on #198 base (1.1435→1.1364). First clean measurement. ⚠️ artifact 16.17MB (over limit), 1 seed. Plans int5-MLP fix + XSA/EMA combo. |
| @Ananddna | #327 | TrigramHash (8192 buckets) + Partial RoPE (50%) + per-head temperature scaling + stride=32 eval. 1.1450, 2 seeds. Three novel techniques on 10L int5 base. |
| @mahsumaktas | #333 | 23-run systematic exploration (1.1565, 3 seeds). Key findings: seq curriculum fails (SWA incompatible across seq lengths), EMA causes 0.14 BPB quant gap on SWA-stack, MLP 2.75x sweet spot at 11L+SmearGate, Late QAT 75% cuts quant gap 0.023→0.006. |
| @sseanliu | #318 | Neural Cache research proposal — maintain per-layer KV cache across sliding windows, extending effective context from 2K to 50K+. Zero artifact cost, backward-looking compliant. Untested (torch.compile state bug). Proposed on #287 base (1.1284). |
| @fbedev | #348 | QAT + BigramHash(12288) + stride=32 on #180 base. 1.1444, 1 seed. Barely above SOTA — diminishing returns from BigramHash >10240. |
| @sp00mm | #352 | Memory Tokens: 64 learnable embeddings as global context scratchpad. A/B: -0.014 BPB. Uses #315 stack + MTP aux heads. 1.1659, 1 seed. |
| @jackopenn | #336 | Hypernetwork prototype — shared-trunk MLP generates full GPT weights from compact conditioning vectors (9.34x compression, 26.5M target params from 2.8M hypernet params, 2.09MB artifact). No BPB result yet. Highest compression-ratio weight-generation approach seen. |
| @mkenney2 | #362 | 11L SmearGate+BigramHash(4096)+EMA+OrthoInit, WD=0.02, stride=256. 1.1497 (3-seed). Key negatives: AttnRes -54% throughput, seq curriculum compile overhead, depth recurrence, 13L+TTT compression. |
| @shikhar1729 | #364 | 524K batch on #180 base — 1.1497 (3-seed). Validates 524K batch benefit: more optimizer steps per wall-clock minute. |
| @charmquark1984 | #375 | $500 systematic frontier study. 13 techniques on #315 base, all failed. Reptile +0.008 worse. EMA>SWA +0.003. 786K>524K +0.004. See What Doesn't Work. |
| @anthony-maio | #376 | 9L + full stack + custom Triton/CUDA kernels (fused RMSNorm+QKV 1.47×, fused ReLU² MLP 1.26×). 1.1401, 1 seed. 125ms/step (4,782 steps). Kernel pipeline in dev for next submission. |
| @abaybektursun | #399 | First Muon systems optimization. Parameter Banking + Polar Express + Parallel Muon = 82.14ms/step (−3.1% vs #315's 84.76ms, +227 steps). Lossless — identical pre-quant 1.1421. ⚠️ Artifact 20.4MB (packaging issue). Significance waived for systems-only. |
| @anantdgoel | #384 | 3 research directions: MAML Meta-TTT = +0.085 worse (5th dead TTT variant). Eval stacking (cache + OGD on vocab bias): −0.003 additive, zero artifact cost. Tokenizer v8192: null result — longer tokens harder to predict, offsetting compression. 1xA40, 1.2882. |
| @anantdgoel | #413 | Value Residual: −0.015 BPB (dev). Gated Attention: −0.003. Stack additively (−0.017). PPM-C: +0.002 (negative). 9L dev-scale, 1xRTX3090. |
| @anantdgoel | #487 | VRL+GA on 11L production stack (1xA6000, 14.5hr). 1.1720 BPB, 19.4MB (over limit). Confirms dev ablation (−0.017 additive). Not 8xH100 — VRL on 8xH100 frontier still untested by originator (#486 by @ndokutovich tested VRL+Cosine TTT at 1.0887). |
| @zachgoldfine44 | #450 | 12L + Catalytic Residuals (novel: x + c*f(x), learned per-dim vector c). −0.024 BPB at zero overhead. 3-seed mean 1.1466. Built on #180. |
| @Christopher-Lee-McClendon | #461 | High-yield legal TTT: SGD+momentum(0.9), 3 epochs per 32K chunk, freeze first 2 blocks. TTT gain: −0.0165 (2.4× better than AdamW 1-epoch). Depth recurrence (11L from 10 cores). 1.14458, 4xA100. |
| @joshuaswarren | #474 | First VRL+GA+Catalytic Residuals stack on 12L + BigramHash(10240) + SWA + Late QAT. 1.1690 — disappointing vs #450's 1.1466 (same base without VRL/GA). Techniques don't stack additively here: no XSA, no EMA → weak base dilutes gains. |
| @leofeasby | #470 | Shared-weight transformer (single block × 9 passes) + U-Net skips + extended warmdown. 1.1454, 2.3hrs 8xH100. Key finding: improvement continues steadily throughout low-LR warmdown — no plateau observed. |
| @LoquiAuris | #465 | Int6 embedding quantization: +0.0005 BPB penalty — essentially free. Systematic tokenizer study: sp8192 d=512 8L (1.1794) vs sp1024 d=512 10L (1.1508) — more layers > tokenizer efficiency. 3-seed std=0.00012. |
| @carlesonielfa | #457 | 11L + XSA + VRL (Value Residual Learning) + SWA + seq4096 + cross-doc TTT. 1.1839 (int8+zlib). Another VRL adopter. |
| @AnirudhRahul | #511 | Delayed PPM eval-time bank on #180 base. Classical n-gram backoff (C trie) with 2048-token delay — only sees tokens outside transformer's window. −0.00126 BPB (p=0.000041, 3-seed) — real but below 0.005-nat record bar. Zero artifact cost, composable with any model. First positive classical compression result at frontier. |
| @Robby955 | #484 | TTT Memorization Analysis (updated from EBLS). Diagnostic: 3-epoch TTT adapted weights score 1.0476 via sliding window (genuine adaptation). At 10 epochs: 0.8566 TTT-loop / 0.9229 sliding — both below ~0.95 theoretical floor = memorization. Implication: #512's 0.95 seeds are likely memorization artifacts, not real gains. Also: MLP weights are layer-invariant (EBLS gammas → 0). |
| @Christopher-Lee-McClendon | #598 | 7000-step GEPA (4xA100). Extended warmdown + mixed int6/int8 + legal TTT. 1.1334 BPB. |
| @Christopher-Lee-McClendon | #628 | Sub-1.10 GEPA (4xA100, 20k steps). 8k warmdown + int6 GPTQ-lite + legal TTT. 1.0983 BPB. Scaling law: warmdown is dominant lever. |
| @SPThole | #623 | First AWQ in competition — activation-aware weight scaling (α=0.5) before quant. Closed 63% of quant gap (0.027→0.010). Cyclic Muon Momentum (triangle wave 0.85-0.95). 21+ experiments. 1.1507, 3-seed. |
| @CiprianFlorin-Ifrim | #641/#640 | Binary/Ternary U-Net — radical compression frontier. Binary (1-bit): 106.2M params in 15.67MB via bit-packing, 15L 768d, 1.1239 BPB (non-record, 50k steps). Ternary (1.58-bit): 73.7M params, 10L 768d, 1.1570 BPB (3-seed, 599s). NeoMuon optimizer, 8192 BPE tokenizer, FP8 QAT, YaRN 2048. 250+ experiments. "Train larger, quantize harder" taken to extreme. |

Idea Lineage & Diffusion (45 techniques tracked)
Technique First Appeared Originator Adoption
Sliding Window Eval #50 @mattqlf Near-universal (20+)
FP16 Tied Embedding #42 @chonchiog ~10+
Int6 Quantization #39 @nanlliu ~15+
MLP 3x Expansion #70 @jfprincz ~12+
Muon Weight Decay #60 @notapplica (from modded-nanogpt) Several
Overtone Spectral Init #60 @notapplica @peytontolbert (#155), @TevBenji (#69)
SmearGate / BigramHash #102 @unnir Near-universal (25+). All competitive submissions use SmearGate+BigramHash+OrthoInit.
OrthoInit #135 @unnir (combined with SmearGate) Near-universal among top SmearGate submissions. Critical co-dependency: SmearGate hurts without OrthoInit (#212 ablation).
Test-Time Training #77 @samacqua (LoRA TTT) @timowhite88 (#152 SGD, #254 first TTT+SmearGate+11L), @polarizedfortnite-cpu (#81, first TTT+int6), @andrewgcodes (#267 Causal TTT), @charmquark1984 (#281), @ibarrajo (#290, TTT+XSA), @mohosy (#291, pending), @sseanliu (#296, Reptile meta-TTT), @davidpuertolas (#297), @alertcat (#338, TTT on #315 frontier base — neutral), @felipe-parodi (#398, 20-epoch aggressive TTT, 1.1221), @kasimte (#455, SGD TTT on #374 base), @Christopher-Lee-McClendon (#461, high-yield SGD+momentum TTT), @abaybektursun (#473, legal TTT — 1.1214), @LoquiAuris (#548, batched LoRA TTT — 1.0865), @Sarimsaljook (#573, Multi-Pass TTT — 1.0523 ❌ ruled invalid)
NorMuon Multiple PRs Convergent @mtybadger, @vmfunc, @dexhunter, others
QAT with STE Multiple PRs Convergent @rsavitt, @yahya010, @trovatochris, others
SWA #89 @vmfunc @mtybadger (#122), @dexhunter (#156), @anthony-maio (#376), others
Depth Recurrence Multiple PRs Independent @MatthewHRockwell, @koushikkethamakka, @iverbovoy (#148), others
Int5 MLP Quantization #76 @unixmadtoonslab @thwu1 (#180, former SOTA), @alertcat (#219, mixed int5/int6), @Mapika (#349), @Skrisps26 (#354), @signalrush (#369)
BigramHash Scaling (4096–16384) #180 @thwu1 (10240) @andrewgcodes (#267, 16384), @simonbissonnette (#466, 12288), @JoeProAI (#462, 8192). Diminishing returns >10240 (#348).
Low-Rank Q Factorization #215 @JayCheng113 Novel — no adopters yet
Partial XSA (Exclusive Self-Attention) #265 @unnir Near-universal at frontier (15+): @jfprincz (#287, #315), @signalrush (#369, #414), @saml212 (#332), @chanwoo-park-official (#400), @fbedev (#417), @sjp611 (#442), @JoeProAI (#462), @kasimte (#455), @ofirkris (#458), @Christopher-Lee-McClendon (#461), others
EMA Weight Averaging #95 @MatoTeziTanka (PROTEUS EMA) Near-universal at frontier (12+): @jfprincz (#287, #315), @signalrush (#369, #414), @sjp611 (#442), @JoeProAI (#462, 0.9985), @ofirkris (#458), @simonbissonnette (#466), @felipe-parodi (#398), @parinzee (#493), others. EMA fails without XSA (#201).
Reptile Meta-TTT #296 @sseanliu @JackYoung27 (#302, +causal TTT + decay prior). #375: failed on #315 base (+0.0076 worse).
BitNet b1.58 #126, #139#367 @Athenox14, @ksang123 Two independent. #367: standard stack breaks on ternary.
Partial RoPE #315 @jfprincz (25% dims) @saml212 (#332), @unnir (#374), @felipe-parodi (#398), @signalrush (#414), @fbedev (#417), @kasimte (#455), @ofirkris (#458), @Christopher-Lee-McClendon (#461), @JoeProAI (#462)
LN Scale (1/√layer) #315 @jfprincz Near-universal at frontier (10+): @signalrush (#414), @fbedev (#417), @JoeProAI (#462), @sofiabod (#489, calls it "depth damping"), others. Variant: @eb1386 (#449, cosine)
Late QAT (last 4% only) #315 @jfprincz (⚠️ dead code in #315 — torch.compile bug) Working: @unnir (#374, scale<0.1), @signalrush (#414, threshold 0.15), @fbedev (#417), @JoeProAI (#462). Dropped at 12L (#332).
Gradient-Guided Quant #332 @saml212 @ndokutovich (#486, sensitivity-ranked int7/6/5 — top 10%/70%/20%)
TrigramHash #327 @Ananddna @ndokutovich (#486, 4096 buckets + VRL + GradQuant + Cosine TTT, 1.0887)
Per-Head Temperature #327 @Ananddna Novel — each head learns its own temperature scalar
Tight SWA (scale<0.2) #374 @unnir @dannywillowliu-uchi (#379, +GPTQ-lite), @kasimte (#455, +TTT)
Shared Value Embedding #374 @unnir @dannywillowliu-uchi (#379, +GPTQ-lite), @kasimte (#455, +TTT), @Christopher-Lee-McClendon (#461, layers 9-10), @JoeProAI (#505, GEPA arch, 1.1181)
AdamW TTT #442 @sjp611 (3-line diff from #398: SGD→AdamW) @JoeProAI (#462), @mrdavtan (#481, cosine), @ndokutovich (#486), @sofiabod (#489, 7L), @amaljithkuttamath (#490, +VRL+GA), @ahmettrkck (#491, +DWA), @EthanYangTW (#503, legal AdamW TTT), @ymrohit (#555, closed)
GPTQ-lite → Full GPTQ #379 @dannywillowliu-uchi (per-layer clip percentile search) @signalrush (#414), @fbedev (#417), @gowtham0992 (#478), @EthanYangTW (#503, #606 int5 GPTQ), @raahilshah (#535), @gowtham0992 (#569), @cmcdnd (#576), @newjordan (#587), @saml212 (#609), @danialht (#615). Now standard at frontier.
Value Residual Learning #413 @anantdgoel (arXiv:2410.17897, −0.015 dev) @ndokutovich (#486, 1.0887+Cosine TTT), @amaljithkuttamath (#490, VRL+GA+TTT, 1.0891 1-seed!), @gowtham0992 (#569, VRL no-TTT → 1.1175, best non-TTT at time), @joshuaswarren (#474, failed on weak base), @carlesonielfa (#457), @yuvrajyadav17 (#471, pending), @ahmettrkck (#491, VRL+DWA+TTT)
Catalytic Residuals #450 @zachgoldfine44 (x + c*f(x), −0.024 BPB) @joshuaswarren (#474, +VRL+GA, 12L — 1.1690, techniques don't stack on weak base)
Two-Phase TTT #417 @fbedev (50ep norm-only + 10ep last-3-blocks) Novel — no adopters yet
Gated Attention #413 @anantdgoel (arXiv:2505.06708, −0.003 dev) @amaljithkuttamath (#490, +VRL+TTT, 1.0891), @joshuaswarren (#474, failed on weak base), @yuvrajyadav17 (#471, pending)
Cosine TTT + Per-Layer LR #481 @mrdavtan (cosine LR decay + 3× MLP output proj LR) @sofiabod (#518, cosine+per-layer → 1.0814), @ndokutovich (#486, cosine → 1.0887), @Christopher-Lee-McClendon (#537, per-layer LR on legal TTT). ⚠️ Pre-eval TTT (except #537)
XSA-All (11 layers) #478 @gowtham0992 (first to test XSA on all layers) @EthanYangTW (#503, #606), @cmcdnd (#576), @newjordan (#587), @saml212 (#609, best non-TTT), @danialht (#615). Now standard at frontier.
LeakyReLU(0.5)² #434 (closed) → #493 @parinzee (squared leaky ReLU, 0.5 neg slope) @sofiabod (#518), @raahilshah (#535), @Christopher-Lee-McClendon (#537), @abaybektursun (#549), @gowtham0992 (#569), @cmcdnd (#576), @RoyiRa (#589), @saml212 (#609), @robinojw (#620). 10+ adopters — fastest-spreading technique.
Delayed PPM Eval Bank #511 @AnirudhRahul (classical n-gram backoff with 2048-token delay, on @thwu1's #180 base) Novel — −0.00126 BPB at p=0.000041. Zero artifact cost.
Post-TTT Temperature Calibration #576 @cmcdnd (T=0.98 re-score after legal TTT to correct overconfidence, −0.003 BPB) Novel — no adopters yet. Zero-cost technique.
Walsh-Hadamard Rotation #586 @EaCognitive (pre-quant rotation for outlier redistribution. zstd 1.70x→1.76x, freeing 530KB for VE128) Novel — a substitute for (doesn't stack with) GPTQ at int6, since both address the same outlier problem. Also found the Late QAT dead-code bug in CastedLinear.
Late Soft-Round QAT #589 @RoyiRa (temperature-controlled soft-round surrogate replaces hard STE; bin-aware gradients near int6 boundaries) @EthanYangTW (#606, tanh α1→16, best legal TTT 1.1162). Independent discovery likely (~8hr gap, same tanh-alpha approach).
Selective Pruning #609 @saml212 (post-GPTQ ±1 magnitude pruning sorted by reconstruction error) Novel — no adopters yet.
Residual Input Mixing #615 @danialht (dense residual: each block sees learned mix of current stream + earlier blocks + x0) Novel — no adopters yet.
AWQ #623 @SPThole (activation-aware weight scaling α=0.5 before quant, closed 63% quant gap) Novel — first use in competition.
Cyclic Muon Momentum #623 @SPThole (triangle wave 0.85-0.95, period=50) Novel — no adopters yet.
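
The fastest-spreading entry above, LeakyReLU(0.5)² (#493), is simple enough to sketch. Below is a minimal PyTorch reading of the PR description; the module name is ours, and #493's actual implementation may differ (e.g. a sign-preserving y·|y| variant):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LeakySquaredReLU(nn.Module):
    """Squared leaky ReLU: f(x) = leaky_relu(x, 0.5) ** 2.

    relu(x)**2 has zero gradient for x < 0; here the negative branch
    is (0.5 * x)**2 with gradient 0.5 * x, so negative inputs keep
    gradient flow, which #493 credits for its unusually tight
    seed-to-seed variance (std = 0.00017).
    """
    def __init__(self, negative_slope: float = 0.5):
        super().__init__()
        self.negative_slope = negative_slope

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = F.leaky_relu(x, self.negative_slope)
        return y * y
```

As a drop-in replacement for the speedrun-style relu()**2 MLP activation it adds no parameters, which is consistent with how quickly it stacked onto existing submissions.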

Predictions & Commentary

  1. GEPA + legal TTT is the highest-EV untried combo — now sub-1.10 off-record. #628 (11L GEPA + 20k steps + pure int6 + legal TTT on 4×A100-40GB, ~2.8 hours, non-record compute) reached val_bpb 1.0983. Its step-count scaling: 9k→1.116, 12k→1.108, 20k→1.098. An 8xH100 run (~7k steps in 600s) projects to ~1.116-1.120 (see the extrapolation sketch after this list). With Full GPTQ (untested on GEPA) it could go lower. A record-eligible 8xH100 version remains untried.

  2. Non-TTT and legal TTT have converged. The best non-TTT result (#609, 1.1154) and the best legal-TTT result (#606, 1.1162) are separated by just 0.0008 BPB; the two tracks are effectively at parity. Further gains likely require either (a) GEPA + TTT on 8xH100, (b) systems optimizations (more training steps), or (c) compression innovations (OptRot, better quantization).

  3. #609's ablation data narrows the frontier. Sixteen techniques were tested on the current best stack; most hurt or are neutral. VRL conflicts with VE128. Gated Attention's step overhead isn't worth it. lzma is at 99.7% of the Shannon entropy limit. The "easy stacking" era is over: future gains require genuine innovation, not recombination.

  4. The calibration-time ruling could reshape the record table. @valerio-oai questioned whether GPTQ calibration during eval constitutes "accessing training data at eval time." If ruled invalid, all Full GPTQ submissions (including #609) would need to move calibration into the training budget. 37 days remain.

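A back-of-envelope version of prediction 1's extrapolation, using only the three scaling points reported above. The log-linear ansatz is our assumption, not #628's:

```python
import math
import numpy as np

# #628's reported GEPA scaling points: (optimizer steps, val BPB).
steps = np.array([9_000, 12_000, 20_000])
bpb = np.array([1.116, 1.108, 1.098])

# Fit bpb ≈ a + b * ln(steps), a crude but common scaling ansatz.
b, a = np.polyfit(np.log(steps), bpb, deg=1)

# Project down to the ~7k steps an 8xH100 run fits into 600 s.
projected = a + b * math.log(7_000)
print(f"bpb ≈ {a:.3f} {b:+.4f}*ln(steps); at 7k steps ≈ {projected:.4f}")
# -> at 7k steps ≈ 1.1210
```

The fit lands at ~1.121, at the pessimistic edge of the ~1.116-1.120 projection; three points under-determine the curve, so treat both numbers as rough.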

Full Official Leaderboard (18 entries)
Rank Score Author Key Techniques PR
1 1.1194 @sanjeevmadhav LeakyReLU² + Legal Score-First TTT + Parallel Muon on #414 stack #549
2 1.1228 @signalrush 11L EMA + GPTQ-lite + warmdown3500 + QAT@0.15 #414
3 1.1248 @jfprincz 11L Partial RoPE + LN Scale + EMA + XSA4 #315
4 1.1271 @jfprincz 11L XSA4 + EMA + Int6 MLP3x #287
5 1.1307 @unnir 11L Efficient Partial XSA #265
6 1.1458 @raahilshah Int6 MLP3x + SmearGate + BigramHash + OrthoInit + MuonWD + SWA #162
7 1.1502 @aruniyer 11L + Int6 QAT + MLP3x + WD 0.04 + zstd-22 #86
8 1.1556 @aquariouseworkman SmearGate + OrthoInit + Int6 STE QAT + MLP3x + Sliding Window #65
9 1.1586 @yahya010 10L Int6 QAT + Zstd MLP2.6x + Muon 0.99 + Sliding Window #63
10 1.1630 @aquariouseworkman Mixed int6/int8 + MLP3x + Sliding Window #65
11 1.1748 @notapplica Sliding Window + FP16 Embed + 10L + Muon WD + Spectral Init #60
12 1.1925 @mattqlf Sliding Window Eval (stride=64; sketch after this table) #50
13 1.1928 @samacqua LoRA Test-Time Training #77
14 1.2014 @spokane-way 4k seq length + tuned hyperparams #52
15 1.2060 @spokane-way 2048 seq length #49
16 1.2147 @nanlliu 10 layers, mixed int8/int6 #39
17 1.2197 @chonchiog FP16 Tied Embedding + LR/Warmdown Tuning #42
18 1.2244 Baseline 9L 512dim 1024vocab TiedEmbed 4 KV heads
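
Sliding-window evaluation (#50, and implicitly most entries above it) is a pure eval-side compression win: re-score the text in overlapping windows so nearly every token is predicted with close-to-full left context. A hedged sketch, assuming a model that maps a (1, T) token batch to (1, T, vocab) logits; the window and stride defaults are illustrative, with #50 using stride=64:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_nll(model, ids: torch.Tensor,
                       window: int = 1024, stride: int = 64) -> float:
    """Total NLL (nats) of the 1-D token tensor `ids`.

    Each step re-runs the model on the trailing `window` tokens but
    only counts the targets not yet scored, so every token (except
    the first) is scored exactly once, with long left context.
    """
    total_nll, prev_end = 0.0, 0
    for begin in range(0, ids.numel() - 1, stride):
        end = min(begin + window, ids.numel())
        x = ids[begin:end].unsqueeze(0)          # (1, T)
        logits = model(x[:, :-1])                # assumed: (1, T-1, vocab)
        losses = F.cross_entropy(logits.squeeze(0), x[0, 1:],
                                 reduction="none")
        n_new = end - max(prev_end, begin + 1)   # targets not yet scored
        total_nll += losses[-n_new:].sum().item()
        prev_end = end
        if end == ids.numel():
            break
    return total_nll
```

The cost is roughly window/stride forward passes over the corpus (16x here), which is why this has to fit the separate 10-minute eval budget rather than the training one.
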
All Pending Validated Submissions (44 entries)

Validated against the SOTA at submission time; Δ nats shown vs the SOTA at time of validation. Several entries rely on legal score-first TTT; a sketch of that loop follows this table.

BPB Author Δ nats Seeds Techniques PR
1.1154 @saml212 0.012 3 XSA-all + Full GPTQ + Selective pruning + Parallel Muon. No TTT. ⚠️ Reclassified non-record (GPTQ calibration outside training budget). #609
1.1162 @EthanYangTW 0.011 3 ✅ Int5 GPTQ + Soft-Round QAT (tanh α1→16) + legal cosine AdamW TTT. 33.6M params. #606
1.1169 @danialht 0.010 3 ✅ Residual Input Mixing + mixed int6 GPTQ + grouped AdamW TTT + MLP 3.5x. Legal TTT. #615
1.1171 @raahilshah 0.004 3 XSA-all + Full GPTQ + Parallel Muon + Selective Pruning + LZMA + amax-aligned QAT. No TTT. Same stack as #609. #634
1.1180 @kshitizz36 0.002 3 Full GPTQ + LeakyReLU² + Parallel Muon. No TTT. Independent reproduction of #593 direction. Std=0.0010. #626
1.1181 @JoeProAI 0.042 3 ✅ GEPA arch without TTT: Star-ReLU + U-Net Skip Gates + XSA4 + VE128. #505
1.1204 @raahilshah 0.038 3 ✅ LeakyReLU² + Full GPTQ + QAT-export alignment. No TTT. Std=0.0001. #535
1.1208 @newjordan 0.003 3 XSA-all(11) + GPTQ (block64, percdamp=0.002). TTT slightly hurt (+0.0002). No effective TTT. #587
1.1214 @abaybektursun 0.036 3 ✅ Legal score-first TTT (SGD+momentum) + Parallel Muon. #473
1.1215 @newjordan 0.002 3 Full GPTQ (Hessian-aware, block-128) + Early QAT (threshold 0.5, ~1750 steps) + Legal TTT (EMA scoring, cosine LR). On #414 stack. #578
1.1221 @felipe-parodi 0.035 3 ❌ 20-epoch TTT + Partial RoPE + LN Scale. Pre-eval TTT. #398
1.1233 @signalrush 0.033 3 11L + EMA + XSA4 + VE128 + Tight SWA + GPTQ-lite + Late QAT@0.15 + warmdown 3500 + Partial RoPE + LN Scale + FA3. No TTT. #414
1.1250 @jfprincz 0.030 3 11L + Partial RoPE (16/64) + LN Scale + XSA (last 4) + EMA (0.997) + FA3. ⚠️ Late QAT inactive (torch.compile bug). #315
1.1256 @alertcat 0.029 3 11L + #315 stack + TTT (3 ep SGD, freeze 2 blocks) — TTT neutral on #315's base (±0.001) #338
1.1268 @gowtham0992 0.027 3 11L + XSA on ALL 11 layers + GPTQ-lite + EMA(0.997) + Tight SWA + Late QAT + FA3. No TTT. #478
1.1280 @jfprincz 0.025 3 11L + XSA (last 4) + EMA (0.997) + SmearGate + BigramHash + WD 0.04 + FA3 #287
1.1299 @chanwoo-park-official 0.022 3 11L + CANON-AC(last 5) + DeltaGate (−0.006 BPB, 10% step cost) + XSA4 + Tight SWA + Partial RoPE + LN Scale + Late QAT. Unique technique — no other submission uses CANON. #400
1.1299 @kasimte 0.022 3 11L + #374 base (Tight SWA + VE128 + XSA4) + 3-epoch SGD TTT. ⚠️ TTT #455
1.1309 @parinzee 0.020 3 11L + LeakyReLU(0.5)² (preserves negative gradient flow vs relu²) + XSA4 + EMA + Partial RoPE + Int6 + 524K batch + warmdown 4500 (55% of training). No TTT. Std=0.00017 (8× tighter than typical — suggests more stable training dynamics). Built on #180 base. #493
1.1313 @timowhite88 0.019 3 11L Int6 MLP3x + SmearGate + TTT (3 ep SGD, freeze 2 blocks) + RoPE50K + SWA + WD 0.04 + FA3 ⚠️ pre-eval TTT ruled invalid #254
1.1320 @saml212 0.018 3 12L + Gradient-Guided Quant (int7/6/5) + Partial RoPE + LN Scale + XSA4 + EMA + 524K batch + MLP 1408 #332
1.1326 @jfprincz 0.017 3 11L + Int6 MLP3x + SmearGate + BigramHash + WD 0.04 + SWA + FA3 #198
1.1327 @sofiabod 0.017 3 7L + MLP3x + BigramHash(2048) + SmearGate + AdamW TTT 5ep + int8+zlib. ⚠️ TTT #489
1.1328 @signalrush 0.017 3 11L + XSA4 + EMA + SmearGate + BigramHash(4096) + NTK-RoPE + Int5-MLP + 524K batch + FA3 + adaptive pruning (10-14%) #369
1.1400 @saml212 0.005 3 11L Int6 + SmearGate + BigramHash + 524K batch + SWA + WD 0.04 #236
1.1402 @andrewgcodes 0.017 3 10L Int5-MLP + BigramHash(16384) + Causal TTT + SWA(0.3) + WD 0.08 + 786K batch #267
1.1468 @unixmadtoonslab 0.047 3 12L Int5-MLP + SmearGate + BigramHash + SWA + no QAT #76
1.1472 @devin-cog 3 11L + Int6 + Muon WD 0.038 + LR 0.025 + Sliding Window #179
1.1480 @baudrillardsgh0st 0.045 3 11L + Int6 QAT + Per-Dim SmearGate + SWA + WD 0.038 ⚠️ artifact 16.08MB #194
1.1507 @dexhunter 0.041 3 Int6 STE + SmearGate + Seq2048 + OrthoInit + RoPE50K + SWA/100 #206
1.1507 @SPThole 3 AWQ (activation-aware weight scaling, α=0.5) + Cyclic Muon Momentum (0.85-0.95 triangle) + ReLU² + 11L Shared (last block reused). AWQ closed 63% quant gap. #623
1.1526 @MatoTeziTanka 3 PROTEUS v9: 11L INT6 + single-epoch LoRA TTT (score-then-train, compliant with Mar 24 ruling). TTT gain: −0.025. First post-sweep legal LoRA TTT. #633
1.1538 @jfprincz 0.035 3 OrthoInit + Int6 MLP3x + SmearGate + BigramHash + FA3 #164
1.1541 @alertcat 0.035 3 12L Int5-MLP + Int6-Attn + SmearGate + BigramHash + SWA #219
1.1546 @tamoghnokandar 0.034 3 Int6 MLP3x + NorMuon + FA3 + selective precision #173
1.1558 @JayCheng113 0.032 3 11L + Low-Rank Q (r=192) + Int6 + Sliding Window #215
1.1575 @saml212 0.029 3 Int6 + MLP 3x + selective precision + long-context #114
1.1577 @yahya010 0.029 3 Int6 QAT + BigramHash + MLP 1344 + MuonWD 0.02 + Sliding Window #150
1.1602 @dexhunter 0.025 3 Int6 STE + NorMuon + SWA + MLP3x + Sliding Window + U-Net skips #156
1.1605 @seanward 0.021 3 Int6 MLP3x + MTP + Sliding Window (mean 1.1625) #88
1.1605 @takhir-iota 0.022 3 Int6 MLP3x + Late-K Passthrough + SlidingWindow #99
1.1622 @vmfunc 0.021 3 NorMuon + int6 STE + SWA + sliding window #89
1.1632 @arjun-krishna1 0.020 3 AutoResearch agent + MLP 3x + STE int6 QAT + seq4096 #66
1.1642 @saikrishnarallabandi 0.018 3 Vocab 4096 + MLP 3x + Sliding Window #123
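
Several validated entries above (#606, #615, #578, #633) rely on legal, score-first TTT, so here is a minimal sketch of the pattern the Mar 24 ruling still permits. The chunk size, optimizer, and learning rate are placeholders; real submissions vary (SGD vs AdamW, cosine LR, freezing everything but norms or the last blocks):

```python
import torch
import torch.nn.functional as F

def score_first_ttt(model, ids: torch.Tensor,
                    chunk: int = 2048, lr: float = 1e-4,
                    ttt_steps: int = 1) -> float:
    """Backward-looking TTT: each chunk is scored with the current
    weights BEFORE the model trains on it, so no token's score ever
    benefits from the model having already seen that token."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    total_nll = 0.0
    for lo in range(0, ids.numel() - 1, chunk):
        x = ids[lo:lo + chunk + 1]
        with torch.no_grad():                       # 1) score first
            logits = model(x[:-1].unsqueeze(0))
            total_nll += F.cross_entropy(logits.squeeze(0), x[1:],
                                         reduction="sum").item()
        for _ in range(ttt_steps):                  # 2) then adapt on it
            opt.zero_grad()
            logits = model(x[:-1].unsqueeze(0))
            F.cross_entropy(logits.squeeze(0), x[1:]).backward()
            opt.step()
    return total_nll
```

The invariant that makes this legal is ordering: the no_grad scoring pass for chunk k completes strictly before any optimizer step that has seen chunk k. Pre-eval TTT breaks exactly this ordering, which is what the enforcement sweep targeted.
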
All Not Yet Self-Validated Submissions (21 entries)

Competitive submissions that haven't demonstrated ≥0.005-nat significance (a sketch of that check follows this table). Official SOTA: 1.1194 (updated Mar 24).

BPB Author Seeds Techniques PR
1.0891 @amaljithkuttamath 1 11L + Value Residual + Gated Attention + AdamW TTT on #442 base. Pre-quant 1.1545. ⚠️ TTT #490
1.0920 @Christopher-Lee-McClendon 1 GEPA 30k steps + int6 GPTQ-lite + legal SGD TTT. 4xA100 non-record. #668
1.0944 @Christopher-Lee-McClendon 1 GEPA 25k steps (13k warmdown) + int6 GPTQ-lite + legal SGD TTT. 4xA100 non-record. Float base 1.1088. #644
1.0983 @Christopher-Lee-McClendon 1 GEPA 20k steps (8k warmdown) + int6 GPTQ-lite + legal SGD TTT (10ep). 4xA100 non-record. Float base 1.1153. #628
1.1158 @Robby955 1 Full GPTQ on GEPA + XSA-all + SWA/EMA blend. No TTT. Quant gap 0.004. 8xH100. ⚠️ Calibration question. #639
1.1164 @Asukabot0 1 XSA-all + LeakyReLU² + VRL + GA (no VE128). No TTT. 1xH100 NVL. Pending 8xH100 3-seed. #638
1.1171 @raahilshah 3 XSA-all + Full GPTQ + Parallel Muon + Selective Pruning + LZMA. No TTT. Same #609 stack. 0.00394 nats (fails bar). #634
1.1180 @kshitizz36 3 Full GPTQ + LeakyReLU² + Parallel Muon. No TTT. Std=0.0010. Improves 0.00235 nats (fails 0.005-nat bar). #626
1.1190 @ChaosCodes 1 GPTQ int6 + SGD TTT + LeakyReLU² on #414 stack. A800 hardware (non-record). Est. ~1.122 on H100. #610
1.1194 @Joeavaib 3 9L "Maestro" arch + LeakyReLU² + Legal TTT + Parallel Muon + GPTQ-lite + LZMA. Ties SOTA (0.00006 nats). #625
1.1195 @newjordan 3 LeakyReLU² + XSA4 + GPTQ int6+zstd + legal TTT (neutral). Ties SOTA. #656
1.1246 @unnir 1 11L + Tight SWA (scale<0.2, zero penalty) + Shared VE128 (layers 9-10) + Partial RoPE + LN Scale + Late QAT + XSA4 + SmearGate + FA3 #374
1.1247 @greqone 1 #315 stack + Backout Connection + native FA3 + torch.compile #394
1.1260 @dannywillowliu-uchi 1 #374 stack + GPTQ-lite (per-layer clip percentile search). Self-Distillation TTT: neutral (−0.0003). #379
1.1354 @ibarrajo 1 11L + Partial XSA (last 3) + TTT + 524K batch + RoPE50K (no FA3) ⚠️ pre-eval TTT #290
1.1354 @simonbissonnette 3 11L + EMA + BigramHash(12288) + Mixed Int5 + FA3 (fails p<0.01: t=−6.0 vs −7.0) #466
1.1357 @dennisimoo 1 11L + XSA (last 4) + EMA + SmearGate + BigramHash(2048) + 524K batch + WD 0.04 + torch.compile (SDPA fallback) #307
1.1365 @ofirkris 2 10L + XSA4 + EMA + Partial RoPE + LN Scale + Int5-MLP/Int6-Attn + 3.2% pruning. No TTT. #458
1.1399 @Mapika 3 11L + XSA4 + EMA + SmearGate + BigramHash(2048) + Int5-MLP/Int6-Attn/Int8-Embed + 8% pruning (fails 0.005-nat by 0.00004) #349
1.1419 @chris-buckley 1 11L + XSA4 + EMA + TTT (pre-quant 1.1581; no FA3, SDPA fallback, 5344/9000 steps; seeds 2/3 pending) #317
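
For the entries above marked as failing the bar (#466's t = -6.0 against a roughly -7.0 threshold, #349's 0.00004-nat miss), this is approximately the arithmetic involved. A sketch under our assumptions (a one-sided, one-sample t-test on per-seed mean losses in nats/token; the actual harness may test differently, e.g. against the SOTA run's own seeds):

```python
import math
from statistics import mean, stdev

def passes_record_bar(seed_losses, sota_loss,
                      min_gain=0.005, t_crit=6.965) -> bool:
    """Record bar sketch: beat `sota_loss` by >= min_gain nats AND be
    statistically significant. With n = 3 seeds (df = 2), the
    one-sided p < 0.01 critical t-value is ~6.96, which matches the
    ~7.0 threshold cited for #466."""
    n = len(seed_losses)                        # typically 3 seeds
    gain = sota_loss - mean(seed_losses)        # positive = improvement
    t = gain / (stdev(seed_losses) / math.sqrt(n))
    return gain >= min_gain and t >= t_crit
```

With only three seeds the critical value is brutal: a run must be both clearly better and extremely consistent, which is why low per-seed std (e.g. #493's 0.00017) matters as much as the mean.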

Glossary
Term Meaning
BPB Bits Per Byte — compression quality. Lower = better
val_bpb BPB on FineWeb validation set
Muon Optimizer: orthogonalized gradients via Newton-Schulz
QAT/STE Quantization-Aware Training / Straight-Through Estimator
Int6/Int8 6-bit or 8-bit integer quantization
SWA/EMA Stochastic Weight Averaging / Exponential Moving Average
TTT Test-Time Training — adapting during evaluation
XSA Exclusive Self-Attention — removes self-value bias
FA3 FlashAttention 3 — optimized H100 attention kernel
LoRA Low-Rank Adaptation — tiny trainable matrices
zstd Zstandard compression (better than zlib)
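
For concreteness, the conversion behind every BPB number in this issue: sum the model's next-token cross-entropy in nats over the whole validation set, then divide by raw bytes. The 2.45 bytes/token in the comment below is illustrative, not measured from FineWeb:

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Summed next-token cross-entropy (nats) -> bits per byte.

    Dividing by raw BYTES rather than tokens is what makes the metric
    tokenizer-agnostic: a bigger vocab means fewer tokens but more
    nats per token, and the byte denominator cancels that out.
    """
    return total_nll_nats / (math.log(2) * total_bytes)

# Example: ~1.90 nats/token on text averaging 2.45 bytes/token gives
# 1.90 / (0.6931 * 2.45) ≈ 1.12 BPB, roughly the current frontier.
```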

Changelog
Time Update
Mar 25, 7 AM #659 (1.0920) ruled invalid — post-hoc oracle selection peeks at ground-truth token. +#668 (GEPA 30k→1.0920). New ruling: eval methods can't select scorer after seeing the label.
Mar 24, noon #609 reclassified non-record (calibration ruling). +#634, +#656 (ties SOTA at 1.1195). #650 closed.
Mar 24, 10:30 AM +#633 (PROTEUS v9, post-sweep legal LoRA TTT). +9 new Tier 2 techniques from web research.
Mar 24, 10:10 AM +3 Tier 2 techniques (prune-then-quantize, SLOT TTT, YAQA). #505 artifact >16MB. Enforcement sweep in rulings section.
Mar 24, 9:20 AM +#631 (attempting Tier 1 combo). Tier 2 trimmed. TTT+GPTQ nuance fixed.
Mar 24, 9:07 AM Enforcement sweep: @valerio-oai closed 15+ PRs. #593, #576, #503, #518, #548, #568, #596, #620 removed. GPTQ calibration at eval time disallowed.
Mar 24, 9 AM +#628 (sub-1.10 GEPA+legal TTT at 1.0983). +#599 (Hymba SSM). TTT deep dive condensed. Tier 3 tightened.
Mar 24, 8:50 AM +#626 (1.1180). Tiers updated. Predictions rewritten. Lineage 42→46. Audit: #535 demoted, stale refs fixed.
Mar 24, 8:37 AM #614/#605 closed. +#625 (ties SOTA). +#578. #609 calibration question. VRL + entropy coding dead at frontier.
Mar 24, 8 AM #573 ruled invalid. Closed PRs purged. Record: 11→6. Tier 2 trimmed (−6). Stale refs fixed.
Mar 24, 3–7 AM +#609 (1.1154 best non-TTT). +#606 (1.1162 best legal TTT, Soft-Round QAT). +#615 (1.1169). SOTA→1.1194 (#549 merged). #612 confirms GEPA+legal TTT at 1.1079.
Mar 23, 5–10 PM +#593 (1.1170). +#576 (1.1164). +#589 (1.1178). +#596 (❌ 0.6430 memorization). +DDL, Muon-VS, TEON techniques. Hadamard Rotation lineage. 6 URL fixes.
Mar 23, 1–5 PM #573 (1.0523) disputed then restructured. Pre-eval TTT excluded per #402. +emoji legend. +LaCT, CPSVD, qTTT. +#569 (1.1175 best non-TTT).
Mar 23, 6 AM–1 PM SOTA→1.1228. +#535, #545, #548, #549, #555. Legal TTT surge (#503, #473). TTT rules clarified.
Mar 23, midnight–6 AM +#512 sub-1.0 (0.9512). +#518 (1.0814, LeakyReLU²). +#517, #510, #532. Memorization floor analysis.
Mar 22, 10 PM–midnight +#505 (1.1181 GEPA non-TTT). +#508 (1.1215 legal TTT). Star-ReLU discovery.
Mar 22, 4–10 PM +#462–#499. GEPA TTT. Three-track frontier. LeakyReLU² (#493). Research (×5).
Mar 22, 10 AM–4 PM +#442 (1.1027 AdamW TTT). +#414 (1.1233). +#473 legal TTT.
Mar 21 #398 best (1.1221). #375 $500 negatives. +#362–401. TTT/prefix rulings.
Mar 19–20 Initial build. #315 best (1.1250). Leaderboard→1.1428. Core deep dives.

This commentary is generated by an AI (Claude) analyzing public PR data. No competition code is executed.
