Record Submission: 1.0541 BPB - 5-expert Hedge Mixer + CROWN-Q + stride=64 by RoyiRa · Pull Request #700 · openai/parameter-golf

RoyiRa · 2026-03-25T10:48:55Z

Record: 5-expert Hedge Mixer + CROWN-Q + stride=64 (val_bpb=1.0541)

val_bpb: 1.0541 (3-seed mean) | ~15.7 MB | 8xH100 SXM

Results (8xH100 80GB SXM)

Seed	step_avg	steps	Pre-TTT bpb	Post-TTT bpb	TTT gain	Eval time	Artifact
1337	98.1ms	5,935	1.1251	1.0473	-0.0778	336s	15.89 MB
42	97.9ms	5,947	1.1264	1.0686	-0.0578	336s	15.69 MB
7	98.0ms	5,940	1.1246	1.0465	-0.0781	336s	15.66 MB
Mean			1.1254	1.0541	-0.0713	336s	~15.75 MB
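
As a quick sanity check, the Mean row can be recomputed from the per-seed values above. This is only a sketch over the numbers in the table; the variable names are illustrative and not taken from the submission's code.

```python
from statistics import mean

# Per-seed values copied from the results table (seeds 1337, 42, 7).
pre_ttt = [1.1251, 1.1264, 1.1246]
post_ttt = [1.0473, 1.0686, 1.0465]

mean_pre = round(mean(pre_ttt), 4)
mean_post = round(mean(post_ttt), 4)
# The table's mean gain corresponds to the difference of the rounded means.
mean_gain = round(mean_post - mean_pre, 4)

print(mean_pre, mean_post, mean_gain)
```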

Contributions

1. CROWN-Q Training Penalty (training-time)

Added a quantization-aware penalty during warmdown that penalizes weights sensitive to quantization error:

crown_q_loss = lambda * mean(w^2 * delta^2 / 12)

where delta = row_max / clip_range is the per-row quantization step size. This encourages weights to be quantization-friendly, reducing post-quantization degradation. CROWN_Q_LAMBDA=0.01.

Effect: Slightly better compression (artifact ~200KB smaller) and more robust quantization.
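
For illustration, the penalty formula can be sketched in plain Python. The term delta^2 / 12 is the variance of a rounding error spread uniformly over one quantization step of width delta. This is a minimal sketch under assumptions, not the code in train_gpt.py: `row_max` is taken to be the largest absolute weight in each row, `clip_range=15` is a placeholder value, and the function name is hypothetical. The real training script presumably computes this on tensors.

```python
def crown_q_penalty(weight, clip_range=15, lam=0.01):
    """Sketch of lam * mean(w^2 * delta^2 / 12) with a per-row step size.

    weight: list of rows (lists of floats), standing in for a 2-D weight matrix.
    clip_range: assumed number of quantization levels on each side of zero.
    """
    total = 0.0
    count = 0
    for row in weight:
        # Assumption: row_max is the largest absolute value in the row.
        row_max = max(abs(w) for w in row)
        delta = row_max / clip_range  # per-row quantization step size
        for w in row:
            total += (w * w) * (delta * delta) / 12.0
            count += 1
    return lam * total / count


# Tiny example matrix; the values are arbitrary.
example = [[1.0, -2.0], [0.5, 0.5]]
penalty = crown_q_penalty(example, clip_range=2, lam=0.01)
```

Because delta is proportional to the row maximum, the penalty grows with the fourth power of the weight scale, so rows with large outliers are penalized most.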

2. Eval stride 32 -> 64 (eval-time)

Changed the sliding-window stride from 32 to 64 during evaluation. An experiment showed identical BPB with 2x faster scoring, which frees ~100s of eval budget for more TTT epochs.
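
The 2x figure follows from how sliding-window scoring advances through the data. The sketch below assumes the common scheme in which the first window scores all of its tokens and each later window scores only the tokens not yet covered; the function name, the window length of 1024, and the token count are hypothetical and not taken from the submission.

```python
def sliding_windows(n_tokens, seq_len, stride):
    """Return (start, end, score_from) triples for sliding-window scoring.

    Each window feeds tokens [start, end) to the model but only scores
    tokens [score_from, end), so every token is scored exactly once.
    """
    windows = []
    scored_upto = 0
    start = 0
    while scored_upto < n_tokens:
        end = min(start + seq_len, n_tokens)
        windows.append((start, end, scored_upto))
        scored_upto = end
        if end == n_tokens:
            break
        start += stride
    return windows


# Hypothetical sizes, chosen only to compare the two strides.
w32 = sliding_windows(4096, 1024, 32)
w64 = sliding_windows(4096, 1024, 64)
print(len(w32), len(w64))
```

Doubling the stride roughly halves the number of forward passes, at the cost of giving the earliest scored token in each window 32 fewer tokens of context.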

3. TTT Epochs 3 -> 4 (eval-time)

Increased test-time training from 3 to 4 epochs per chunk, using the time freed by stride=64. Each additional epoch adapts the model further to already-scored data. Tested 8 epochs, but that overfits (1.0735 vs. 1.0473 BPB for 4 epochs).

Combined Effect

stride=64 saves ~100s of eval time
4th TTT epoch uses ~85s of the saved time
Net eval time: ~336s (down from ~562s), well within 600s budget
BPB improvement: 1.0745 -> 1.0541 (-0.0204)

Architecture

Component	Setting
Layers	11 (512d, 8H, 8KV)
MLP	3.5x with LeakyReLU(0.5)^2
BigramHash	6144 (dim=128)
XSA	All 11 layers (ws=8)
VE128	Layers 9-10
Quantization	Full GPTQ int5 + zstd level 22
Pruning	3% magnitude
TTT	AdamW lr=0.0001, 4 epochs, 131K chunks, Polyak 0.998
Mixer	5-expert Hedge (neural, unigram, bigram, trigram, entropy)
Training reserve	18s (for EMA + calibration + quantization)
Early warmdown	LR schedule targets 582s
CROWN-Q	lambda=0.01 during warmdown
Eval stride	64 (was 32)
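
The Mixer row refers to the Hedge (exponential weights) algorithm, which keeps one weight per expert and shrinks it exponentially in that expert's cumulative loss. The following is a minimal sketch of the textbook update, assuming log loss and the MIXER_ETA=0.1 learning rate; the class and method names are hypothetical, and the submission's actual mixer may differ in detail.

```python
import math


class HedgeMixer:
    """Textbook Hedge over n experts, using log loss on the observed token."""

    def __init__(self, n_experts, eta=0.1):
        self.eta = eta
        self.log_w = [0.0] * n_experts  # log-weights, uniform at the start

    def weights(self):
        # Softmax over log-weights, shifted by the max for numerical stability.
        m = max(self.log_w)
        exps = [math.exp(lw - m) for lw in self.log_w]
        s = sum(exps)
        return [e / s for e in exps]

    def mix(self, expert_probs):
        # Mixture probability of a token, given each expert's probability for it.
        return sum(w * p for w, p in zip(self.weights(), expert_probs))

    def update(self, expert_probs):
        # expert_probs: probability each expert assigned to the observed token.
        for i, p in enumerate(expert_probs):
            loss = -math.log(max(p, 1e-12))
            self.log_w[i] -= self.eta * loss


mixer = HedgeMixer(n_experts=2, eta=0.1)
p_before = mixer.mix([0.5, 0.1])
for _ in range(50):
    mixer.update([0.9, 0.1])  # expert 0 is consistently better
final_weights = mixer.weights()
```

Updating only after a token has been scored keeps the mixture causal: the weights used for a prediction depend solely on losses from earlier tokens.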

Reproduction

DATA_PATH=../data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=../data/tokenizers/fineweb_1024_bpe.model \
SEED=1337 MAX_WALLCLOCK_SECONDS=600 \
USE_MIXER=1 MIXER_ETA=0.1 \
TTT_EPOCHS=4 TTT_FREEZE_BLOCKS=2 \
TTT_LR=0.0001 TTT_CHUNK_TOKENS=131072 \
ADAPTIVE_LR=1 ADAPTIVE_LR_MAX=3.0 \
EVAL_STRIDE=64 \
CROWN_Q_LAMBDA=0.01 \
torchrun --standalone --nproc_per_node=8 train_gpt.py

Compliance

Constraint	Limit	Actual	Status
Train time	600s	582s	Pass
Eval time	600s	336s	Pass
Artifact size	16,000,000 bytes	15,892,040 bytes (worst seed)	Pass
No pre-scoring training	—	Score-first TTT: each chunk scored under `inference_mode()` before any training on it	Pass
GPTQ calibration in training budget	—	Runs within 18s training reserve (1.9s actual)	Pass
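
The score-first rule in the table above can be illustrated with a toy loop. In this sketch a count-based unigram model stands in for the neural network and its AdamW updates, so only the ordering (score a chunk, then train on it) reflects the description; all names are hypothetical, and `inference_mode()` is not reproduced here.

```python
import math
from collections import Counter


class ToyUnigram:
    """Toy stand-in for the model: add-one smoothed unigram counts."""

    def __init__(self, vocab_size):
        self.vocab_size = vocab_size
        self.counts = Counter()
        self.total = 0

    def bits(self, token):
        p = (self.counts[token] + 1) / (self.total + self.vocab_size)
        return -math.log2(p)

    def train(self, tokens):
        self.counts.update(tokens)
        self.total += len(tokens)


def score_first_ttt(model, tokens, chunk_tokens, epochs):
    """Average bits per token when each chunk is scored before training on it."""
    total_bits = 0.0
    for i in range(0, len(tokens), chunk_tokens):
        chunk = tokens[i:i + chunk_tokens]
        # 1) Score the chunk with the current model; no updates happen yet.
        total_bits += sum(model.bits(t) for t in chunk)
        # 2) Only afterwards adapt the model on the chunk that was just scored.
        for _ in range(epochs):
            model.train(chunk)
    return total_bits / len(tokens)


data = [0] * 8
one_chunk = score_first_ttt(ToyUnigram(2), data, chunk_tokens=8, epochs=4)
two_chunks = score_first_ttt(ToyUnigram(2), data, chunk_tokens=4, epochs=1)
```

With a single chunk the model never benefits from adaptation, since all tokens are scored before any training; with two chunks the second is scored by a model adapted on the first.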

Credits

Base model: PR #414, "Record: 11L EMA + GPTQ-lite + warmdown3500 + QAT@0.15 (val_bpb=1.1233)", by @signalrush
TTT recipe: PR #461, "Non-record: 11L Depth Recurrence + High-Yield Legal TTT (1.14458 BPB)", by @Christopher-Lee-McClendon
CROWN-Q concept: PR #693, "Record: CROWN-Q + Full GPTQ + SWA/EMA Blend — val_bpb 1.1186 (3-seed mean)", by @EthanYangTW
5-expert Hedge mixer: PR #688, "Record: 5-expert Hedge Mixer + TTT (3-seed mean val_bpb=1.0745)"


Built on PR openai#700 with hyperparameter improvements found via autoresearch-multi combinatorial search:
- XSA_LAST_N=6 (extended from 4 to 6 layers)
- BIGRAM_VOCAB_SIZE=4096 (doubled from 2048)

3-seed mean: 1.1078 (std 0.0045)
Seeds: 42=1.1045, 1337=1.1061, 2025=1.1129

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>


notapplica mentioned this pull request Mar 25, 2026

Parameter Golf Live AI Commentary + Analysis / Ideas | every 10 minutes #140

Open

agalimova mentioned this pull request Mar 25, 2026

Record Submission: 1.1078 BPB — XSA6 + BigramHash4K on Hedge Mixer Stack #720

Open

5 tasks

RoyiRa force-pushed the submission/2026-03-25-hedge-mixer-crown-q branch 2 times, most recently from 30e7835 to 57d1d2c Compare March 25, 2026 14:27

RoyiRa wants to merge 1 commit into openai:main from RoyiRa:submission/2026-03-25-hedge-mixer-crown-q
