
Record: 5-expert Hedge Mixer + TTT (3-seed mean val_bpb=1.0745)#688

Open
RoyiRa wants to merge 3 commits into openai:main from RoyiRa:submission-2026-03-24

RoyiRa commented Mar 25, 2026

Summary

3-seed mean val_bpb: 1.0745 (std 0.021) | <15.5 MB | 8xH100 SXM, 600s

Results

| Seed | Pre-TTT BPB | Post-TTT BPB | Artifact |
|------|-------------|--------------|----------|
| 1337 | 1.1248 | 1.0560 | 15.48 MB |
| 42   | 1.1257 | 1.0970 | 15.41 MB |
| 7    | 1.1251 | 1.0704 | 15.43 MB |
| Mean | 1.1252 | 1.0745 | |

Key Technique: 5-expert Logistic Context Mixer

GPU-vectorized online context mixing using the Hedge algorithm. Five experts blend their predictions in log-probability space during TTT eval:

| Expert | Source |
|--------|--------|
| Neural | Base model log-softmax |
| Unigram | Token frequency from scored tokens |
| Bigram | P(next \| prev) from scored tokens |
| Trigram | Hashed P(next \| prev2, prev1) with 64K buckets |
| Entropy | Neural model entropy as confidence regularizer |

N-gram tables are built incrementally from already-scored tokens only (so the eval remains legal). Expert weights are updated online via the Hedge rule: `log_w -= eta * loss`.
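The incremental hashed-trigram expert described above can be sketched as follows. This is an illustrative minimal version, not the PR's actual code: the bucket hash, add-one smoothing, and the assumed vocabulary size are my own choices; only the 64K bucket count and the "update only after scoring" rule come from the description.

```python
# Hedged sketch of an incremental hashed-trigram expert (illustrative,
# not the PR's implementation). Counts come only from already-scored tokens.
import math
from collections import defaultdict

NUM_BUCKETS = 64 * 1024  # 64K hash buckets, as stated in the PR description
VOCAB = 50257            # assumed vocab size (GPT-2 BPE); illustrative only

class HashedTrigram:
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # bucket -> next token -> count
        self.totals = defaultdict(int)                       # bucket -> total count

    @staticmethod
    def bucket(prev2: int, prev1: int) -> int:
        # Cheap hash of the (prev2, prev1) context into 64K buckets
        return (prev2 * 1000003 + prev1) % NUM_BUCKETS

    def nll(self, prev2: int, prev1: int, nxt: int) -> float:
        # Add-one-smoothed P(next | prev2, prev1) over the hashed context
        b = self.bucket(prev2, prev1)
        p = (self.counts[b][nxt] + 1) / (self.totals[b] + VOCAB)
        return -math.log(p)

    def update(self, prev2: int, prev1: int, nxt: int) -> None:
        # Called only after the token has been scored, keeping the eval legal
        b = self.bucket(prev2, prev1)
        self.counts[b][nxt] += 1
        self.totals[b] += 1
```

The unigram and bigram experts follow the same pattern with shorter contexts (and no hashing needed).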

Each expert produces an NLL for every token. The mixer maintains one learned weight per expert, updated via the Hedge algorithm. At each position, the mixed prediction is:

`mixed_NLL = -log(sum_k w_k * exp(-NLL_k))`
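The mixing formula and Hedge update above can be sketched as a small sequential loop. This is a minimal NumPy illustration under my own assumptions (the `eta` value and shapes are placeholders; the PR's version is GPU-vectorized), but it implements the same two equations: the log-probability-space mix and `log_w -= eta * loss`.

```python
# Hedged sketch of the online Hedge mixer (illustrative, not the PR's
# GPU-vectorized implementation).
import numpy as np

def hedge_mix(nlls: np.ndarray, eta: float = 0.1) -> np.ndarray:
    """Mix per-expert NLLs online with multiplicative weights (Hedge).

    nlls: shape (T, K) -- NLL of each of K experts at each of T positions.
    Returns the mixed NLL per position, shape (T,).
    """
    T, K = nlls.shape
    log_w = np.zeros(K)          # uniform initial weights in log space
    mixed = np.empty(T)
    for t in range(T):
        # Normalize weights: w_k = softmax(log_w)
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        # mixed_NLL = -log(sum_k w_k * exp(-NLL_k))
        mixed[t] = -np.log(np.sum(w * np.exp(-nlls[t])))
        # Hedge update: penalize each expert in proportion to its loss
        log_w -= eta * nlls[t]
    return mixed
```

Because the mix happens in probability space, the mixed NLL always lies between the best and worst expert at each position, and the weights shift online toward whichever expert has been predicting best so far.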

Training Budget

GPTQ calibration runs within the 600s training budget (18s reserved).

| Phase | Time |
|-------|------|
| Training loop | 582s |
| EMA + GPTQ calibration + quantization | ~18s |
| Total training | ~600s |
| TTT eval with mixer | ~562s |

Reproduction

```shell
pip install -r requirements.txt
SEED=1337 MAX_WALLCLOCK_SECONDS=600 USE_MIXER=1 TTT_LR=0.0001 TTT_CHUNK_TOKENS=131072 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Credits

