Record: Complementary Training + Backoff N-gram Mixer — 0.4377 BPB #1

Open
quietsmile wants to merge 1 commit into main from
submission/complementary-backoff-ngram-mixer
Conversation

@quietsmile
Owner

Summary

Key Techniques

  1. Complementary Training (COMPLEMENT_ALPHA=0.5): bigram-weighted loss reweighting
  2. BackoffNgramMixer: orders 2-10, entropy-adaptive alpha mixing
  3. Legal score-first AdamW TTT: 4 epochs, lr=5e-4, freeze first 2 blocks
  4. Stride=128: negligible BPB impact, halves eval time
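
Technique 1 can be sketched as follows. This is a hypothetical reconstruction, not the PR's actual code: the interpolation formula, the function names, and the assumption that a bigram cache supplies a per-token probability are all mine.

```python
# Hypothetical sketch of bigram-weighted complementary loss reweighting.
# Tokens the bigram cache predicts well get downweighted, pushing the
# neural model to specialize on tokens the cache cannot predict.
COMPLEMENT_ALPHA = 0.5

def complement_weights(bigram_probs, alpha=COMPLEMENT_ALPHA):
    # weight = (1 - alpha) + alpha * (1 - p_bigram)
    # p_bigram = 0 (cache clueless)  -> weight 1.0
    # p_bigram = 1 (cache perfect)   -> weight 1 - alpha
    return [(1 - alpha) + alpha * (1 - p) for p in bigram_probs]

def reweighted_loss(token_losses, bigram_probs, alpha=COMPLEMENT_ALPHA):
    # Weighted mean of per-token cross-entropy losses.
    w = complement_weights(bigram_probs, alpha)
    return sum(wi * li for wi, li in zip(w, token_losses)) / sum(w)
```

With `alpha=0.5` the weights span [0.5, 1.0], so even perfectly cached tokens keep half weight rather than being dropped from the loss entirely.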

Acknowledgment

Based on PR openai#803 by @pentxayc. Core innovation of complementary training is their contribution.

Results

| Seed | Steps | val_bpb | eval_time |
|------|-------|---------|-----------|
| 1337 | 7,003 | 0.4377  | 450s      |
| 42   | 7,011 | 0.4380  | 450s      |

Reproduction of PR openai#803's complementary training approach on 8x L20Z (H100).
Two-seed validation: 0.4377 (seed=1337), 0.4380 (seed=42).

Key: bigram-weighted loss reweighting (COMPLEMENT_ALPHA=0.5) trains the
neural model to specialize on tokens that n-gram caches can't predict,
combined with the BackoffNgramMixer (orders 2-10) and legal score-first
AdamW TTT.
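
The entropy-adaptive mixing mentioned above might look roughly like the sketch below. This is an illustrative guess at the idea, not the submission's implementation: the function names, the linear entropy-to-alpha mapping, and the single-order mixing (the real mixer backs off across orders 2-10) are assumptions.

```python
import math

def entropy(dist):
    # Shannon entropy (nats) of a probability distribution.
    return -sum(p * math.log(p) for p in dist if p > 0)

def entropy_adaptive_alpha(ngram_dist, max_alpha=0.5):
    # Confident (low-entropy) n-gram predictions get more mixing weight;
    # a uniform (maximum-entropy) prediction gets none.
    h_max = math.log(len(ngram_dist))
    return max_alpha * (1 - entropy(ngram_dist) / h_max) if h_max > 0 else max_alpha

def mix(neural_dist, ngram_dist):
    # Convex combination of the two next-token distributions.
    a = entropy_adaptive_alpha(ngram_dist)
    return [(1 - a) * pn + a * pg for pn, pg in zip(neural_dist, ngram_dist)]
```

Because the mix is a convex combination of two valid distributions, the output still sums to 1 and can be scored directly for BPB.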

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>