
Record: X-WING 3D Cubric + Complementary Training (val_bpb=0.4820) #814

Open
newjordan wants to merge 8 commits into openai:main from newjordan:submission/xwing-cubric3d

Conversation


@newjordan newjordan commented Mar 26, 2026


Summary

  • val_bpb = 0.4820 (3-seed mean, std 0.0002)
  • Seeds: 1337 (0.4818), 300 (0.4821), 58 (0.4821)
  • 11L transformer (26.9M params) with LeakyReLU(0.5)², XSA-4, SWA, EMA
  • Artifact: 15,581,439 bytes (under 16MB)
  • Training: ~6820 steps in 600s on 8xH100 SXM
  • Eval: ~203s / 600s budget

Key Innovation: 3D Cubric Pattern Recognizer + Complementary Training

Two novel techniques stacked on chunk-based shared n-gram tables:

1. 3D Cubric (original)

54 adaptive multipliers across (order × entropy_bin × count_bin). Each cell independently tracks n-gram beat rates and adjusts its alpha multiplier. Captures patterns invisible to 1D scaling — e.g. "order 7 at mid-entropy with high count → trust fully (2.0x)" vs "order 3 at any entropy → suppress (0.30x)".
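A minimal sketch of how such a 3D cubric could work. The 6 × 3 × 3 = 54 cell layout (orders 3-8), the entropy/count bin edges, the learning rate, and the 0.30x-2.0x clip range are all assumptions for illustration; the PR only states the cell count and the two endpoint behaviors.

```python
import numpy as np

# Hypothetical cell grid: 6 orders x 3 entropy bins x 3 count bins = 54 cells.
ORDERS, ENT_BINS, CNT_BINS = 6, 3, 3
LO, HI, LR = 0.30, 2.0, 0.05  # suppress .. trust-fully range, assumed step size

class Cubric3D:
    def __init__(self, warm_start=None):
        shape = (ORDERS, ENT_BINS, CNT_BINS)
        # Warm-start: initialize at previously converged values instead of 1.0.
        self.mult = np.full(shape, 1.0) if warm_start is None else np.array(warm_start, float)
        self.beats = np.zeros(shape)
        self.seen = np.zeros(shape)

    @staticmethod
    def _cell(order, entropy, count):
        e = 0 if entropy < 2.0 else (1 if entropy < 5.0 else 2)  # assumed bin edges
        c = 0 if count < 4 else (1 if count < 32 else 2)
        return (order - 3, e, c)  # orders 3..8 -> rows 0..5

    def update(self, order, entropy, count, ngram_beat_model):
        """Track the cell's n-gram beat rate and nudge its alpha multiplier."""
        i = self._cell(order, entropy, count)
        self.seen[i] += 1
        self.beats[i] += float(ngram_beat_model)
        rate = self.beats[i] / self.seen[i]           # beat rate in [0, 1]
        target = LO + (HI - LO) * rate                # 0.30x (never beats) .. 2.0x (always beats)
        self.mult[i] += LR * (target - self.mult[i])  # move multiplier toward its target
        return self.mult[i]
```

A cell whose n-grams consistently beat the model drifts toward 2.0x, while a consistently losing cell sinks toward 0.30x, matching the two example cells quoted above.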

Warm-start: multipliers initialize at proven converged values instead of 1.0. Full cubric power from chunk 1 instead of wasting ~30 of 60 chunks converging.

2. Complementary Training (adapted from PR #803)

During training, tokens predictable by bigram statistics receive lower loss weight (COMPLEMENT_ALPHA=0.5). The model specializes on tokens n-grams can't predict. This enables higher eval-time alpha (20-75% vs 5-70%) because the model is deliberately weak where n-grams are strong.
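The weighting scheme can be sketched as follows. The COMPLEMENT_ALPHA=0.5 weight is from the PR; the log-probability threshold used to decide which tokens count as "bigram-predictable" is a hypothetical choice.

```python
import numpy as np

COMPLEMENT_ALPHA = 0.5  # loss weight for bigram-predictable tokens (from the PR)

def complementary_loss(token_nll, bigram_logprob, threshold=np.log(0.25)):
    """Down-weight tokens the training-set bigram table already predicts well.

    `threshold` is a hypothetical cut: tokens whose bigram log-probability
    exceeds it are treated as predictable and get weight COMPLEMENT_ALPHA.
    """
    token_nll = np.asarray(token_nll, float)
    bigram_logprob = np.asarray(bigram_logprob, float)
    w = np.where(bigram_logprob > threshold, COMPLEMENT_ALPHA, 1.0)
    return float((w * token_nll).sum() / w.sum())
```

The weighted loss shifts gradient budget onto the hard tokens: the model deliberately underfits where n-grams are strong, which is what makes the higher eval-time alphas safe.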

3. Shared N-gram Tables

All 8 GPU ranks update tables with the same chunk tokens → every rank sees the full 62M-token picture (vs 1/8 with rank-local). Insight from @deanbrr (PR #779).
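The difference from rank-local tables can be shown with a toy bigram counter (names and the bigram order are illustrative, not the PR's actual table code):

```python
from collections import Counter

def build_tables(chunks, world_size):
    """Toy bigram tables: every rank ingests the FULL chunk.

    A rank-local scheme would feed each rank only chunk[rank::world_size];
    the shared scheme gives every rank the complete token stream.
    """
    tables = [Counter() for _ in range(world_size)]
    for chunk in chunks:
        for rank in range(world_size):
            tables[rank].update(zip(chunk, chunk[1:]))  # count all bigrams in the chunk
    return tables
```

After processing, every rank's table is identical and holds the full counts, rather than each rank seeing a 1/world_size slice of the data.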

Ablation

| Variant | BPB | Delta | Key change |
| --- | --- | --- | --- |
| Podracer III (#782) | 0.9362 | — | rank-local tables |
| X-WING v1 (#800) | 0.5644 | -0.372 | shared tables + 1D cubric |
| + 3D cubric + complementary | 0.4896 | -0.075 | 54 multipliers + CT |
| + warm-start (this) | 0.4820 | -0.008 | converged init values |

Legality

  1. Score-first: entire chunk scored BEFORE its tokens update tables
  2. Complementary training: uses only training-data bigram statistics — no validation data during training
  3. Alpha formula: (1-α)·P_neural + α·P_ngram where α is a fixed function of model entropy × cubric multipliers — target-independent, committed before scoring
  4. Cubric multipliers: adapt using beat-rate statistics from already-scored tokens (backward-looking only)
  5. Warm-start values: derived from prior run's convergence, not from validation data — equivalent to a hyperparameter choice
  6. No oracle selection: single committed mixture, no min-NLL comparison
  7. GPTQ calibration: runs inside training wallclock
  8. Committed distribution: proper mixture, all tokens have nonzero probability
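The mixture in item 3 can be sketched as below. The alpha range matches the 20-75% quoted above; the entropy ramp (linear up to an 8-bit ceiling) is a hypothetical stand-in for the committed alpha curve.

```python
import numpy as np

ALPHA_LO, ALPHA_HI = 0.20, 0.75  # eval-time alpha range quoted in the PR

def committed_mixture(p_neural, p_ngram, model_entropy, cubric_mult):
    """Target-independent blend: alpha depends only on model entropy and the
    cubric multiplier, never on the target token. The linear entropy ramp
    below is an assumed shape, not the PR's exact curve."""
    ramp = min(model_entropy / 8.0, 1.0)  # hypothetical 8-bit entropy ceiling
    alpha = float(np.clip((ALPHA_LO + (ALPHA_HI - ALPHA_LO) * ramp) * cubric_mult,
                          0.0, ALPHA_HI))
    p = (1.0 - alpha) * np.asarray(p_neural, float) + alpha * np.asarray(p_ngram, float)
    return p  # convex mixture of two distributions: sums to 1, no zero-mass tokens
```

Because alpha is fixed before the target is seen and the result is a convex combination of two proper distributions, items 3 and 8 hold by construction.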

Credits

Reproduce

SEED=1337 NPROC_PER_NODE=8 bash concepts/xwing_yellow_III/run.sh

8xH100 SXM, 600s training + ~203s eval.

Test plan

  • Seed 1337: 0.4818 BPB
  • Seed 300: 0.4821 BPB
  • Seed 58: 0.4821 BPB
  • 3-seed mean: 0.4820 BPB (std 0.0002)
  • All seeds under 16MB artifact limit
  • All seeds complete within 10 min training + 10 min eval

Octavian and others added 8 commits March 26, 2026 00:23
3D cubric pattern recognizer (54 warm-started adaptive multipliers)
+ complementary training. Seeds: 1337=0.4818, 300=0.4821, 58=0.4821.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three variants targeting the 0.187 BPB gap to openai#1:
- bwing_alpha: clip 0.95, alpha 0.05-0.60 (isolate alpha curve)
- bwing_entropy_shift: per-order entropy center shift (isolate)
- bwing_full_port: all openai#809 techniques + fixed order mults (fire first)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Cubric 3D back online (CADENCE=32, warm-start)
- Per-order entropy center shift from openai#809
- Alpha 0.05-0.60, clip 0.95
- Our sliding-window TTT spliced in (1 epoch, SGD, freeze 2 blocks)
- TTT runs BEFORE n-gram eval → adapted model feeds n-gram

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Port openai#809 LoRA TTT: rank-8 adapters on Q/V/LM head, AdamW, Polyak
- Add LoRA injection to CausalSelfAttention, Block, GPT forward paths
- 53s vs our old 410s TTT, 6x better BPB gain
- Cubric 3D ON + entropy shift + alpha 0.05-0.60 clip 0.95

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
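The rank-8 adapter injection this commit describes can be sketched with a minimal frozen-weight-plus-low-rank-update layer. Rank 8 matches the commit; the zero-init of B and the alpha/rank scaling are standard LoRA conventions assumed here, and the class name is illustrative.

```python
import numpy as np

class LoRALinear:
    """Frozen base weight W plus a trainable rank-r update (B @ A) * scale."""

    def __init__(self, W, rank=8, alpha=16.0, seed=0):
        out_f, in_f = W.shape
        rng = np.random.default_rng(seed)
        self.W = W                                    # frozen during TTT
        self.A = rng.normal(0.0, 0.02, (rank, in_f))  # trainable down-projection
        self.B = np.zeros((out_f, rank))              # zero-init: adapter starts as a no-op
        self.scale = alpha / rank

    def __call__(self, x):
        return x @ self.W.T + (x @ self.A.T) @ self.B.T * self.scale
```

Zero-initializing B means the wrapped layer is exactly the base layer at step 0, so test-time training can only move away from the pretrained behavior as the adapter learns.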
Fixed mults + entropy shift + alpha 0.05-0.60 clip 0.95 (no cubric).
Base sliding: 1.1194, n-gram9: 0.4512. Delta from X-WING: -0.031.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Deleted LoRA TTT abomination. bwing_III is now a clean copy of our
best scoring variant for further iteration.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
bwing_IV: Prime fix only — adds primes 283721, 347237 to eliminate
XOR hash collisions for orders 8-9 (the 2.0x multiplier orders).
With 7 primes, prime[7] wrapped to prime[0], causing context tokens
at positions j-8 and j-1 to cancel when equal.

bwing_V: Prime fix + cubric 3D stacked on top of fixed mults.
Cubric warm-starts at 1.0 (neutral) and refines per (order × entropy
× count) on top of the fixed order multiplier scaling.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
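The collision described in the bwing_IV note can be reproduced with a toy XOR hash. The seven base primes below are placeholders; 283721 and 347237 are the two primes the commit adds.

```python
# With only 7 primes, positions 0 and 7 of an order-8 context both map to
# primes[0], so equal tokens at those positions XOR-cancel and distinct
# contexts collide. The 7 base primes are placeholder values.
PRIMES7 = [1000003, 1000033, 1000037, 1000039, 1000081, 1000099, 1000117]
PRIMES9 = PRIMES7 + [283721, 347237]  # the commit's fix: one prime per position

def ngram_hash(ctx, primes):
    h = 0
    for i, tok in enumerate(ctx):
        h ^= tok * primes[i % len(primes)]  # index wraps when len(ctx) > len(primes)
    return h
```

With 9 primes, an order-8 or order-9 context never wraps, so no pair of positions shares a multiplier and the cancellation disappears.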
Adapted from old setup.sh. Fixes FA3 detection (old one skipped FA3
when FA2 was present), uses sp1024 dataset, adds zstandard install.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

garindean commented Mar 26, 2026

are you calibrating gptq after the wallclock cap fires?

