
Non-record: Fixed Bank QAT + XSA5 + Label Smoothing (1.1352) #667

Open
suchitj2702 wants to merge 1 commit into openai:main from suchitj2702:submission/fixed-qat-xsa5-label-smoothing


@suchitj2702

Summary

Non-record experimental submission exploring three changes on top of the LeakyReLU + Legal TTT + Parallel Muon stack:

  • Fixed broken Bank QAT: Implemented straight-through-estimator (STE) int6 fake-quantization directly in GPT.forward() for all bank parameters. The SOTA's QAT is dead code for two reasons: bank params bypass CastedLinear, and torch.compile constant-folds the _qat_enabled flag.
  • XSA expanded to 5 layers (from 4)
  • Label smoothing 0.05 added to cross-entropy loss
  • TTT hyperparameter tuning: LR 0.003 (from 0.002), momentum 0.95 (from 0.9)
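
To make the first bullet concrete, here is a minimal sketch of int6 fake-quantization with a straight-through estimator. It is illustrative only and is not the code from this PR: the function names `fake_quant_int6` and `ste_backward`, the symmetric per-tensor scale, and the signed range [-32, 31] are all assumptions, and plain Python lists stand in for tensors.

```python
def fake_quant_int6(weights, qmin=-32, qmax=31):
    """Quantize-dequantize floats onto a signed 6-bit grid (forward pass).

    Assumption: one symmetric scale per tensor, chosen so the largest
    magnitude maps to qmax. The PR may use a different scaling scheme.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    dequantized = []
    for w in weights:
        q = round(w / scale)           # snap to the nearest integer level
        q = max(qmin, min(qmax, q))    # clamp into the int6 range
        dequantized.append(q * scale)  # map back to float for the matmul
    return dequantized, scale


def ste_backward(grad_output):
    """Straight-through estimator: treat rounding as the identity in backward.

    The upstream gradient reaches the full-precision weights unchanged.
    """
    return list(grad_output)


if __name__ == "__main__":
    w = [0.5, -1.0, 0.26, 0.0, 0.9]
    w_q, scale = fake_quant_int6(w)
    print("scale:", scale)
    print("fake-quantized:", w_q)
    print("grad w.r.t. w:", ste_backward([0.1, -0.2, 0.3, 0.0, 0.05]))
```

In an autograd framework the same effect is commonly obtained by computing `w + (w_q - w).detach()`, so the forward pass sees the quantized value while the gradient flows to `w`. Whether this PR uses that exact form is not stated.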

Result: 1.1352 BPB (best run, QAT disabled), which does not beat SOTA (1.1194). Submitted as a non-record with findings.

Key Findings

| Change | Impact |
| --- | --- |
| QAT fix | Sound idea but recompilation costs ~50s + 5ms/step → 460 fewer training steps |
| Label smoothing | Counterproductive: model is compute-limited, not overfitting |
| XSA5 | Neutral to slightly negative vs XSA4 |
| TTT LR/momentum tuning | Original values (0.002/0.9) were better |
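
For readers unfamiliar with the label-smoothing finding above, the sketch below shows what a smoothing value of 0.05 does to a cross-entropy loss. It is a hedged illustration, not the PR's implementation: the function name `smoothed_cross_entropy` is hypothetical, and it assumes the common convention in which the smoothing mass is spread uniformly over all classes, including the target.

```python
import math


def smoothed_cross_entropy(logits, target, smoothing=0.05):
    """Cross-entropy against a label-smoothed target distribution.

    Assumed convention: the target class gets (1 - smoothing) plus its
    share of the uniform mass; every class gets smoothing / num_classes.
    """
    # Numerically stable log-softmax.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]

    n = len(logits)
    loss = 0.0
    for i, lp in enumerate(log_probs):
        q = smoothing / n + ((1.0 - smoothing) if i == target else 0.0)
        loss -= q * lp
    return loss


if __name__ == "__main__":
    confident = [5.0, 0.0, 0.0]  # model strongly prefers the correct class 0
    print("plain CE:   ", smoothed_cross_entropy(confident, 0, smoothing=0.0))
    print("smoothed CE:", smoothed_cross_entropy(confident, 0, smoothing=0.05))
```

Smoothing penalizes confident correct predictions, which acts as a regularizer. That fits the finding: if the model is limited by training compute rather than by overfitting, the extra regularization has little to offer.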

Test plan

  • Ran on 1xH100 SXM (smoke test, 907 steps) — all code paths work
  • Ran on 8xH100 SXM with QAT enabled — 6,719 steps, 1.1376 BPB
  • Ran on 8xH100 SXM without QAT — 7,062 steps, 1.1352 BPB
  • Artifact size: 15.44 MB (under 16 MB cap)
  • Eval time: ~530s (under 600s cap)
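
The figures in the test plan can be compared directly. The short script below only does arithmetic on numbers quoted in this PR (steps, BPB, artifact size, eval time, and the SOTA value from the summary); the names `RUNS` and `summarize` are illustrative, and the caps are the 16 MB and 600 s limits mentioned above.

```python
# Numbers copied from the PR text; nothing here is newly measured.
RUNS = {
    "qat_on": {"steps": 6719, "bpb": 1.1376},   # 8xH100 SXM, QAT enabled
    "qat_off": {"steps": 7062, "bpb": 1.1352},  # 8xH100 SXM, QAT disabled
}
SOTA_BPB = 1.1194
ARTIFACT_MB, ARTIFACT_CAP_MB = 15.44, 16.0
EVAL_S, EVAL_CAP_S = 530, 600


def summarize():
    """Return the gaps between the two 8xH100 runs and against the limits."""
    on, off = RUNS["qat_on"], RUNS["qat_off"]
    return {
        "steps_lost_with_qat": off["steps"] - on["steps"],
        "bpb_penalty_with_qat": round(on["bpb"] - off["bpb"], 4),
        "bpb_gap_to_sota": round(off["bpb"] - SOTA_BPB, 4),
        "artifact_headroom_mb": round(ARTIFACT_CAP_MB - ARTIFACT_MB, 2),
        "eval_headroom_s": EVAL_CAP_S - EVAL_S,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value}")
```

The step difference computed here is the measured gap between the two listed runs. The 460-step figure in Key Findings is quoted from the author and is not rederived.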

🤖 Generated with Claude Code

Non-record experimental submission exploring:
- STE int6 fake-quantization fix for bank parameters (QAT was dead code)
- XSA expanded to last 5 layers
- Label smoothing 0.05
- TTT LR/momentum tuning

Result: 1.1352 BPB (worse than SOTA 1.1194). Key findings:
- QAT recompilation too expensive (~50s + 5ms/step overhead)
- Label smoothing counterproductive on compute-limited model
- XSA5 neutral-to-negative vs XSA4
- Original TTT hyperparameters were better

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>