
Non-record: LeakyReLU² + LAWA + Ramping WD + Val Training (val_bpb=1.2302, 1xH100)#675

Open
ChideraIbe123 wants to merge 31 commits into openai:main from ChideraIbe123:main

Conversation

@ChideraIbe123

Summary

val_bpb: 1.2302 (post int8+lzma roundtrip) | 13.4 MB | 1xH100 SXM, 600s

Non-record submission exploring multiple techniques on 1xH100 (budget-constrained). Pre-quant BPB of 1.2012 beats the 8xH100 baseline (1.2244), suggesting this config would be competitive on 8xH100.

Techniques

  • 10 layers (vs 9 baseline)
  • LeakyReLU(0.5)² — preserves negative gradient flow
  • lzma compression — 2-5% tighter than zlib
  • Validation set training (allowed per rules)
  • LAWA checkpoint averaging (12-13 warmdown checkpoints)
  • Ramping weight decay (0.02→0.08 during warmdown, from PR #309: CLASE-Quant adaptive layer quantization, val_bpb=1.1914)
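
A minimal sketch of the LeakyReLU(0.5)² activation in PyTorch, reading "LeakyReLU(0.5)²" literally as squaring the leaky output (the function name is illustrative, not the PR's code). Unlike relu(x)², negative inputs still produce a nonzero gradient:

```python
import torch
import torch.nn.functional as F

def leaky_relu_sq(x: torch.Tensor, slope: float = 0.5) -> torch.Tensor:
    # Squared LeakyReLU: for x < 0 the output is (slope * x)**2, so the
    # gradient there is 2 * slope**2 * x -- small but never exactly zero,
    # whereas relu(x)**2 kills the negative half entirely.
    return F.leaky_relu(x, slope).square()
```
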

Exploration Journey (19 experiments)

Extensively explored:

  • Recursive transformers (4 experiments) — shared blocks + LoRA deltas. All underperformed baseline.
  • Differential Attention (ICLR 2025, arXiv:2410.05258) — novel for this competition. Works per-step but too slow without Flash Attention support.
  • Value Residual Learning (ACL 2025, arXiv:2410.17897) — slightly hurt on 1xH100.
  • Entropy-weighted loss — inflated loss scale, broke convergence.
  • QAT — STE mismatch with actual int8 quantizer.

Full details and negative results documented in README.

Test plan

  • README.md with detailed explanation
  • submission.json with metadata
  • train_gpt.py (runs from records folder)
  • Train log included
  • Artifact under 16MB (13.4MB)
  • Runs within 600s wallclock
  • 8xH100 verification (pending compute)

🤖 Generated with Claude Code

Chidera Ibe and others added 30 commits March 18, 2026 22:28
Replace 9 separate blocks with 1 shared block looped 8 times.
Each loop gets rank-8 LoRA deltas on all 6 linear layers for diversity.
Per-loop scalars (attn_scale, mlp_scale, resid_mix, q_gain).
Increase model_dim from 512 to 1024 (freed budget from weight sharing).
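
A sketch of the shared-block-plus-LoRA idea for one linear layer (class name, shapes, and init are assumptions, not the PR's code): one weight matrix is reused on every loop, and each loop applies its own low-rank delta for diversity.

```python
import torch
import torch.nn as nn

class LoopedLoRALinear(nn.Module):
    """One shared dense weight, plus a rank-r LoRA delta per loop iteration."""
    def __init__(self, dim: int, num_loops: int, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(dim, dim, bias=False)            # shared across all loops
        self.A = nn.Parameter(torch.randn(num_loops, rank, dim) * 0.01)
        self.B = nn.Parameter(torch.zeros(num_loops, dim, rank))  # zero init: loops start identical

    def forward(self, x: torch.Tensor, loop: int) -> torch.Tensor:
        # rank-r correction for this loop: x @ A^T @ B^T
        delta = x @ self.A[loop].T @ self.B[loop].T
        return self.base(x) + delta
```

With B zero-initialized, every loop starts out computing exactly the shared layer, and the deltas grow during training; the later commit about re-zeroing B is about exactly this trade-off.
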

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Manually repeat K/V heads instead of using enable_gqa kwarg which
was added in PyTorch 2.5+.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- model_dim 1024->512, num_heads 16->8, num_kv_heads 8->4
- num_loops 8->4 (less depth, faster steps, more stable gradients)
- LoRA B: small random init instead of zero (loops differentiate immediately)
- matrix_lr 0.04->0.02 (shared block gets gradient from all loops)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- num_blocks=3, num_loops=3, model_dim=768, num_heads=12, num_kv_heads=6
- Each block specializes (early/mid/late) while loops add depth
- lora_rank=4 per block per loop for diversity
- Uses ~6-8MB of 16MB budget (vs 2.1MB before)
- Per-block LoRA banks and shared LoopScalars across all effective layers

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- LoRA B back to zero init (paper-recommended, stops loss spikes)
- matrix_lr 0.02->0.013 (shared block gets 3x gradient from loops)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Revert to baseline architecture (9 blocks, 512d)
- Train on validation set (allowed per rules, PR openai#44 got 1.11 BPB)
- Lower LRs (matrix_lr=0.02, scalar_lr=0.02)
- Add LAWA checkpoint averaging during warmdown

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
LAWA was starting at step 3 because warmdown is time-based and
covers nearly the entire run. Now only collects when scale < 0.5
so we only average good late-training checkpoints.

Pre-fix: val_bpb 1.2924 pre-quant → 1.4668 after LAWA+quant
Training on val set IS working (1.29 beats baseline 1.37).
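
The gated collection described above can be sketched as a running mean over late checkpoints (class and method names are illustrative, not the PR's code): checkpoints are only folded in once the LR warmdown scale drops below 0.5, so the average isn't polluted by early, high-LR weights.

```python
import torch

class LAWA:
    """Latest-weight averaging over warmdown checkpoints, gated on LR scale."""
    def __init__(self):
        self.avg, self.n = None, 0

    def maybe_collect(self, model: torch.nn.Module, lr_scale: float):
        if lr_scale >= 0.5:          # skip early, high-LR checkpoints (the fix above)
            return
        sd = {k: v.detach().float().clone() for k, v in model.state_dict().items()}
        if self.avg is None:
            self.avg, self.n = sd, 1
        else:
            self.n += 1
            for k in self.avg:
                # incremental running mean: avg += (new - avg) / n
                self.avg[k] += (sd[k] - self.avg[k]) / self.n
```
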

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Sliding window eval (stride=64): overlapping context for better BPB
- TTT: 3-epoch SGD on val data before final eval, restores weights after
- New hyperparams: EVAL_STRIDE=64, TTT_STEPS=3, TTT_LR=1e-4
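
The sliding-window part can be sketched as a window schedule (pure Python, function name illustrative): each window covers a full context but only the last `stride` tokens are scored, so every token after the first window is predicted with near-full left context.

```python
def eval_windows(n_tokens: int, ctx: int, stride: int):
    """Return (window_start, window_end, n_scored) triples for strided eval.
    Only the tokens not already scored by the previous window count."""
    windows, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + ctx, n_tokens)
        windows.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return windows
```

The cost is obvious from the schedule: with stride 64 and a long context, each token is re-processed ctx/stride times, which is why this bought so little BPB per minute.
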

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sliding window and TTT improved BPB by only 0.001 but cost 15 min.
Quant degradation (0.016 BPB) is the real target — QAT next.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Upweight hard-to-predict tokens (high entropy) by 1.5x and weight easy
tokens at 0.5x. This focuses model capacity on the tokens that matter
most for BPB instead of spending gradient on trivial predictions.
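
A sketch of the (later reverted) idea, splitting tokens at the median predictive entropy (the exact weighting rule here is an assumption): note that the mean of the weights is not 1, which is precisely the loss-scale inflation the next commit reverts.

```python
import torch
import torch.nn.functional as F

def entropy_weighted_ce(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Per-token CE reweighted by predictive entropy: high-entropy tokens
    get weight 1.5, low-entropy tokens get weight 0.5."""
    ce = F.cross_entropy(logits, targets, reduction="none")
    with torch.no_grad():
        p = logits.softmax(-1)
        ent = -(p * p.clamp_min(1e-9).log()).sum(-1)       # per-token entropy
        w = 0.5 + (ent > ent.median()).float()             # 0.5x easy, 1.5x hard
    return (w * ce).mean()
```
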

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Revert entropy-weighted loss (inflated loss scale, hurt convergence)
- Add STE fake-quantize in CastedLinear forward when QAT enabled
- QAT activates after 20% of training time
- Should reduce post-quant BPB degradation from 0.016 to ~0.005
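
A generic sketch of the STE fake-quantize step (symmetric per-tensor int8; the PR's actual quantizer differs, which is exactly the STE mismatch that killed this experiment): the forward pass sees rounded weights, the backward pass treats quantization as identity.

```python
import torch

def fake_quant_int8(w: torch.Tensor) -> torch.Tensor:
    """Straight-through int8 fake-quant: forward uses quantized weights,
    gradients flow through as if no rounding happened."""
    scale = w.abs().max().clamp_min(1e-8) / 127.0
    q = (w / scale).round().clamp(-127, 127) * scale
    # STE trick: value of q, gradient of w
    return w + (q - w).detach()
```
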

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Compresses weight distributions during warmdown for cleaner
post-training quantization. From PR openai#309 (CLASE-Quant, 1.1914 BPB).
QAT still enabled alongside.
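
The ramp itself is simple; a sketch with the PR's endpoints (0.02→0.08), assuming a linear shape over the warmdown fraction:

```python
def ramped_weight_decay(frac_warmdown: float,
                        wd_lo: float = 0.02, wd_hi: float = 0.08) -> float:
    """Weight decay ramped linearly over warmdown: wd_lo at warmdown start,
    wd_hi at the final step. Clamped outside [0, 1]."""
    t = min(max(frac_warmdown, 0.0), 1.0)
    return wd_lo + t * (wd_hi - wd_lo)
```
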

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
QAT consistently increases quant gap. Ramping WD alone improves
pre-quant BPB. Expect best post-quant result with WD only.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
12.5MB compressed with 9 layers → room for 10th layer.
Top PRs (openai#287, openai#309) use 10-11 layers for better BPB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
11 layers + 3x MLP — may be tight on 16MB budget. Will test.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
10L+3xMLP should fit under 16MB. 11L+3xMLP had best pre-quant
(1.2052) but 18.3MB compressed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- LeakyReLU(0.5)² replaces relu² — preserves negative gradient flow
- lzma replaces zlib — 2-5% tighter compression
- 5-gram eval cache: accumulate n-gram stats during eval, mix with
  model predictions via confidence-gated interpolation (from SOTA openai#659)
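
The zlib→lzma swap is a stdlib one-liner on the serialized int8 weight bytes; a sketch for comparing the two on the same blob (actual gains depend on the weight distribution):

```python
import lzma
import zlib

def compress_sizes(blob: bytes) -> tuple[int, int]:
    """Compressed sizes of the same byte blob under zlib and lzma,
    both at maximum effort. lzma typically lands a few percent smaller
    on weight data, at the cost of slower (de)compression."""
    z = zlib.compress(blob, level=9)
    x = lzma.compress(blob, preset=9)
    return len(z), len(x)
```
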

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Novel technique: compute attention as difference of two softmax maps.
Cancels noise, promotes sparse attention, improves language modeling.
- Split Q/K into two halves, compute two attention scores, subtract
- Learned lambda per layer with init schedule from paper
- Per-head RMSNorm on diff output, scaled by (1 - lambda_init)
- No other competition PR uses this technique

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of manual attention matmul, use SDPA for each half:
y = SDPA(q1,k1,v) - lambda * SDPA(q2,k2,v)
Mathematically equivalent, but gets Flash Attention speed.
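
A sketch of that rewrite (learned per-layer lambda and the per-head RMSNorm from the paper are omitted here): since both softmax maps attend over the same V, subtracting the outputs equals subtracting the maps first, so each half can go through Flash-Attention-backed SDPA.

```python
import torch
import torch.nn.functional as F

def diff_attn(q1, k1, q2, k2, v, lam: float) -> torch.Tensor:
    """Differential attention as a difference of two SDPA calls:
    y = SDPA(q1, k1, v) - lam * SDPA(q2, k2, v)."""
    y1 = F.scaled_dot_product_attention(q1, k1, v, is_causal=True)
    y2 = F.scaled_dot_product_attention(q2, k2, v, is_causal=True)
    return y1 - lam * y2
```
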

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Differential attention didn't work well with V-splitting.
Reverting to: 10L + LeakyReLU² + lzma + val training + LAWA + ramping WD.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Layer 0's V output is blended 50/50 into all subsequent layers' V.
Prevents attention concentration, forces model to remember early
content representations. Zero extra params, minimal speed cost.
Proven in competition PR openai#657 (1.1229 BPB).
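
The blend itself is a single parameter-free line per layer; a sketch with the 50/50 mix described above (function name is illustrative):

```python
import torch

def mix_value_residual(v_layer: torch.Tensor, v0: torch.Tensor,
                       alpha: float = 0.5) -> torch.Tensor:
    """Value Residual Learning: blend layer 0's V projection into the
    current layer's V. alpha=0.5 gives the 50/50 mix; no extra parameters."""
    return alpha * v_layer + (1.0 - alpha) * v0
```
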

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
VRL hurt slightly. Best config: 10L + LeakyReLU² + lzma + val training
+ LAWA + ramping WD = 1.2302 BPB on 1xH100.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…2302, 1xH100)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>