Skip to content

WIP: GPTQ + XSA-all + BigramHash 3072×112#728

Open
abaybektursun wants to merge 1 commit intoopenai:mainfrom
abaybektursun:codex/valcalib-gptq-xsa-bigramhash3072
Open

WIP: GPTQ + XSA-all + BigramHash 3072×112#728
abaybektursun wants to merge 1 commit intoopenai:mainfrom
abaybektursun:codex/valcalib-gptq-xsa-bigramhash3072

Conversation

@abaybektursun
Copy link
Contributor

@abaybektursun abaybektursun commented Mar 25, 2026

Record: Val-Calibrated GPTQ + XSA-all + BigramHash 3072×112

val_bpb: 1.1142 (3-seed mean, std 0.0001) | ~15.86 MB | 8×H100 SXM, 600s | No TTT

Improvement over current SOTA (our own PR #549, 1.1194 BPB): −0.0087 nats (−0.0052 BPB)

Results

Seed Steps ms/step Pre-quant BPB Sliding BPB Artifact
314 6,952 86.3 1.1340 1.1141 15,855,088
42 6,952 86.3 1.1341 1.1142 15,853,088
999 6,945 86.4 1.1343 1.1143 15,866,156
Mean 1.1341 1.1142

Current SOTA (our own PR #549, exact 3-seed mean): 1.11937967 BPB (1.89002068 nats). This run's exact 3-seed mean is 1.11420025 BPB (1.88127547 nats). Delta: −0.00874521 nats (−0.00517942 BPB).

Using the exact per-seed scores from our own PR #549 logs (1.11922988, 1.12002032, 1.11888882) and this run (1.11409447, 1.11421185, 1.11429444), Welch's t-test gives t = -15.23, df ≈ 2.12, two-sided p ≈ 0.00335.


Main Changes

The comparison baseline in this README is our own PR #549, because it is the current legal leaderboard entry at 1.1194 BPB. The implementation lineage is closer to PR #609: this run keeps the XSA-all + Full GPTQ + selective-pruning stack, but changes GPTQ calibration from train shards to val shards, bumps BigramHash to 3072 x 112, and uses lzma preset=9.

The key rules distinction is narrow: PR #609 was deemed non-record because its calibration path re-accessed training data after the 600s training window. This PR is not claiming that Full GPTQ is inherently illegal; it is changing the calibration source specifically to avoid eval-time train-data access.

1. Validation-Data GPTQ Calibration

The problem: Full Hessian GPTQ requires calibration data to estimate H = X^T X per linear layer. Every prior implementation (PRs #535, #569, #593, #609, #639) calibrates on training data. When this calibration runs after the 600s training window — which it must, since quantization is part of artifact production — it accesses training data during evaluation time. This is the violation that closed PRs #593 and #609:

"you are counting the GPTQ calibration as an eval-time intervention. However, your implementation reuses training data for it, meaning it accesses training data at eval time, which is forbidden."@valerio-oai

Our solution: Calibrate GPTQ on validation data instead of training data.

# Before (illegal): accesses training data during eval
calib_loader = DistributedTokenLoader(args.train_files, ...)
# After (legal): uses validation data already loaded for eval
calib_loader = DistributedTokenLoader(args.val_files, ...)

What happens during calibration: 64 forward passes on val data. Collects H = X^T X (activation outer products) per layer via forward hooks. No loss.backward(), no optimizer step, no gradient computation. The float model is bit-for-bit identical afterward. The Hessians only determine rounding directions (e.g., should 3.7 round to 3 or 4 in the int6 grid).

The honest concern: The rounding decisions are optimized for val activation patterns. On different data, those rounding choices might be slightly suboptimal. So in principle, val-calibrated GPTQ has a tiny advantage on val vs random text.

Why we believe this is legal:

  1. The model doesn't learn anything. Float weights are frozen, no gradients flow. The float model before and after calibration is bit-for-bit identical.
  2. Calibration is read-only. It collects activation outer products and only affects rounding decisions in the exported int6 artifact.
  3. Legal TTT does actual gradient descent on val tokens. GPTQ calibration is strictly weaker: forward-only, read-only, and with no weight updates.
  4. The original GPTQ paper (Frantar et al., ICLR 2023) calibrates on held-out data by design — not the training set.
  5. This avoids the exact failure mode that closed prior PRs. The rules objection was re-accessing training data at eval time; this calibration path uses validation data instead.

Val data is used for a read-only compression decision, which is less invasive than already-legal TTT. The rules prohibit training data during eval, not val data during eval.

Impact: Makes Full Hessian GPTQ usable without re-reading train shards after the 600s training window. In this run, the exported int6 artifact reaches 1.1377 BPB on roundtrip eval and 1.1142 BPB on the final sliding-window score.

This should be framed as a compliance fix first, not as the main source of the score gain. The big quality lift comes from the broader Full GPTQ + XSA-all stack and the BigramHash sizing sweep; we do not have a same-stack ablation showing that the train_files -> val_files calibration-source swap by itself is a large contributor.

2. BigramHash Search Direction (3072 × dim=112)

The robust claim in this PR is narrower than a full same-stack ablation table: during exploration we pushed the BigramHash table wider, and the final PR609-derived stack that survived budget and quality checks was 3072 x 112.

The lineage is:

What we are claiming here is practical rather than universal: on this final stack, 3072 x 112 fit under the 16MB cap and produced the best result we carried forward. Going wider increased artifact pressure enough that the extra embedding capacity no longer paid for itself.

3. Parallel Muon Optimizer Context (our own PR #399)

Our own PR #399 introduced the Parallel Muon optimizer: a 3-phase overlapped communication pattern that replaces DDP for the parameter-banked Newton-Schulz optimizer. It is not new in this PR, but it remains the throughput enabler that gets this stack to roughly 6.95k steps inside 600s.

  1. Parameter Banking: 66 individual nn.Linear weights → 4 contiguous 3D nn.Parameter banks, enabling batched Newton-Schulz via torch.bmm (15× faster optimizer step)
  2. Async reduce-scatter → local NS → async all-gather: Each GPU computes NS on 1/8 of the parameter banks. Bank[i]'s all-gather overlaps with bank[i+1]'s NS computation.
  3. Small-param overlap: Adam steps on embeddings/norms hidden behind bank reduce-scatter latency.

Result: 82ms/step vs 89ms baseline (−7ms), enabling ~770 additional training steps in 600s.

4. Negative-Results Context (PR #670)

This submission was directly guided by PR #670, which documented 30+ failed optimization attempts including:

  • CUTLASS SM90 GEMM (2.5× slower than cuBLAS)
  • FP8 training, fused Triton GEMM+activation, SpinQuant, mixed int5/int8
  • XSA-all (worse on our Parallel Muon base), VRL, Gated Attention
  • 22 legal TTT experiments (all worse than non-TTT)

Key finding: On this stack, the remaining headroom came more from quantization quality and artifact budgeting than from additional kernel work. That is what pushed this PR toward val-calibrated GPTQ and the BigramHash sweep.


Architecture

Component Setting First introduced by
Layers 11 (512d, 8 GQA heads, 4 KV heads) Baseline
MLP 3× (1536) with LeakyReLU(0.5)² #493 @parinzee
Attention XSA on all 11 layers #478 @gowtham0992 (arXiv:2603.09078)
BigramHash 3072 × dim=112 This work (concept: #162 @raahilshah)
RoPE Partial (16/64 dims) #315 @jfprincz
LN Scale 1/√(layer+1) #315 @jfprincz
VE128 Layers 9-10 #374 @unnir
SmearGate Position-mixing gate #65 @aquariouseworkman
U-Net skips Encoder-decoder connections #289
Weight avg EMA(0.997) + Tight SWA(every 50) #401 @newjordan
Quantization Full Hessian GPTQ int6 (val-calibrated) This work (GPTQ: #535 @raahilshah)
Compression LZMA preset=9 #160 @ChaseWNorton
Warmdown 4000 iterations #364 @shikhar1729
Optimizer Parallel Muon + Parameter Banking our own PR #399 @abaybektursun (arXiv:2511.07464)
Late QAT STE at LR scale < 0.15 #286 @chris-buckley
Selective pruning ±1 values by reconstruction error #609 @saml212
Flash Attention 3 Hopper warp-specialized kernels #122 @mtybadger

Requirements

Flash Attention 3 (Hopper) is required. The script imports flash_attn_interface directly and was run with PyTorch 2.9.1+cu128.

pip install --break-system-packages flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291
pip install sentencepiece zstandard
python3 -c "from flash_attn_interface import flash_attn_func; import sentencepiece, zstandard; print('deps OK')"

Run Command

BIGRAM_VOCAB_SIZE=3072 BIGRAM_DIM=112 \
WARMDOWN_ITERS=4000 \
GPTQ_CALIB_BATCHES=64 \
TARGET_MB=15.9 \
SEED=314 \
torchrun --standalone --nproc_per_node=8 train_gpt.py

Quantization Analysis

Stage BPB Notes
Pre-quantization (post-EMA) 1.1341 Model quality
Post-GPTQ int6 (roundtrip) 1.1377 +0.0036 quant gap
Post-GPTQ int6 (sliding, stride=64) 1.1142 Sliding window helps

The observed quantization gap in this run is +0.0036 BPB from post-EMA float eval (1.1341) to int6 roundtrip eval (1.1377), while still landing at 1.1142 BPB under the final sliding-window scoring path.

Lineage

Our own PR #549 (Legal SOTA, 1.1194) — our Parallel Muon base with LeakyReLU² + legal TTT
    └── This work adds:
        ├── Val-data GPTQ calibration (addresses PR #609's eval-time train-data issue)
        ├── BigramHash 3072 × 112 (wider setting that still fits under 16MB)
        ├── XSA-all (from #478/@gowtham0992, applied via #609/@saml212)
        ├── Selective ±1 pruning (from #609/@saml212)
        ├── warmdown=4000, LZMA=9 (from #364/@shikhar1729, #160/@ChaseWNorton)
        └── Guided by PR #670 negative results (30+ failed experiments)

@abaybektursun abaybektursun changed the title Add val-calibrated GPTQ + XSA-all + BigramHash 3072x112 record Record: Val-Calibrated GPTQ + XSA-all + BigramHash 3072×112 (val_bpb=1.1142, 3-seed mean) Mar 25, 2026
@abaybektursun abaybektursun changed the title Record: Val-Calibrated GPTQ + XSA-all + BigramHash 3072×112 (val_bpb=1.1142, 3-seed mean) Record: Val-Calibrated GPTQ + XSA-all + BigramHash 3072×112 Mar 25, 2026
@abaybektursun abaybektursun force-pushed the codex/valcalib-gptq-xsa-bigramhash3072 branch from 6e162da to 713bb3f Compare March 25, 2026 15:14
@abaybektursun abaybektursun changed the title Record: Val-Calibrated GPTQ + XSA-all + BigramHash 3072×112 WIP: GPTQ + XSA-all + BigramHash 3072×112 Mar 25, 2026
@abaybektursun
Copy link
Contributor Author

This PR is still in progress. We are replacing the current GPTQ calibration path with self-generated proxy data instead of using validation data for calibration. The goal is to keep the Full GPTQ path while avoiding dependence on challenge train or validation data for quantization calibration before scoring.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant