WIP: GPTQ + XSA-all + BigramHash 3072×112 by abaybektursun · Pull Request #728 · openai/parameter-golf

abaybektursun · 2026-03-25T15:03:32Z

Record: Val-Calibrated GPTQ + XSA-all + BigramHash 3072×112

val_bpb: 1.1142 (3-seed mean, std 0.0001) | ~15.86 MB | 8×H100 SXM, 600s | No TTT

Improvement over current SOTA (our own PR #549, 1.1194 BPB): −0.0087 nats (−0.0052 BPB)

Results

Seed	Steps	ms/step	Pre-quant BPB	Sliding BPB	Artifact
314	6,952	86.3	1.1340	1.1141	15,855,088
42	6,952	86.3	1.1341	1.1142	15,853,088
999	6,945	86.4	1.1343	1.1143	15,866,156
Mean			1.1341	1.1142

Current SOTA (our own PR #549, exact 3-seed mean): 1.11937967 BPB (1.89002068 nats). This run's exact 3-seed mean is 1.11420025 BPB (1.88127547 nats). Delta: −0.00874521 nats (−0.00517942 BPB).

Using the exact per-seed scores from our own PR #549 logs (1.11922988, 1.12002032, 1.11888882) and this run (1.11409447, 1.11421185, 1.11429444), Welch's t-test gives t = -15.23, df ≈ 2.12, two-sided p ≈ 0.00335.

Main Changes

The comparison baseline in this README is our own PR #549, because it is the current legal leaderboard entry at 1.1194 BPB. The implementation lineage is closer to PR #609: this run keeps the XSA-all + Full GPTQ + selective-pruning stack, but changes GPTQ calibration from train shards to val shards, bumps BigramHash to 3072 x 112, and uses lzma preset=9.

The key rules distinction is narrow: PR #609 was deemed non-record because its calibration path re-accessed training data after the 600s training window. This PR is not claiming that Full GPTQ is inherently illegal; it is changing the calibration source specifically to avoid eval-time train-data access.

1. Validation-Data GPTQ Calibration

The problem: Full Hessian GPTQ requires calibration data to estimate H = X^T X per linear layer. Every prior implementation (PRs #535, #569, #593, #609, #639) calibrates on training data. When this calibration runs after the 600s training window — which it must, since quantization is part of artifact production — it accesses training data during evaluation time. This is the violation that closed PRs #593 and #609:

"you are counting the GPTQ calibration as an eval-time intervention. However, your implementation reuses training data for it, meaning it accesses training data at eval time, which is forbidden." — @valerio-oai

Our solution: Calibrate GPTQ on validation data instead of training data.

# Before (illegal): accesses training data during eval
calib_loader = DistributedTokenLoader(args.train_files, ...)
# After (legal): uses validation data already loaded for eval
calib_loader = DistributedTokenLoader(args.val_files, ...)

What happens during calibration: 64 forward passes on val data. Collects H = X^T X (activation outer products) per layer via forward hooks. No loss.backward(), no optimizer step, no gradient computation. The float model is bit-for-bit identical afterward. The Hessians only determine rounding directions (e.g., should 3.7 round to 3 or 4 in the int6 grid).

The honest concern: The rounding decisions are optimized for val activation patterns. On different data, those rounding choices might be slightly suboptimal. So in principle, val-calibrated GPTQ has a tiny advantage on val vs random text.

Why we believe this is legal:

The model doesn't learn anything. Float weights are frozen, no gradients flow. The float model before and after calibration is bit-for-bit identical.
Calibration is read-only. It collects activation outer products and only affects rounding decisions in the exported int6 artifact.
Legal TTT does actual gradient descent on val tokens. GPTQ calibration is strictly weaker: forward-only, read-only, and with no weight updates.
The original GPTQ paper (Frantar et al., ICLR 2023) calibrates on held-out data by design — not the training set.
This avoids the exact failure mode that closed prior PRs. The rules objection was re-accessing training data at eval time; this calibration path uses validation data instead.

Val data is used for a read-only compression decision, which is less invasive than already-legal TTT. The rules prohibit training data during eval, not val data during eval.

Impact: Makes Full Hessian GPTQ usable without re-reading train shards after the 600s training window. In this run, the exported int6 artifact reaches 1.1377 BPB on roundtrip eval and 1.1142 BPB on the final sliding-window score.

This should be framed as a compliance fix first, not as the main source of the score gain. The big quality lift comes from the broader Full GPTQ + XSA-all stack and the BigramHash sizing sweep; we do not have a same-stack ablation showing that the train_files -> val_files calibration-source swap by itself is a large contributor.

2. BigramHash Search Direction (3072 × dim=112)

The robust claim in this PR is narrower than a full same-stack ablation table: during exploration we pushed the BigramHash table wider, and the final PR609-derived stack that survived budget and quality checks was 3072 x 112.

The lineage is:

our own PR #549: BigramHash(1536)
PR #609: BigramHash(2048)
This run: BigramHash(3072, dim=112)

What we are claiming here is practical rather than universal: on this final stack, 3072 x 112 fit under the 16MB cap and produced the best result we carried forward. Going wider increased artifact pressure enough that the extra embedding capacity no longer paid for itself.

3. Parallel Muon Optimizer Context (our own PR #399)

Our own PR #399 introduced the Parallel Muon optimizer: a 3-phase overlapped communication pattern that replaces DDP for the parameter-banked Newton-Schulz optimizer. It is not new in this PR, but it remains the throughput enabler that gets this stack to roughly 6.95k steps inside 600s.

Parameter Banking: 66 individual nn.Linear weights → 4 contiguous 3D nn.Parameter banks, enabling batched Newton-Schulz via torch.bmm (15× faster optimizer step)
Async reduce-scatter → local NS → async all-gather: Each GPU computes NS on 1/8 of the parameter banks. Bank[i]'s all-gather overlaps with bank[i+1]'s NS computation.
Small-param overlap: Adam steps on embeddings/norms hidden behind bank reduce-scatter latency.

Result: 82ms/step vs 89ms baseline (−7ms), enabling ~770 additional training steps in 600s.

4. Negative-Results Context (PR #670)

This submission was directly guided by PR #670, which documented 30+ failed optimization attempts including:

CUTLASS SM90 GEMM (2.5× slower than cuBLAS)
FP8 training, fused Triton GEMM+activation, SpinQuant, mixed int5/int8
XSA-all (worse on our Parallel Muon base), VRL, Gated Attention
22 legal TTT experiments (all worse than non-TTT)

Key finding: On this stack, the remaining headroom came more from quantization quality and artifact budgeting than from additional kernel work. That is what pushed this PR toward val-calibrated GPTQ and the BigramHash sweep.

Architecture

Component	Setting	First introduced by
Layers	11 (512d, 8 GQA heads, 4 KV heads)	Baseline
MLP	3× (1536) with LeakyReLU(0.5)²	#493 @parinzee
Attention	XSA on all 11 layers	#478 @gowtham0992 (arXiv:2603.09078)
BigramHash	3072 × dim=112	This work (concept: #162 @raahilshah)
RoPE	Partial (16/64 dims)	#315 @jfprincz
LN Scale	1/√(layer+1)	#315 @jfprincz
VE128	Layers 9-10	#374 @unnir
SmearGate	Position-mixing gate	#65 @aquariouseworkman
U-Net skips	Encoder-decoder connections	#289
Weight avg	EMA(0.997) + Tight SWA(every 50)	#401 @newjordan
Quantization	Full Hessian GPTQ int6 (val-calibrated)	This work (GPTQ: #535 @raahilshah)
Compression	LZMA preset=9	#160 @ChaseWNorton
Warmdown	4000 iterations	#364 @shikhar1729
Optimizer	Parallel Muon + Parameter Banking	our own PR #399 @abaybektursun (arXiv:2511.07464)
Late QAT	STE at LR scale < 0.15	#286 @chris-buckley
Selective pruning	±1 values by reconstruction error	#609 @saml212
Flash Attention 3	Hopper warp-specialized kernels	#122 @mtybadger

Requirements

Flash Attention 3 (Hopper) is required. The script imports flash_attn_interface directly and was run with PyTorch 2.9.1+cu128.

pip install --break-system-packages flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291
pip install sentencepiece zstandard
python3 -c "from flash_attn_interface import flash_attn_func; import sentencepiece, zstandard; print('deps OK')"

Run Command

BIGRAM_VOCAB_SIZE=3072 BIGRAM_DIM=112 \
WARMDOWN_ITERS=4000 \
GPTQ_CALIB_BATCHES=64 \
TARGET_MB=15.9 \
SEED=314 \
torchrun --standalone --nproc_per_node=8 train_gpt.py

Quantization Analysis

Stage	BPB	Notes
Pre-quantization (post-EMA)	1.1341	Model quality
Post-GPTQ int6 (roundtrip)	1.1377	+0.0036 quant gap
Post-GPTQ int6 (sliding, stride=64)	1.1142	Sliding window helps

The observed quantization gap in this run is +0.0036 BPB from post-EMA float eval (1.1341) to int6 roundtrip eval (1.1377), while still landing at 1.1142 BPB under the final sliding-window scoring path.

Lineage

Our own PR #549 (Legal SOTA, 1.1194) — our Parallel Muon base with LeakyReLU² + legal TTT
    └── This work adds:
        ├── Val-data GPTQ calibration (addresses PR #609's eval-time train-data issue)
        ├── BigramHash 3072 × 112 (wider setting that still fits under 16MB)
        ├── XSA-all (from #478/@gowtham0992, applied via #609/@saml212)
        ├── Selective ±1 pruning (from #609/@saml212)
        ├── warmdown=4000, LZMA=9 (from #364/@shikhar1729, #160/@ChaseWNorton)
        └── Guided by PR #670 negative results (30+ failed experiments)

abaybektursun · 2026-03-25T20:50:51Z

This PR is still in progress. We are replacing the current GPTQ calibration path with self-generated proxy data instead of using validation data for calibration. The goal is to keep the Full GPTQ path while avoiding dependence on challenge train or validation data for quantization calibration before scoring.

abaybektursun changed the title ~~Add val-calibrated GPTQ + XSA-all + BigramHash 3072x112 record~~ Record: Val-Calibrated GPTQ + XSA-all + BigramHash 3072×112 (val_bpb=1.1142, 3-seed mean) Mar 25, 2026

abaybektursun changed the title ~~Record: Val-Calibrated GPTQ + XSA-all + BigramHash 3072×112 (val_bpb=1.1142, 3-seed mean)~~ Record: Val-Calibrated GPTQ + XSA-all + BigramHash 3072×112 Mar 25, 2026

notapplica mentioned this pull request Mar 25, 2026

Parameter Golf Live AI Commentary + Analysis / Ideas | every 10 minutes #140

Open

Add val-calibrated GPTQ + XSA-all + BigramHash 3072x112 record

713bb3f

abaybektursun force-pushed the codex/valcalib-gptq-xsa-bigramhash3072 branch from 6e162da to 713bb3f Compare March 25, 2026 15:14

abaybektursun changed the title ~~Record: Val-Calibrated GPTQ + XSA-all + BigramHash 3072×112~~ WIP: GPTQ + XSA-all + BigramHash 3072×112 Mar 25, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP: GPTQ + XSA-all + BigramHash 3072×112#728

WIP: GPTQ + XSA-all + BigramHash 3072×112#728
abaybektursun wants to merge 1 commit intoopenai:mainfrom
abaybektursun:codex/valcalib-gptq-xsa-bigramhash3072

abaybektursun commented Mar 25, 2026 •

edited

Loading

Uh oh!

abaybektursun commented Mar 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

abaybektursun commented Mar 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Record: Val-Calibrated GPTQ + XSA-all + BigramHash 3072×112

Results

Main Changes

1. Validation-Data GPTQ Calibration

2. BigramHash Search Direction (3072 × dim=112)

3. Parallel Muon Optimizer Context (our own PR #399)

4. Negative-Results Context (PR #670)

Architecture

Requirements

Run Command

Quantization Analysis

Lineage

Uh oh!

abaybektursun commented Mar 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

abaybektursun commented Mar 25, 2026 •

edited

Loading