Record: CROWN-Q + Full GPTQ + SWA/EMA Blend — val_bpb 1.1186 (3-seed mean) #693
# CROWN-Q + Full GPTQ + SWA/EMA Blend

## Summary

- **CROWN-Q**: Curvature-weighted quantization variance penalty applied during warmdown. Encourages weights to settle in flat minima where int6 quantization causes less damage. Penalty: `lambda * mean(h_j) * delta_j^2 / 12` per row, where `h_j = w^2` (curvature proxy) and `delta_j = row_max / 15` (CROWN-Q step size). Note: the GPTQ/QAT quantizer uses clip_range=31; CROWN-Q intentionally uses a larger step size (`row_max/15`) to over-penalize and push weights further into flat basins.
- **Full Cholesky GPTQ**: Hessian-aware quantization with act-order column permutation, block_size=128, and 256-sample calibration from training data. GPTQ runs after the 585 s training phase as part of model export.
- **SWA/EMA 50/50 blend**: Stochastic Weight Averaging (every 50 steps during warmdown) blended 50/50 with EMA (decay=0.997).
- **Architecture**: 11L, 512d, GQA 8H/4KV, MLP 3x LeakyReLU(0.5)^2, XSA on last 4 layers (7-10), VRL, BigramHash 3072, partial RoPE 16/64.
- **Eval**: Sliding-window eval with stride=64. No test-time training.
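The SWA/EMA blend above can be sketched as follows. This is a scalar toy, assuming a plain convex combination of parameter dictionaries; the helper names are hypothetical and the real code blends tensors.

```python
def ema_update(ema, w, decay=0.997):
    # One EMA step (decay value taken from the README);
    # hypothetical helper, scalars instead of tensors.
    return {k: decay * ema[k] + (1.0 - decay) * w[k] for k in ema}

def blend_swa_ema(swa, ema, alpha=0.5):
    # 50/50 convex combination of SWA and EMA parameters, per the README.
    return {k: alpha * swa[k] + (1.0 - alpha) * ema[k] for k in swa}

swa = {"w": 1.0, "b": 3.0}
ema = {"w": 3.0, "b": 5.0}
print(blend_swa_ema(swa, ema))  # {'w': 2.0, 'b': 4.0}
```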
## Configuration

```bash
torchrun --standalone --nproc_per_node=8 train_gpt.py

# Key env vars (all defaults in code):
# CROWNQ_LAMBDA=0.01 — CROWN-Q penalty weight
# CROWNQ_WARMDOWN_ONLY=1 — only apply during warmdown
# LATE_QAT_THRESHOLD=0.15 — QAT activation point
# MAX_WALLCLOCK_SECONDS=585 — training budget
# WARMDOWN_ITERS=4000 — warmdown length
# TTT_ENABLED=0 — TTT disabled for this submission
```
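Reading hyperparameters from the environment with the documented defaults might look like this. The `env_float` helper is hypothetical, not taken from `train_gpt.py`.

```python
import os

def env_float(name, default):
    # Hypothetical helper: read a hyperparameter from the environment,
    # falling back to the documented default when the variable is unset.
    return float(os.environ.get(name, default))

CROWNQ_LAMBDA = env_float("CROWNQ_LAMBDA", "0.01")
LATE_QAT_THRESHOLD = env_float("LATE_QAT_THRESHOLD", "0.15")
MAX_WALLCLOCK_SECONDS = env_float("MAX_WALLCLOCK_SECONDS", "585")
TTT_ENABLED = int(os.environ.get("TTT_ENABLED", "0"))  # disabled by default
print(CROWNQ_LAMBDA, MAX_WALLCLOCK_SECONDS, TTT_ENABLED)
```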
## Results

| Seed | Steps | Post-EMA BPB | Sliding BPB | Artifact (bytes) |
|------|-------|--------------|-------------|------------------|
| 1337 | 6613 | 1.1387 | **1.1189** | 15,945,134 |
| 42 | 6612 | 1.1382 | **1.1189** | 15,947,742 |
| 7 | 6613 | 1.1378 | **1.1179** | 15,938,790 |
| **Mean** | | 1.1382 | **1.1186** | |
| **Std** | | | 0.0006 | |

- Step speed: 87 ms/step (FA3 Hopper)
- Quant gap (roundtrip): ~0.004 BPB
- Sliding-window eval time: ~75 s
- Training time: 585 s (under the 600 s budget)
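As a sanity check, the per-seed val_loss (nats/token) and val_bpb pairs are mutually consistent if one assumes the usual conversion `bpb = loss * tokens_per_byte / ln 2`; that relationship and the implied tokens-per-byte ratio are assumptions, not stated in the submission.

```python
import math

# (val_loss, val_bpb) per seed, from the table and submission.json
seeds = {
    1337: (1.8891, 1.1189),
    42:   (1.8891, 1.1189),
    7:    (1.8876, 1.1179),
}

# Under the assumed relationship bpb = loss * tokens_per_byte / ln(2),
# every seed should imply nearly the same tokens-per-byte ratio.
ratios = [bpb * math.log(2) / loss for loss, bpb in seeds.values()]
print([round(r, 4) for r in ratios])  # all close to ~0.41 tokens/byte
```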
## What is CROWN-Q?

CROWN-Q (Curvature-Regularized Optimization for Weight Noise Quantization) adds a training-time penalty that makes weights more robust to quantization noise:

1. For each weight matrix, compute the per-row quantization step size `delta = row_max / 15`.
2. Compute the quantization noise variance `delta^2 / 12` (uniform rounding noise).
3. Weight it by the per-row curvature proxy `h = mean(w^2)` (mean of squared weights).
4. The penalty `lambda * sum(h * quant_var)` encourages the optimizer to shrink weights in directions where quantization noise is most damaging.

The CROWN-Q step size (`row_max/15`) is intentionally larger than the actual quantizer step size (`row_max/31`, clip_range=31). This over-penalization pushes weights further into flat basins, providing an extra robustness margin against quantization damage.

The penalty is applied only during warmdown, when QAT is active, and adds zero eval-time cost.
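The four steps above can be sketched directly. This is a minimal re-implementation from the prose description, operating on plain floats; the submission's actual tensor code may differ.

```python
def crownq_penalty(W, lam=0.01):
    # Minimal sketch of the CROWN-Q penalty for one weight matrix W
    # (a list of rows of floats), following the four steps in the README.
    total = 0.0
    for row in W:
        row_max = max(abs(w) for w in row)      # per-row dynamic range
        delta = row_max / 15.0                  # step 1: CROWN-Q step size
        quant_var = delta * delta / 12.0        # step 2: uniform rounding-noise variance
        h = sum(w * w for w in row) / len(row)  # step 3: curvature proxy, mean(w^2)
        total += h * quant_var                  # step 4: curvature-weighted variance
    return lam * total

W = [[0.5, -1.5, 0.75], [2.0, 0.25, -0.5]]
print(crownq_penalty(W))  # small positive penalty
```

Because both `quant_var` and `h` grow with weight magnitude, the gradient of this penalty shrinks large weights fastest, which is what pushes rows toward flatter, quantization-friendly basins.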
## Included Files

- `train_gpt.py` — self-contained training script
- `submission.json` — submission metadata
- `README.md` — this file
- `train_seed1337.log` — seed 1337 training log
- `train_seed42.log` — seed 42 training log
- `train_seed7.log` — seed 7 training log
`submission.json`:
```json
{
  "author": "Ethan Yang",
  "github_id": "EthanYangTW",
  "name": "CROWN-Q + Full GPTQ + SWA/EMA Blend",
  "blurb": "Curvature-weighted quantization variance penalty (CROWN-Q) during warmdown reduces quantization damage. Full Cholesky GPTQ with act-order, SWA/EMA 50/50 blend, VRL, XSA last 4 layers, LeakyReLU(0.5)^2. Sliding window eval only, no TTT.",
  "date": "2026-03-25T06:30:00Z",
  "val_loss": 1.8886,
  "val_loss_std": 0.0009,
  "val_bpb": 1.1186,
  "val_bpb_std": 0.0006,
  "seeds": [1337, 42, 7],
  "seed_results": {
    "1337": { "val_bpb": 1.1189, "val_loss": 1.8891, "bytes": 15945134 },
    "42":   { "val_bpb": 1.1189, "val_loss": 1.8891, "bytes": 15947742 },
    "7":    { "val_bpb": 1.1179, "val_loss": 1.8876, "bytes": 15938790 }
  },
  "bytes_total": 15947742,
  "bytes_code": 95390
}
```
Review comment: the README states "No test-time training" (eval is pure sliding-window inference), but the included logs show `ttt:start` / `ttt_sliding:start` being run. Please reconcile this, either by regenerating the logs with TTT disabled, or by clarifying in the README that the TTT section in the logs was a separate diagnostic run and not part of the reported score.