# CROWN-Q + Full GPTQ + SWA/EMA Blend

## Summary

- **CROWN-Q**: Curvature-weighted quantization variance penalty applied during warmdown. Encourages weights to settle in flat minima where int6 quantization causes less damage. Penalty: `lambda * mean(h_j) * delta_j^2 / 12` per row, where `h_j = w^2` (curvature proxy) and `delta_j = row_max / 15` (CROWN-Q step size). Note: the GPTQ/QAT quantizer uses clip_range=31; CROWN-Q intentionally uses a larger step size (row_max/15) to over-penalize and push weights further into flat basins.
- **Full Cholesky GPTQ**: Hessian-aware quantization with act-order column permutation, block_size=128, 256-sample calibration from training data. GPTQ runs after the 585s training phase as part of model export.
- **SWA/EMA 50/50 blend**: Stochastic Weight Averaging (every 50 steps during warmdown) blended 50/50 with EMA (decay=0.997).
- **Architecture**: 11L, 512d, GQA 8H/4KV, MLP 3x LeakyReLU(0.5)^2, XSA on last 4 layers (7-10), VRL, BigramHash 3072, partial RoPE 16/64.
- **Eval**: Sliding window with stride=64. No test-time training.
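
The SWA/EMA bookkeeping described above can be sketched as follows. This is an illustrative, dependency-free version on scalar weights (the actual `train_gpt.py` operates on parameter tensors, and the function name `blend_swa_ema` is made up here):

```python
def blend_swa_ema(weights_per_step, ema_decay=0.997, swa_every=50, blend=0.5):
    """Track an EMA and a periodic SWA average over a weight trajectory,
    then blend them 50/50 for the final checkpoint (illustrative sketch)."""
    ema = None
    swa_sum, swa_count = 0.0, 0
    for step, w in enumerate(weights_per_step, start=1):
        # EMA: exponential moving average with decay 0.997
        ema = w if ema is None else ema_decay * ema + (1 - ema_decay) * w
        # SWA: plain average of snapshots taken every `swa_every` steps
        if step % swa_every == 0:
            swa_sum += w
            swa_count += 1
    swa = swa_sum / max(swa_count, 1)
    # Final checkpoint: 50/50 blend of the two averages
    return blend * swa + (1 - blend) * ema
```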

Comment on lines +8 to +10

**Copilot AI** (Mar 25, 2026):

README states “No test-time training” (eval is pure sliding-window inference), but the included logs show `ttt:start` / `ttt_sliding:start` being run. Please reconcile this by regenerating the logs with TTT disabled, or by clarifying in the README that the TTT section in the logs was a separate diagnostic run and not part of the reported score.
## Configuration

```bash
torchrun --standalone --nproc_per_node=8 train_gpt.py

# Key env vars (all defaults in code):
# CROWNQ_LAMBDA=0.01 — CROWN-Q penalty weight
# CROWNQ_WARMDOWN_ONLY=1 — only apply during warmdown
# LATE_QAT_THRESHOLD=0.15 — QAT activation point
# MAX_WALLCLOCK_SECONDS=585 — training budget
# WARMDOWN_ITERS=4000 — warmdown length
# TTT_ENABLED=0 — TTT disabled for this submission
```
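
The env vars above can be read with code defaults roughly like this. The variable names come from the comment block; the parsing helpers (`env_float`, `env_int`) are assumptions, not necessarily what `train_gpt.py` does:

```python
import os

def env_float(name, default):
    """Read a float hyperparameter from the environment, falling back to the code default."""
    return float(os.environ.get(name, default))

def env_int(name, default):
    """Same, for integer-valued flags and budgets."""
    return int(os.environ.get(name, default))

CROWNQ_LAMBDA = env_float("CROWNQ_LAMBDA", 0.01)            # CROWN-Q penalty weight
CROWNQ_WARMDOWN_ONLY = env_int("CROWNQ_WARMDOWN_ONLY", 1)   # only apply during warmdown
LATE_QAT_THRESHOLD = env_float("LATE_QAT_THRESHOLD", 0.15)  # QAT activation point
MAX_WALLCLOCK_SECONDS = env_int("MAX_WALLCLOCK_SECONDS", 585)
WARMDOWN_ITERS = env_int("WARMDOWN_ITERS", 4000)
TTT_ENABLED = env_int("TTT_ENABLED", 0)                     # TTT disabled for this submission
```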

## Results

| Seed | Steps | Post-EMA BPB | Sliding BPB | Artifact (bytes) |
|------|-------|-------------|-------------|----------|
| 1337 | 6613 | 1.1387 | **1.1189** | 15,945,134 |
| 42 | 6612 | 1.1382 | **1.1189** | 15,947,742 |
| 7 | 6613 | 1.1378 | **1.1179** | 15,938,790 |
| **Mean** | | 1.1382 | **1.1186** | |
| **Std** | | | 0.0006 | |

- Step speed: 87ms/step (FA3 Hopper)
- Quant gap (roundtrip): ~0.004 BPB
- Sliding window eval time: ~75s
- Training time: 585s (under 600s budget)
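
The roundtrip quant gap above comes from the int6 round trip. A minimal per-row symmetric quantizer with `clip_range=31` (matching the quantizer step size `row_max/31` discussed in the CROWN-Q section; the real exporter additionally applies GPTQ, which this sketch omits):

```python
def quantize_row(row, clip_range=31):
    """Symmetric per-row quantization: delta = row_max / clip_range,
    codes clipped to [-clip_range, clip_range] (6-bit signed range)."""
    row_max = max(abs(x) for x in row) or 1.0  # guard against all-zero rows
    delta = row_max / clip_range
    codes = [max(-clip_range, min(clip_range, round(x / delta))) for x in row]
    return codes, delta

def dequantize_row(codes, delta):
    """Recover approximate weights from integer codes and the row step size."""
    return [c * delta for c in codes]

row = [0.31, -0.07, 0.004, -0.29]
codes, delta = quantize_row(row)
restored = dequantize_row(codes, delta)
# round-to-nearest bounds the per-weight error by delta / 2
assert all(abs(a - b) <= delta / 2 + 1e-12 for a, b in zip(row, restored))
```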

## What is CROWN-Q?

CROWN-Q (Curvature-Regularized Optimization for Weight Noise Quantization) adds a training-time penalty that makes weights more robust to quantization noise:

1. For each weight matrix, compute the per-row quantization step size `delta = row_max / 15`
2. Compute quantization variance `delta^2 / 12` (uniform rounding noise)
3. Weight by curvature proxy `h = mean(w^2)` per row (mean of squared weights)
4. Penalty: `lambda * sum(h * quant_var)` encourages the optimizer to reduce weights in directions where quantization noise is most damaging

The CROWN-Q step size (row_max/15) is intentionally larger than the actual quantizer step size (row_max/31, clip_range=31). This over-penalization pushes weights further into flat basins, providing extra robustness margin against quantization damage.

Applied only during warmdown when QAT is active. Zero eval-time cost.
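The four steps can be written down directly. A dependency-free sketch of the penalty for one weight matrix, using plain lists instead of tensors (the real implementation would compute this on GPU under autograd so the gradient flows back into the weights):

```python
def crownq_penalty(weight_rows, lam=0.01, crownq_steps=15):
    """CROWN-Q penalty: lambda * sum over rows of h_j * delta_j^2 / 12,
    with delta_j = row_max / 15 and h_j = mean of squared weights in row j."""
    penalty = 0.0
    for row in weight_rows:
        row_max = max(abs(x) for x in row)
        if row_max == 0.0:
            continue  # all-zero row contributes nothing
        delta = row_max / crownq_steps          # CROWN-Q step (coarser than quantizer's /31)
        quant_var = delta * delta / 12.0        # variance of uniform rounding noise
        h = sum(x * x for x in row) / len(row)  # curvature proxy: mean squared weight
        penalty += h * quant_var
    return lam * penalty
```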

## Included Files

- `train_gpt.py` — self-contained training script
- `submission.json` — submission metadata
- `README.md` — this file
- `train_seed1337.log` — seed 1337 training log
- `train_seed42.log` — seed 42 training log
- `train_seed7.log` — seed 7 training log

`submission.json`:

{
"author": "Ethan Yang",
"github_id": "EthanYangTW",
"name": "CROWN-Q + Full GPTQ + SWA/EMA Blend",
"blurb": "Curvature-weighted quantization variance penalty (CROWN-Q) during warmdown reduces quantization damage. Full Cholesky GPTQ with act-order, SWA/EMA 50/50 blend, VRL, XSA last 4 layers, LeakyReLU(0.5)^2. Sliding window eval only, no TTT.",
"date": "2026-03-25T06:30:00Z",
"val_loss": 1.8886,
"val_loss_std": 0.0009,
"val_bpb": 1.1186,
"val_bpb_std": 0.0006,
"seeds": [1337, 42, 7],
Comment on lines +6 to +11

**Copilot AI** (Mar 25, 2026):

submission.json deviates from the schema used by the other /records/track_10min_16mb/*/submission.json examples (e.g., it is missing fields such as pre_quant_val_loss, pre_quant_val_bpb, step_stop, wallclock_seconds, eval_time_seconds, and a bytes_model_* breakdown). If any tooling expects the established keys, this new format may break ingestion; consider aligning with the existing schema and adding the missing fields while keeping the per-seed breakdown as extra metadata.
"seed_results": {
"1337": {
"val_bpb": 1.1189,
"val_loss": 1.8891,
"bytes": 15945134
},
"42": {
"val_bpb": 1.1189,
"val_loss": 1.8891,
"bytes": 15947742
},
"7": {
"val_bpb": 1.1179,
"val_loss": 1.8876,
"bytes": 15938790
}
},
"bytes_total": 15947742,
"bytes_code": 95390
}