openai · harborglowvintage-oss · Mar 22, 2026
diff --git a/...ck_10min_16mb/2026-03-22_Int5Int6_BigramHash_SmearGate_SWA_LLMAdvisor/README.md b/...ck_10min_16mb/2026-03-22_Int5Int6_BigramHash_SmearGate_SWA_LLMAdvisor/README.md
@@ -0,0 +1,100 @@
+# 10L Int5/Int6 + BigramHash(10240) + SmearGate + SWA Boost
+
+**LLMAdvisor.ai — powered by HighSignal™**
+
+## Score
+
+**Best val_bpb = 1.14638** (seed=1337, SWA boost config, 8×H100 SXM, 600s wallclock)
+
+| Run | Seed | SWA Config | Steps | val_loss | val_bpb |
+|-----|------|-----------|-------|----------|---------|
+| SWA boost | 1337 | every=30, frac=0.50 (49 ckpts) | 7,372 | 1.93562 | **1.14638** |
+| Standard | 1337 | every=50, frac=0.36 (21 ckpts) | 7,376 | 1.93571 | 1.14644 |
+| WD=2250 | 2024 | every=50, frac=0.36 | 7,387 | 1.93680 | 1.14709 |
+
+Artifact size: **15,736,555 bytes** (263,445 bytes under the 16MB limit)
+
+## Run Command
+
+```bash
+# Setup (once)
+pip install sentencepiece zstandard huggingface_hub
+python cached_challenge_fineweb.py --variant sp1024
+
+# Train + evaluate (best config: SWA boost)
+SEED=1337 SWA_EVERY=30 SWA_START_FRAC=0.50 \
+  torchrun --nproc_per_node=8 --standalone train_gpt.py
+```
+
+Default env vars reproduce the standard run. Override `SEED`, `SWA_EVERY`, and `SWA_START_FRAC` for the SWA boost config above.
+
+## Approach
+
+### 1. Mixed Int5/Int6 Quantization
+- **Int5 [-16,15]** for MLP weights — saves ~1.86MB vs uniform int6, funding the 10th layer
+- **Int6 [-32,31]** for attention weights — precision-sensitive
+- **FP16** for tied embeddings and last-layer key projections
+- **zstd level 22** compression
+
+### 2. BigramHash(10240, dim=128)
+- Hash consecutive token pairs into 10,240-bucket embedding table (dim=128)
+- Projected to model dim=512 via learned linear — captures local token-pair context
+
+### 3. SmearGate
+- Learned per-dimension gate blending current + previous token embeddings
+- Initialized near-identity for stable early training
+
+### 4. SWA Density Sweep
+- **SWA boost**: every=30 steps, start_frac=0.50 → 49 averaged checkpoints (best: 1.14638)
+- **Standard**: every=50 steps, start_frac=0.36 → 21 averaged checkpoints (1.14644)
+- Denser SWA collection provides marginal but consistent improvement
+
+### 5. Reduced Batch Size (622,592 tokens)
+- 75% of the standard 786K batch → ~81ms/step (vs ~117ms at full batch)
+- ~7,370 training steps in 600s wallclock (vs ~5,100)
+- More steps overcomes slightly noisier gradients
+
+## Architecture
+
+| Parameter | Value |
+|-----------|-------|
+| Layers | 10 |
+| Model dim | 512 |
+| Heads | 8 (4 KV heads, GQA) |
+| MLP hidden | 1536 (3× expansion) |
+| Activation | relu² |
+| Vocab size | 1024 (sp1024 BPE) |
+| Embeddings | Tied input/output, FP16 |
+| Init | Orthogonal with muP-scaled outputs |
+| Skip connections | U-Net style |
+
+## Training Hyperparameters
+
+| Parameter | Value |
+|-----------|-------|
+| Optimizer (matrix) | Muon, lr=0.02, momentum=0.99 |
+| Optimizer (embed/scalar) | AdamW, lr=0.02 |
+| Weight decay | 0.04 (decoupled) |
+| Batch size | 622,592 tokens |
+| Sequence length | 2,048 |
+| Warmup | 20 steps |
+| Warmdown | 3,000 iters |
+| Gradient clipping | 0.3, 3% magnitude pruning |
+| SWA (boost) | start_frac=0.50, every=30 steps |
+| Wall clock cap | 600 seconds |
+
+## Evaluation
+
+- Sliding-window evaluation with stride=64
+- BPB = (val_loss / ln(2)) × (tokens / bytes)
+- Eval time: ~259 seconds
+
+## Hardware
+
+- 8× NVIDIA H100 SXM GPUs
+- RunPod cloud instance
+- DDP training with torchrun
+
+## Acknowledgments
+
+Built on the SOTA techniques from [thwu1](https://github.com/KellerJordan/parameter-golf/tree/main/records/track_10min_16mb/2026-03-20_10L_Int5MLP_MuonWD04_SWA50) (1.14276 BPB) and [Raahil Shah](https://github.com/KellerJordan/parameter-golf/tree/main/records/track_10min_16mb/2026-03-20_Int6_MLP3x_SmearGate_BigramHash_MuonWD_SWA) (1.1458 BPB). Key adaptations: reduced batch size for faster step throughput, SWA density sweep (every=30/frac=0.50 vs every=50/frac=0.40), and PyTorch version auto-detection for GQA compatibility.
diff --git a/.../track_10min_16mb/2026-03-22_Int5Int6_BigramHash_SmearGate_SWA_LLMAdvisor/submission.json b/.../track_10min_16mb/2026-03-22_Int5Int6_BigramHash_SmearGate_SWA_LLMAdvisor/submission.json
@@ -0,0 +1,21 @@
+{
+  "author": "LLMAdvisor.ai",
+  "github_id": "harborglowvintage-oss",
+  "name": "10L Int5/Int6 + BigramHash(10240) + SmearGate + SWA Boost",
+  "blurb": "10L GQA (8h/4kv), MLP 3x relu^2, BigramHash(10240, dim=128), SmearGate, orthogonal init, U-Net skips. Mixed int5 MLP / int6 attention quantization + FP16 embeddings + zstd-22. Reduced batch (622592 tokens) for faster step throughput (~81ms, ~7370 steps in 600s). SWA boost: every=30 steps, start_frac=0.50, averaging 49 checkpoints. Sliding-window eval stride=64.",
+  "date": "2026-03-22",
+  "val_loss": 1.93561706,
+  "val_bpb": 1.14638448,
+  "seeds": [1337],
+  "seed_results": {
+    "1337_swa_boost": {"val_loss": 1.93561706, "val_bpb": 1.14638448, "note": "SWA_EVERY=30, SWA_START_FRAC=0.50, 49 ckpts"},
+    "1337_standard": {"val_loss": 1.93570710, "val_bpb": 1.14643781, "note": "SWA_EVERY=50, SWA_START_FRAC=0.36, 21 ckpts"},
+    "2024_wd2250": {"val_loss": 1.93680082, "val_bpb": 1.14708558, "note": "WARMDOWN=2250"}
+  },
+  "step_stop": 7372,
+  "wallclock_seconds": 599.989,
+  "eval_time_seconds": 258.721,
+  "bytes_total": 15736555,
+  "bytes_model_int6_zstd": 15673630,
+  "bytes_code": 62925
+}