19 commits
3c19182
Add recurrent depth-looped transformer for Parameter Golf
chrislovescoding Mar 18, 2026
e58cdc2
Add setup.sh for venv + dependencies
chrislovescoding Mar 18, 2026
b580d94
Fix GQA compatibility for 3090 (manual KV head repeat)
chrislovescoding Mar 18, 2026
0c40991
Skip torch.compile on Windows (no Triton support)
chrislovescoding Mar 18, 2026
b79a3a2
Add training dashboard (single HTML file)
chrislovescoding Mar 18, 2026
4b7e107
Scale up to Config D: 6 blocks, dim=704, hidden=1088
chrislovescoding Mar 19, 2026
eabe9d0
Add sliding window eval + FP16 embedding quantization
chrislovescoding Mar 19, 2026
78397d6
Default to seq_len=4096 training for better context modeling
chrislovescoding Mar 19, 2026
2cf468e
Add train_v2.py: proven competition stack on baseline architecture
chrislovescoding Mar 19, 2026
6f512d9
Add train_v3.py: full proven competition stack
chrislovescoding Mar 19, 2026
4fbffb6
Add train_v4.py: full stack + int6 QAT + SWA
chrislovescoding Mar 19, 2026
6fd2e7c
Fix sliding window eval crash with torch.compile
chrislovescoding Mar 19, 2026
9771e5c
Add test-time training (TTT) for eval-time adaptation
chrislovescoding Mar 19, 2026
42e4e5b
Fix TTT crash: use fresh uncompiled model for eval-time training
chrislovescoding Mar 19, 2026
2d33cb1
Add train_v5.py: Neural + Classical Hybrid submission
chrislovescoding Mar 24, 2026
dbf3b11
Fix EMA+QAT conflict: skip EMA when QAT is active
chrislovescoding Mar 24, 2026
c00ee9f
Optimize PPM: fully vectorized numpy bigram model (~3s for 62M tokens)
chrislovescoding Mar 24, 2026
43d0d28
Add train_bitnet.py: Ternary BitNet — 65M params in 16MB
chrislovescoding Mar 25, 2026
b205c8f
Non-record: BitNet Ternary — 65M params in 15.9MB (1.1932 BPB)
chrislovescoding Mar 25, 2026
1,810 changes: 1,810 additions & 0 deletions dashboard.html


@@ -0,0 +1,66 @@
# BitNet Ternary: 65M Parameters in 15.9MB

## Summary

Ternary weight quantization ({-1, 0, +1} at ~1.58 bits/weight) enables fitting **65M parameters** in a 15.9MB artifact — 3x the parameter count of standard int6 submissions (~22M params) at similar artifact size.

This explores a fundamentally different axis of optimization: instead of aggressive quantization of a small model, we train a much larger model with extreme quantization from the start.
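A quick back-of-envelope check of the budget (a sketch using the reported parameter count; the true artifact is larger than this floor because the ternary codes are not uniformly distributed, and the fp16 embedding and per-row scales add overhead):

```python
import math

params = 64_529_040          # reported model parameter count
entropy_bits = math.log2(3)  # ~1.585 bits/weight for uniform ternary codes
ideal_mb = params * entropy_bits / 8 / 2**20
print(f"information-theoretic floor: {ideal_mb:.1f} MiB")
```

The floor comes out just above 12 MiB, so the 15.9MB artifact is within a few MiB of the best any entropy coder could do on uniform ternary codes.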

## Approach

**Architecture:** 12 layers, 768 dim, 12 heads, 6 KV heads (GQA), 3x MLP expansion (hidden=2304), LeakyReLU(0.5)-squared, tied embeddings, U-Net skip connections.

**Ternary Training (STE):** The optimizer maintains full-precision shadow weights; the forward pass quantizes them to ternary with a Straight-Through Estimator (STE):
- Per-row scale = mean(|w|) per row
- Threshold = 0.7 * scale
- Values above threshold -> +1, below -threshold -> -1, else -> 0
- Backward pass: gradients flow through as identity (STE)
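The quantizer above can be sketched in a few lines of PyTorch (a minimal illustration of the described rule, not the submission's code; `ternary_quantize` and `ternary_ste` are hypothetical names):

```python
import torch

def ternary_quantize(w: torch.Tensor):
    """Per-row scale = mean(|w|); threshold = 0.7 * scale;
    sign outside the threshold, zero inside."""
    scale = w.abs().mean(dim=-1, keepdim=True)  # per-row scale
    thresh = 0.7 * scale
    q = torch.zeros_like(w)
    q[w > thresh] = 1.0
    q[w < -thresh] = -1.0
    return q, scale

def ternary_ste(w: torch.Tensor) -> torch.Tensor:
    """Forward: dequantized ternary weights. Backward: identity (STE),
    because the (w_q - w) correction is detached from the graph."""
    q, scale = ternary_quantize(w)
    w_q = q * scale
    return w + (w_q - w).detach()
```

The `w + (w_q - w).detach()` trick is the standard way to get a quantized forward pass with an identity backward pass in autograd.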

**Activation schedule:** Full-precision training for the first 30% of the wallclock budget, then ternary STE for the remaining 70%. This lets the model learn good representations before adapting to the quantization constraint.

**Compression:** Ternary values {-1,0,1} stored as int8, compressed with zlib-9 (or zstd-22 when available for ~1MB savings). Since there are only 3 distinct values, compression achieves excellent ratios. Per-row fp16 scales for dequantization. Embedding kept as fp16.
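A serialization round-trip along these lines (a sketch under the stated scheme — int8 codes plus per-row fp16 scales, zlib level 9; function names and the byte layout are illustrative assumptions, not the submission's format):

```python
import io
import zlib
import numpy as np

def compress_ternary(q: np.ndarray, scales: np.ndarray) -> bytes:
    """Pack per-row fp16 scales followed by int8 ternary codes, zlib-9."""
    buf = io.BytesIO()
    buf.write(scales.astype(np.float16).tobytes())
    buf.write(q.astype(np.int8).tobytes())
    return zlib.compress(buf.getvalue(), 9)

def decompress_ternary(blob: bytes, shape) -> np.ndarray:
    """Inverse of compress_ternary: dequantize codes * per-row scale."""
    rows, cols = shape
    raw = zlib.decompress(blob)
    scales = np.frombuffer(raw[: rows * 2], dtype=np.float16).reshape(rows, 1)
    q = np.frombuffer(raw[rows * 2 :], dtype=np.int8).reshape(rows, cols)
    return q.astype(np.float32) * scales.astype(np.float32)
```

With only three distinct byte values per row, zlib gets close to the ~1.58 bits/weight entropy floor without any custom entropy coder.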

**Evaluation:** Sliding window with stride=64 for improved BPB.
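The sliding-window scheme can be sketched as follows (a hypothetical helper, not the submission's eval loop; it assumes `model(x)` returns logits of shape `(batch, seq, vocab)` and reports bits per token — converting to BPB additionally needs the tokens-per-byte ratio):

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bits(model, tokens, window=1024, stride=64):
    """Advance the window `stride` tokens at a time so each token is
    scored with up to window-1 tokens of left context, counting only
    the not-yet-scored predictions at the end of each window."""
    n = tokens.numel()
    nll, count, prev_end = 0.0, 0, 0
    for start in range(0, n - 1, stride):
        end = min(start + window, n)
        x = tokens[start:end].unsqueeze(0)
        logits = model(x[:, :-1])  # predicts positions start+1 .. end-1
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            x[:, 1:].reshape(-1), reduction="none")
        new = end - max(prev_end, start + 1)  # predictions not yet scored
        nll += loss[-new:].sum().item()
        count += new
        prev_end = end
        if end == n:
            break
    return nll / count / math.log(2)  # nats -> bits per token
```

The cost scales roughly as `window / stride` forward passes per token scored, which is why this eval took ~138s versus ~3.5s for the plain chunked eval in the log below.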

## Configuration

```
VOCAB_SIZE=1024, NUM_LAYERS=12, MODEL_DIM=768
NUM_HEADS=12, NUM_KV_HEADS=6, MLP_MULT=3
TRAIN_SEQ_LEN=1024, TRAIN_BATCH_TOKENS=524288
MATRIX_LR=0.02, SCALAR_LR=0.02, MUON_MOMENTUM=0.99
WARMDOWN_ITERS=3000, TERNARY_START_FRAC=0.3
```

## Run Command

```bash
RUN_ID=bitnet_final torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Results

- **Model params:** 64,529,040
- **Artifact size:** 15,878,267 bytes (code + ternary zlib-9)
- **Pre-quant val_bpb:** 1.2268
- **Post-quant val_bpb:** 1.2271
- **Quantization gap:** 0.0003 BPB
- **Sliding window val_bpb:** 1.1932 (stride=64)
- **Steps:** 5,026 / 20,000 (wallclock cap at 600s)
- **Step avg:** 119.38 ms
- **Peak memory:** 23,774 MiB
- **Training tokens:** ~2.6B (5,026 steps x 524,288 tokens/step)

## Key Findings

1. **Ternary training works at 65M scale** in a 10-minute budget — the loss recovers fully after the ternary transition.
2. **Quantization gap is near-zero** (~0.0003 BPB) because the model is trained with ternary STE.
3. **3x more parameters** fit in the same artifact budget compared to int6 quantization.
4. The ternary approach opens a new frontier for parameter-constrained language modeling that is orthogonal to the int6/GPTQ approaches used by other submissions.

## Files

- `README.md` — This file
- `submission.json` — Run metadata
- `train.log` — Full training log
- `train_gpt.py` — Training script (renamed from train_bitnet.py)
@@ -0,0 +1,19 @@
{
"author": "Chris",
"github_id": "chrislovescoding",
"name": "BitNet Ternary 65M 12x768",
"blurb": "Ternary weights {-1,0,1} at 1.58 bits/weight: 65M params in 15.9MB. 3x more parameters than int6 submissions. Trained with STE, near-zero quantization gap (0.0003 BPB). Sliding window val_bpb: 1.1932.",
"date": "2026-03-25T01:00:00Z",
"track": "non-record-16mb",
"val_loss": 2.07197067,
"val_bpb": 1.22713774,
"pre_quant_val_loss": 2.0715,
"pre_quant_val_bpb": 1.2268,
"sliding_window_val_loss": 2.01474326,
"sliding_window_val_bpb": 1.19324548,
"step_stop": 5026,
"wallclock_seconds": 600.0,
"bytes_total": 15878267,
"bytes_model_zlib": 15834135,
"bytes_code": 44132
}
@@ -0,0 +1,59 @@
logs/bitnet_final.txt
model_params:64529040 model_type:bitnet_ternary
num_layers:12 model_dim:768 num_heads:12
world_size:8 grad_accum_steps:1
train_batch_tokens:524288 train_seq_len:1024
max_wallclock_seconds:600.0 seed:1337
warmup_step:10/20
warmup_step:20/20
step:0/20000 val_loss:6.9656 val_bpb:4.1254 train_time:0ms step_avg:0.01ms
step:1/20000 train_loss:6.9655 train_time:151ms step_avg:151.32ms
step:2/20000 train_loss:16.5100 train_time:239ms step_avg:119.31ms
step:3/20000 train_loss:11.6327 train_time:350ms step_avg:116.58ms
step:4/20000 train_loss:7.9948 train_time:461ms step_avg:115.15ms
step:5/20000 train_loss:6.3160 train_time:572ms step_avg:114.34ms
step:6/20000 train_loss:7.0953 train_time:683ms step_avg:113.80ms
step:7/20000 train_loss:6.0534 train_time:794ms step_avg:113.38ms
step:8/20000 train_loss:5.9401 train_time:907ms step_avg:113.42ms
step:9/20000 train_loss:5.8418 train_time:1020ms step_avg:113.29ms
step:10/20000 train_loss:5.7538 train_time:1131ms step_avg:113.07ms
step:200/20000 train_loss:2.6630 train_time:23579ms step_avg:117.90ms
step:400/20000 train_loss:2.1816 train_time:47591ms step_avg:118.98ms
step:600/20000 train_loss:2.3804 train_time:71303ms step_avg:118.84ms
step:800/20000 train_loss:2.1471 train_time:95465ms step_avg:119.33ms
step:1000/20000 train_loss:2.2450 train_time:119517ms step_avg:119.52ms
step:1000/20000 val_loss:2.1955 val_bpb:1.3003 train_time:119548ms step_avg:119.55ms
step:1200/20000 train_loss:2.2516 train_time:143260ms step_avg:119.38ms
step:1400/20000 train_loss:2.3125 train_time:167463ms step_avg:119.62ms
ternary:activated step:1517 elapsed_ms:180044
step:1600/20000 train_loss:2.2312 train_time:221825ms step_avg:138.64ms
step:1800/20000 train_loss:2.2818 train_time:245600ms step_avg:136.44ms
step:2000/20000 train_loss:2.2845 train_time:267583ms step_avg:133.79ms
step:2000/20000 val_loss:2.2925 val_bpb:1.3577 train_time:267613ms step_avg:133.81ms
step:2200/20000 train_loss:2.3734 train_time:289499ms step_avg:131.59ms
step:2400/20000 train_loss:2.3694 train_time:311604ms step_avg:129.83ms
step:2600/20000 train_loss:2.2143 train_time:333560ms step_avg:128.29ms
step:2800/20000 train_loss:2.1584 train_time:355471ms step_avg:126.95ms
step:3000/20000 train_loss:3.1940 train_time:377461ms step_avg:125.82ms
step:3000/20000 val_loss:2.1844 val_bpb:1.2937 train_time:377493ms step_avg:125.83ms
step:3200/20000 train_loss:2.2490 train_time:399494ms step_avg:124.84ms
step:3400/20000 train_loss:2.0703 train_time:421448ms step_avg:123.96ms
step:3600/20000 train_loss:2.1784 train_time:443326ms step_avg:123.15ms
step:3800/20000 train_loss:2.1176 train_time:465282ms step_avg:122.44ms
step:4000/20000 train_loss:2.2336 train_time:487251ms step_avg:121.81ms
step:4000/20000 val_loss:2.1319 val_bpb:1.2626 train_time:487284ms step_avg:121.82ms
step:4200/20000 train_loss:2.1740 train_time:509717ms step_avg:121.36ms
step:4400/20000 train_loss:2.1136 train_time:531597ms step_avg:120.82ms
step:4600/20000 train_loss:2.1448 train_time:553497ms step_avg:120.33ms
step:4800/20000 train_loss:2.0699 train_time:575363ms step_avg:119.87ms
step:5000/20000 train_loss:2.1380 train_time:597196ms step_avg:119.44ms
step:5000/20000 val_loss:2.0730 val_bpb:1.2278 train_time:597227ms step_avg:119.45ms
step:5026/20000 val_loss:2.0715 val_bpb:1.2268 train_time:599996ms step_avg:119.38ms
stopping_early: wallclock_cap train_time:599996ms step:5026/20000
peak memory: 23774 MiB
Serialized model zlib-9: 15834135 bytes
Total submission size: 15878267 bytes
final_ternary_roundtrip val_loss:2.0720 val_bpb:1.2271 eval_time:3525ms
final_ternary_roundtrip_exact val_loss:2.07197067 val_bpb:1.22713774
final_sliding_window val_loss:2.0147 val_bpb:1.1932 stride:64 eval_time:137776ms
final_sliding_window_exact val_loss:2.01474326 val_bpb:1.19324548