19 commits
3c19182
Add recurrent depth-looped transformer for Parameter Golf
chrislovescoding Mar 18, 2026
e58cdc2
Add setup.sh for venv + dependencies
chrislovescoding Mar 18, 2026
b580d94
Fix GQA compatibility for 3090 (manual KV head repeat)
chrislovescoding Mar 18, 2026
0c40991
Skip torch.compile on Windows (no Triton support)
chrislovescoding Mar 18, 2026
b79a3a2
Add training dashboard (single HTML file)
chrislovescoding Mar 18, 2026
4b7e107
Scale up to Config D: 6 blocks, dim=704, hidden=1088
chrislovescoding Mar 19, 2026
eabe9d0
Add sliding window eval + FP16 embedding quantization
chrislovescoding Mar 19, 2026
78397d6
Default to seq_len=4096 training for better context modeling
chrislovescoding Mar 19, 2026
2cf468e
Add train_v2.py: proven competition stack on baseline architecture
chrislovescoding Mar 19, 2026
6f512d9
Add train_v3.py: full proven competition stack
chrislovescoding Mar 19, 2026
4fbffb6
Add train_v4.py: full stack + int6 QAT + SWA
chrislovescoding Mar 19, 2026
6fd2e7c
Fix sliding window eval crash with torch.compile
chrislovescoding Mar 19, 2026
9771e5c
Add test-time training (TTT) for eval-time adaptation
chrislovescoding Mar 19, 2026
42e4e5b
Fix TTT crash: use fresh uncompiled model for eval-time training
chrislovescoding Mar 19, 2026
2d33cb1
Add train_v5.py: Neural + Classical Hybrid submission
chrislovescoding Mar 24, 2026
dbf3b11
Fix EMA+QAT conflict: skip EMA when QAT is active
chrislovescoding Mar 24, 2026
c00ee9f
Optimize PPM: fully vectorized numpy bigram model (~3s for 62M tokens)
chrislovescoding Mar 24, 2026
43d0d28
Add train_bitnet.py: Ternary BitNet — 65M params in 16MB
chrislovescoding Mar 25, 2026
b205c8f
Non-record: BitNet Ternary — 65M params in 15.9MB (1.1932 BPB)
chrislovescoding Mar 25, 2026
1,810 changes: 1,810 additions & 0 deletions dashboard.html


@@ -0,0 +1,66 @@
# BitNet Ternary: 65M Parameters in 15.9MB

## Summary

Ternary weight quantization ({-1, 0, +1} at ~1.58 bits/weight) enables fitting **65M parameters** in a 15.9MB artifact — 3x the parameter count of standard int6 submissions (~22M params) at similar artifact size.

This explores a fundamentally different axis of optimization: instead of aggressive quantization of a small model, we train a much larger model with extreme quantization from the start.
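A quick back-of-envelope check of the budget (a sketch using the reported parameter count; the true artifact is larger than this floor because the ternary codes are not uniformly distributed, and the fp16 embedding and per-row scales add overhead):

```python
import math

params = 64_529_040          # reported model parameter count
entropy_bits = math.log2(3)  # ~1.585 bits/weight for uniform ternary codes
ideal_mb = params * entropy_bits / 8 / 2**20
print(f"information-theoretic floor: {ideal_mb:.1f} MiB")
```

The floor comes out just above 12 MiB, so the 15.9MB artifact is within a few MiB of the best any entropy coder could do on uniform ternary codes.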

## Approach

**Architecture:** 12 layers, 768 dim, 12 heads, 6 KV heads (GQA), 3x MLP expansion (hidden=2304), LeakyReLU(0.5)-squared, tied embeddings, U-Net skip connections.

**Ternary Training (STE):** The optimizer maintains full-precision shadow weights; the forward pass quantizes them to ternary with a Straight-Through Estimator (STE):
- Per-row scale = mean(|w|) per row
- Threshold = 0.7 * scale
- Values above threshold -> +1, below -threshold -> -1, else -> 0
- Backward pass: gradients flow through as identity (STE)
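The quantizer above can be sketched in a few lines of PyTorch (a minimal illustration of the described rule, not the submission's code; `ternary_quantize` and `ternary_ste` are hypothetical names):

```python
import torch

def ternary_quantize(w: torch.Tensor):
    """Per-row scale = mean(|w|); threshold = 0.7 * scale;
    sign outside the threshold, zero inside."""
    scale = w.abs().mean(dim=-1, keepdim=True)  # per-row scale
    thresh = 0.7 * scale
    q = torch.zeros_like(w)
    q[w > thresh] = 1.0
    q[w < -thresh] = -1.0
    return q, scale

def ternary_ste(w: torch.Tensor) -> torch.Tensor:
    """Forward: dequantized ternary weights. Backward: identity (STE),
    because the (w_q - w) correction is detached from the graph."""
    q, scale = ternary_quantize(w)
    w_q = q * scale
    return w + (w_q - w).detach()
```

The `w + (w_q - w).detach()` trick is the standard way to get a quantized forward pass with an identity backward pass in autograd.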

**Activation schedule:** Full-precision training for the first 30% of the wallclock budget, then ternary STE for the remaining 70%. This lets the model learn good representations before adapting to the quantization constraint.

**Compression:** Ternary values {-1,0,1} stored as int8, compressed with zlib-9 (or zstd-22 when available for ~1MB savings). Since there are only 3 distinct values, compression achieves excellent ratios. Per-row fp16 scales for dequantization. Embedding kept as fp16.
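A serialization round-trip along these lines (a sketch under the stated scheme — int8 codes plus per-row fp16 scales, zlib level 9; function names and the byte layout are illustrative assumptions, not the submission's format):

```python
import io
import zlib
import numpy as np

def compress_ternary(q: np.ndarray, scales: np.ndarray) -> bytes:
    """Pack per-row fp16 scales followed by int8 ternary codes, zlib-9."""
    buf = io.BytesIO()
    buf.write(scales.astype(np.float16).tobytes())
    buf.write(q.astype(np.int8).tobytes())
    return zlib.compress(buf.getvalue(), 9)

def decompress_ternary(blob: bytes, shape) -> np.ndarray:
    """Inverse of compress_ternary: dequantize codes * per-row scale."""
    rows, cols = shape
    raw = zlib.decompress(blob)
    scales = np.frombuffer(raw[: rows * 2], dtype=np.float16).reshape(rows, 1)
    q = np.frombuffer(raw[rows * 2 :], dtype=np.int8).reshape(rows, cols)
    return q.astype(np.float32) * scales.astype(np.float32)
```

With only three distinct byte values per row, zlib gets close to the ~1.58 bits/weight entropy floor without any custom entropy coder.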

**Evaluation:** Sliding window with stride=64 for improved BPB.
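The sliding-window scheme can be sketched as follows (a hypothetical helper, not the submission's eval loop; it assumes `model(x)` returns logits of shape `(batch, seq, vocab)` and reports bits per token — converting to BPB additionally needs the tokens-per-byte ratio):

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bits(model, tokens, window=1024, stride=64):
    """Advance the window `stride` tokens at a time so each token is
    scored with up to window-1 tokens of left context, counting only
    the not-yet-scored predictions at the end of each window."""
    n = tokens.numel()
    nll, count, prev_end = 0.0, 0, 0
    for start in range(0, n - 1, stride):
        end = min(start + window, n)
        x = tokens[start:end].unsqueeze(0)
        logits = model(x[:, :-1])  # predicts positions start+1 .. end-1
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            x[:, 1:].reshape(-1), reduction="none")
        new = end - max(prev_end, start + 1)  # predictions not yet scored
        nll += loss[-new:].sum().item()
        count += new
        prev_end = end
        if end == n:
            break
    return nll / count / math.log(2)  # nats -> bits per token
```

The cost scales roughly as `window / stride` forward passes per token scored, which is why this eval took ~138s versus ~3.5s for the plain chunked eval in the log below.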

## Configuration

```
VOCAB_SIZE=1024, NUM_LAYERS=12, MODEL_DIM=768
NUM_HEADS=12, NUM_KV_HEADS=6, MLP_MULT=3
TRAIN_SEQ_LEN=1024, TRAIN_BATCH_TOKENS=524288
MATRIX_LR=0.02, SCALAR_LR=0.02, MUON_MOMENTUM=0.99
WARMDOWN_ITERS=3000, TERNARY_START_FRAC=0.3
```

## Run Command

```bash
RUN_ID=bitnet_final torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Results

- **Model params:** 64,529,040
- **Artifact size:** 15,878,267 bytes (code + ternary zlib-9)
- **Pre-quant val_bpb:** 1.2268
- **Post-quant val_bpb:** 1.2271
- **Quantization gap:** 0.0003 BPB
- **Sliding window val_bpb:** 1.1932 (stride=64)
- **Steps:** 5,026 / 20,000 (wallclock cap at 600s)
- **Step avg:** 119.38 ms
- **Peak memory:** 23,774 MiB
- **Training tokens:** ~2.6B (5,026 steps x 524,288 tokens/step)

## Key Findings

1. **Ternary training works at 65M scale** in a 10-minute budget — the loss recovers fully after the ternary transition.
2. **Quantization gap is near-zero** (~0.0003 BPB) because the model is trained with ternary STE.
3. **3x more parameters** fit in the same artifact budget compared to int6 quantization.
4. The ternary approach opens a new frontier for parameter-constrained language modeling that is orthogonal to the int6/GPTQ approaches used by other submissions.

## Files

- `README.md` — This file
- `submission.json` — Run metadata
- `train.log` — Full training log
- `train_gpt.py` — Training script (renamed from train_bitnet.py)
@@ -0,0 +1,19 @@
{
"author": "Chris",
"github_id": "chrislovescoding",
"name": "BitNet Ternary 65M 12x768",
"blurb": "Ternary weights {-1,0,1} at 1.58 bits/weight: 65M params in 15.9MB. 3x more parameters than int6 submissions. Trained with STE, near-zero quantization gap (0.0003 BPB). Sliding window val_bpb: 1.1932.",
"date": "2026-03-25T01:00:00Z",
"track": "non-record-16mb",
"val_loss": 2.07197067,
"val_bpb": 1.22713774,
"pre_quant_val_loss": 2.0715,
"pre_quant_val_bpb": 1.2268,
"sliding_window_val_loss": 2.01474326,
"sliding_window_val_bpb": 1.19324548,
"step_stop": 5026,
"wallclock_seconds": 600.0,
"bytes_total": 15878267,
"bytes_model_zlib": 15834135,
"bytes_code": 44132
}
@@ -0,0 +1,59 @@
logs/bitnet_final.txt
model_params:64529040 model_type:bitnet_ternary
num_layers:12 model_dim:768 num_heads:12
world_size:8 grad_accum_steps:1
train_batch_tokens:524288 train_seq_len:1024
max_wallclock_seconds:600.0 seed:1337
warmup_step:10/20
warmup_step:20/20
step:0/20000 val_loss:6.9656 val_bpb:4.1254 train_time:0ms step_avg:0.01ms
step:1/20000 train_loss:6.9655 train_time:151ms step_avg:151.32ms
step:2/20000 train_loss:16.5100 train_time:239ms step_avg:119.31ms
step:3/20000 train_loss:11.6327 train_time:350ms step_avg:116.58ms
step:4/20000 train_loss:7.9948 train_time:461ms step_avg:115.15ms
step:5/20000 train_loss:6.3160 train_time:572ms step_avg:114.34ms
step:6/20000 train_loss:7.0953 train_time:683ms step_avg:113.80ms
step:7/20000 train_loss:6.0534 train_time:794ms step_avg:113.38ms
step:8/20000 train_loss:5.9401 train_time:907ms step_avg:113.42ms
step:9/20000 train_loss:5.8418 train_time:1020ms step_avg:113.29ms
step:10/20000 train_loss:5.7538 train_time:1131ms step_avg:113.07ms
step:200/20000 train_loss:2.6630 train_time:23579ms step_avg:117.90ms
step:400/20000 train_loss:2.1816 train_time:47591ms step_avg:118.98ms
step:600/20000 train_loss:2.3804 train_time:71303ms step_avg:118.84ms
step:800/20000 train_loss:2.1471 train_time:95465ms step_avg:119.33ms
step:1000/20000 train_loss:2.2450 train_time:119517ms step_avg:119.52ms
step:1000/20000 val_loss:2.1955 val_bpb:1.3003 train_time:119548ms step_avg:119.55ms
step:1200/20000 train_loss:2.2516 train_time:143260ms step_avg:119.38ms
step:1400/20000 train_loss:2.3125 train_time:167463ms step_avg:119.62ms
ternary:activated step:1517 elapsed_ms:180044
step:1600/20000 train_loss:2.2312 train_time:221825ms step_avg:138.64ms
step:1800/20000 train_loss:2.2818 train_time:245600ms step_avg:136.44ms
step:2000/20000 train_loss:2.2845 train_time:267583ms step_avg:133.79ms
step:2000/20000 val_loss:2.2925 val_bpb:1.3577 train_time:267613ms step_avg:133.81ms
step:2200/20000 train_loss:2.3734 train_time:289499ms step_avg:131.59ms
step:2400/20000 train_loss:2.3694 train_time:311604ms step_avg:129.83ms
step:2600/20000 train_loss:2.2143 train_time:333560ms step_avg:128.29ms
step:2800/20000 train_loss:2.1584 train_time:355471ms step_avg:126.95ms
step:3000/20000 train_loss:3.1940 train_time:377461ms step_avg:125.82ms
step:3000/20000 val_loss:2.1844 val_bpb:1.2937 train_time:377493ms step_avg:125.83ms
step:3200/20000 train_loss:2.2490 train_time:399494ms step_avg:124.84ms
step:3400/20000 train_loss:2.0703 train_time:421448ms step_avg:123.96ms
step:3600/20000 train_loss:2.1784 train_time:443326ms step_avg:123.15ms
step:3800/20000 train_loss:2.1176 train_time:465282ms step_avg:122.44ms
step:4000/20000 train_loss:2.2336 train_time:487251ms step_avg:121.81ms
step:4000/20000 val_loss:2.1319 val_bpb:1.2626 train_time:487284ms step_avg:121.82ms
step:4200/20000 train_loss:2.1740 train_time:509717ms step_avg:121.36ms
step:4400/20000 train_loss:2.1136 train_time:531597ms step_avg:120.82ms
step:4600/20000 train_loss:2.1448 train_time:553497ms step_avg:120.33ms
step:4800/20000 train_loss:2.0699 train_time:575363ms step_avg:119.87ms
step:5000/20000 train_loss:2.1380 train_time:597196ms step_avg:119.44ms
step:5000/20000 val_loss:2.0730 val_bpb:1.2278 train_time:597227ms step_avg:119.45ms
step:5026/20000 val_loss:2.0715 val_bpb:1.2268 train_time:599996ms step_avg:119.38ms
stopping_early: wallclock_cap train_time:599996ms step:5026/20000
peak memory: 23774 MiB
Serialized model zlib-9: 15834135 bytes
Total submission size: 15878267 bytes
final_ternary_roundtrip val_loss:2.0720 val_bpb:1.2271 eval_time:3525ms
final_ternary_roundtrip_exact val_loss:2.07197067 val_bpb:1.22713774
final_sliding_window val_loss:2.0147 val_bpb:1.1932 stride:64 eval_time:137776ms
final_sliding_window_exact val_loss:2.01474326 val_bpb:1.19324548