# Wavelet-Lite PR549 Parallel Muon

**val_bpb: 1.1483** | **15.86 MB** | **8x H100 80GB** | **90.24 ms/step**

This is a **non-record 10min/16MB submission**. It does not beat the current SOTA, but it is under the official cap and materially different from the merged wavelet- and routing-adjacent entries already in the repo.

## Idea

This submission takes the strong PR `#549` Parallel Muon frontier stack and adds one science-flavored architectural change: a tiny causal wavelet-lite mixer inside each residual block.

- The first `16` post-attention activation channels are split into low/high Haar-style bands using the current token and a one-token lagged copy.
- A learned low-band drift scale perturbs only the coarse band before the transform is folded back.
- To stay comfortably under the 16 MB cap, the run uses `BIGRAM_VOCAB_SIZE=1024` and disables TTT in the final budgeted training recipe.

This is a derivative frontier stack, but the added mechanism is architectural rather than a retune.
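The mixer described above can be sketched as a small PyTorch module. This is a minimal illustration of the mechanism, not the submission's code: the class and parameter names (`WaveletLiteMixer`, `low_drift`) are invented here, and the exact placement inside the residual block is assumed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WaveletLiteMixer(nn.Module):
    """Illustrative causal Haar-style mixer over the first `dim` channels.

    Splits the current token and a one-token lagged copy into low/high
    bands, perturbs only the coarse band with a learned drift scale,
    and folds the transform back into the residual stream.
    """
    def __init__(self, dim: int = 16, init: float = 0.25):
        super().__init__()
        self.dim = dim
        # learned drift scale, applied only to the coarse (low) band
        self.low_drift = nn.Parameter(torch.full((dim,), init))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, channels); only the first `dim` channels are mixed
        head, tail = x[..., :self.dim], x[..., self.dim:]
        # one-token lag, zero-padded at the sequence start (causal)
        lag = F.pad(head, (0, 0, 1, 0))[:, :-1, :]
        low = (head + lag) * 0.5    # coarse band
        high = (head - lag) * 0.5   # detail band
        low = low * (1.0 + self.low_drift)  # perturb only the coarse band
        # inverse Haar-style fold: current token = low + high
        return torch.cat([low + high, tail], dim=-1)
```

With the drift initialized to zero the module reduces to the identity, which makes the `WAVELET_INIT=0.25` knob in the run command a direct measure of how hard the coarse band is perturbed at initialization.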

## Why this is not duplicate work

Nearest prior PRs and exact differences:

1. [PR #549](https://github.com/openai/parameter-golf/pull/549) `LeakyReLU² + Legal Score-First TTT + Parallel Muon`
This is the closest parent stack. The present submission adds a new causal wavelet mixer inside the model, removes TTT from the final run, and trims the bigram table to fit the byte budget. It is not a rename or pure hyperparameter sweep of `#549`.
2. [PR #211](https://github.com/openai/parameter-golf/pull/211) `WaveletWeightedWidenet`
That work is a wavelet-branded widen/compress/VQ design. This submission does not widen the network or introduce VQ compression; it changes activation-space token mixing inside the residual stream.
3. [PR #632](https://github.com/openai/parameter-golf/pull/632) `Attention-Residuals`
That work changes depthwise residual routing over layer history. This submission leaves depth routing alone and injects local multiresolution token mixing inside each block.
4. [PR #507](https://github.com/openai/parameter-golf/pull/507) `11L U-Net + Catalytic + SwiGLU + SW64`
That work is built around U-Net-style skip structure. This submission adds no U-Net transport or skip gating.
5. [PR #530](https://github.com/openai/parameter-golf/pull/530) `Basis Block Interpolation`
That work reuses/interpolates depth blocks. This submission does not interpolate parameters across layers; it adds a fixed-form token transform inside each block.

## Final result

Final 8xH100 run:

- Commit synced to pod: `9c0eba6`
- `step:6648/9000 val_bpb=1.1409`
- `DIAGNOSTIC post_ema val_bpb=1.1400`
- `step_avg=90.24 ms/step`
- `peak memory allocated=22015 MiB`

Recovered final artifact:

- `final_model.int6.ptz`: `15,768,240` bytes
- Code size: `91,471` bytes
- Total submission size: `15,859,711` bytes
- Exact saved-artifact roundtrip: `val_bpb=1.14825550`

Why this counts as a solid entry:

- It is under the artifact cap by `140,289` bytes.
- It beats the three best merged local results on the fetched `upstream/main` tree (`1.15015359`, `1.1556`, `1.15744040`).
- Against `upstream/main` fetched on **March 25, 2026**, it would rank **7th** among merged non-null `track_10min_16mb` submission scores.
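The byte accounting above can be checked directly (taking the 16 MB cap as 16,000,000 decimal bytes, which is the reading consistent with the headroom figure):

```python
# Byte accounting for the final artifact, from the figures reported above.
ARTIFACT_BYTES = 15_768_240  # final_model.int6.ptz
CODE_BYTES = 91_471          # code size
CAP_BYTES = 16_000_000       # 16 MB cap, decimal megabytes assumed

total = ARTIFACT_BYTES + CODE_BYTES
headroom = CAP_BYTES - total
print(total, headroom)  # 15859711 140289
```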

## Throughput and artifact notes

- Full training used the canonical cached `sp1024` setup on `8x H100` with `MAX_WALLCLOCK_SECONDS=600`.
- The decisive systems trick was not model-side: logs and artifacts stayed on the MO-1 volume, but data and tokenizer were copied onto local `/workspace` NVMe before training.
- The training pod exported the full-precision checkpoint; the int6 artifact and exact roundtrip eval were then recovered from the saved checkpoint on a short-lived 1xH100 helper pod.
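The NVMe staging step can be sketched as follows. The `/workspace` destinations match the run command below, but the volume mount point (`/mnt/mo1`) and the helper function are assumptions for illustration only:

```python
# Hypothetical staging sketch: copy dataset and tokenizer from the network
# volume to local NVMe, while logs and artifacts keep writing to the volume.
import shutil
from pathlib import Path

def stage(src_root: Path, dst_root: Path, rel: str) -> Path:
    """Copy one file or directory tree from the volume to local disk."""
    src, dst = src_root / rel, dst_root / rel
    dst.parent.mkdir(parents=True, exist_ok=True)
    if src.is_dir():
        shutil.copytree(src, dst, dirs_exist_ok=True)
    else:
        shutil.copy2(src, dst)
    return dst

# VOLUME = Path("/mnt/mo1/parameter-golf/data")   # assumed mount point
# LOCAL = Path("/workspace/parameter-golf/data")  # matches DATA_PATH below
# stage(VOLUME, LOCAL, "datasets/fineweb10B_sp1024")
# stage(VOLUME, LOCAL, "tokenizers/fineweb_1024_bpe.model")
```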
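For intuition on the int6 roundtrip, a generic symmetric per-tensor quantize/dequantize pass might look like the following. The submission's actual quantization scheme is not shown in this writeup, so this is a sketch of the general technique only:

```python
# Generic int6 roundtrip sketch: symmetric, per-tensor, signed 6-bit
# range [-31, 31]. Not the submission's actual quantization code.
import torch

def int6_roundtrip(w: torch.Tensor) -> torch.Tensor:
    scale = w.abs().max().clamp(min=1e-12) / 31.0
    q = torch.round(w / scale).clamp(-31, 31)  # storable in 6 bits
    return q * scale
```

An "exact roundtrip eval" in this sense means evaluating the dequantized weights rather than the full-precision checkpoint, so the reported `val_bpb=1.14825550` is what the saved artifact itself achieves.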

## What failed along the way

- The first PR549 wavelet attempt (`8deb532`) hit `post_ema val_bpb=1.1439` but exceeded the size cap at `16,052,223` bytes (over by `52,223`) and also crashed in the quantized eval path.
- An ephemeral under-cap rerun proved the speed path (`~90.40 ms/step`), but the pod vanished before producing a durable artifact.
- A volume-only rerun was durable but too slow (`126.64 ms/step`) because it trained directly from the network volume.

## Files included here

- `train_gpt.py`: exact training script
- `final_model.int6.ptz`: final saved artifact
- `logs/train.log`: 8xH100 training log
- `logs/roundtrip_eval.log`: exact 1xH100 int6 roundtrip eval log
- `results.tsv`: local experiment ledger snapshot

## Run command

```bash
RUN_ID=pr549_wavelet_bigram1024_nottt_8xh100_nvme_seed1337 \
DATA_PATH=/workspace/parameter-golf/data/datasets/fineweb10B_sp1024/ \
TOKENIZER_PATH=/workspace/parameter-golf/data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
SEED=1337 \
NUM_LAYERS=11 \
BIGRAM_VOCAB_SIZE=1024 \
XSA_LAST_N=4 \
SWA_ENABLED=1 \
SWA_EVERY=50 \
ROPE_DIMS=16 \
LN_SCALE=1 \
LATE_QAT_THRESHOLD=0.15 \
VE_ENABLED=1 \
VE_DIM=128 \
VE_LAYERS=9,10 \
TTT_ENABLED=0 \
MUON_WD=0.04 \
ADAM_WD=0.04 \
MATRIX_LR=0.025 \
SCALAR_LR=0.025 \
TIED_EMBED_LR=0.035 \
MUON_MOMENTUM=0.99 \
MUON_MOMENTUM_WARMUP_START=0.92 \
MUON_MOMENTUM_WARMUP_STEPS=1500 \
WARMDOWN_ITERS=3500 \
ITERATIONS=9000 \
MAX_WALLCLOCK_SECONDS=600 \
EVAL_STRIDE=64 \
QAT_ENABLED=1 \
GATED_ATTENTION=1 \
VALUE_RESIDUAL=1 \
WAVELET_ENABLED=1 \
WAVELET_DIM=16 \
WAVELET_INIT=0.25 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```
*(Binary file `final_model.int6.ptz` not shown.)*

**`logs/roundtrip_eval.log`:**
{'stage': 'load_val_tokens', 'val_seq_len': 2048}
{'stage': 'build_luts'}
{'stage': 'load_states'}
{'stage': 'build_model'}
{'artifact_bytes': 15768240, 'code_bytes': 91471, 'total_bytes': 15859711}
final_int6_roundtrip val_loss:1.9388 val_bpb:1.1483 eval_time:48740ms
final_int6_roundtrip_exact val_loss:1.93878131 val_bpb:1.14825550
**`logs/train.log`:**
W0325 04:36:33.515000 376 torch/distributed/run.py:803]
W0325 04:36:33.515000 376 torch/distributed/run.py:803] *****************************************
W0325 04:36:33.515000 376 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0325 04:36:33.515000 376 torch/distributed/run.py:803] *****************************************
logs/pr549_wavelet_bigram1024_nottt_8xh100_nvme_seed1337.txt
val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=/workspace/parameter-golf/data/tokenizers/fineweb_1024_bpe.model
train_loader:dataset:fineweb10B_sp1024 train_shards:80
val_loader:shards pattern=/workspace/parameter-golf/data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
model_params:26908026
mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0
XSA:last_4 active_layers:[7, 8, 9, 10]
wavelet:enabled:True dim:16 init:0.250
world_size:8 grad_accum_steps:1
sdp_backends:cudnn=False flash=True mem_efficient=False math=False
attention_mode:gqa num_heads:8 num_kv_heads:4
tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025
train_batch_tokens:786432 train_seq_len:2048 iterations:9000 warmup_steps:20 max_wallclock_seconds:600.000
seed:1337
warmup_step:1/20
warmup_step:2/20
warmup_step:3/20
warmup_step:4/20
warmup_step:5/20
warmup_step:6/20
warmup_step:7/20
warmup_step:8/20
warmup_step:9/20
warmup_step:10/20
warmup_step:11/20
warmup_step:12/20
warmup_step:13/20
warmup_step:14/20
warmup_step:15/20
warmup_step:16/20
warmup_step:17/20
warmup_step:18/20
warmup_step:19/20
warmup_step:20/20
step:0/9000 val_loss:6.9282 val_bpb:4.1033 train_time:0ms step_avg:0.02ms
step:1/9000 train_loss:6.9303 train_time:143ms step_avg:142.64ms
step:2/9000 train_loss:8.4861 train_time:252ms step_avg:125.85ms
step:3/9000 train_loss:7.6009 train_time:375ms step_avg:125.14ms
step:4/9000 train_loss:7.2399 train_time:502ms step_avg:125.41ms
step:5/9000 train_loss:7.2152 train_time:627ms step_avg:125.43ms
step:6/9000 train_loss:7.1867 train_time:740ms step_avg:123.26ms
step:7/9000 train_loss:7.2223 train_time:847ms step_avg:120.99ms
step:8/9000 train_loss:7.1872 train_time:949ms step_avg:118.61ms
step:9/9000 train_loss:6.7762 train_time:1059ms step_avg:117.61ms
step:10/9000 train_loss:6.3780 train_time:1169ms step_avg:116.89ms
step:500/9000 train_loss:2.3737 train_time:45085ms step_avg:90.17ms
step:1000/9000 train_loss:2.2574 train_time:90025ms step_avg:90.02ms
step:1500/9000 train_loss:2.2091 train_time:134961ms step_avg:89.97ms
step:2000/9000 train_loss:2.0543 train_time:179978ms step_avg:89.99ms
step:2500/9000 train_loss:2.1546 train_time:225071ms step_avg:90.03ms
step:3000/9000 train_loss:2.1506 train_time:270155ms step_avg:90.05ms
step:3500/9000 train_loss:2.1658 train_time:315238ms step_avg:90.07ms
step:4000/9000 train_loss:1.9544 train_time:360297ms step_avg:90.07ms
step:4000/9000 val_loss:2.0443 val_bpb:1.2108 train_time:360398ms step_avg:90.10ms
step:4500/9000 train_loss:2.1037 train_time:405386ms step_avg:90.09ms
step:5000/9000 train_loss:2.0855 train_time:450435ms step_avg:90.09ms
step:5500/9000 train_loss:2.0002 train_time:495527ms step_avg:90.10ms
swa:start step:6000
step:6000/9000 train_loss:1.9221 train_time:540557ms step_avg:90.09ms
step:6500/9000 train_loss:2.0588 train_time:586379ms step_avg:90.21ms
step:6648/9000 val_loss:1.9263 val_bpb:1.1409 train_time:599929ms step_avg:90.24ms
stopping_early: wallclock_cap train_time:599929ms step:6648/9000
peak memory allocated: 22015 MiB reserved: 22758 MiB
ema:applying EMA weights
DIAGNOSTIC post_ema val_loss:1.9249 val_bpb:1.1400 eval_time:2099ms
Serialized model: 106091301 bytes
Code size: 91471 bytes
**`results.tsv`:**
commit val_bpb artifact_mb ms_per_step status description
047f65f 1.328871 13.840514 435.58 keep untouched baseline on 1x H100 Runpod stage A (full cached sp1024, wallclock-capped at step 1378)
eacd422 0.000000 0.0 0.0 crash tiny hyper-connections first run crashed at warmup with DDP unused-parameter error
69f8121 1.547943 9.491089 1488.98 discard tiny hyper-connections rerun: valid artifact but 3.42x slower and much worse BPB than untouched baseline
6f0dbdd 1.331192 13.384367 433.22 keep wavelet-lite full 1xH100 run: near-baseline quality with slightly smaller artifact and comparable throughput
39f9171 1.334701 13.470763 455.21 discard late-wavelet decoder-half schedule: slightly worse BPB than all-layer wavelet with no throughput win
01c8861 1.320934 13.563465 407.70 keep wavelet-lite dim64 full 1xH100 run: beats untouched baseline while staying under 16MB
bcdf894 1.207600 0.0 110.77 discard pr414 frontier wavelet on 8xH100: killed early after step4000 val 1.2076, far off top-ten pace
8deb532 1.143900 16.052223 94.54 discard pr549 frontier wavelet on 8xH100: post-EMA 1.1439 at step6347, but total size was 16052223 bytes (> cap by 52223) and quantized eval crashed from missing wavelet args in eval_model
f14f193 0.000000 0.0 0.0 crash pr549 wavelet bigram1024 no-TTT on ephemeral 8xH100: healthy pace to step3000 at 90.40ms/step, then pod vanished from Runpod control plane before first val checkpoint or final export
f14f193 0.000000 0.0 0.0 crash pr549 wavelet bigram1024 no-TTT on MO-1 8xH100 volume mount: log persisted, but step500 was only 126.64ms/step and the pod vanished before any mid-run val or final export
9c0eba6 1.148256 15.859711 90.24 keep pr549 wavelet bigram1024 no-TTT on MO-1 8xH100 with local NVMe data: post-EMA 1.1400 at step6648, exact int6 roundtrip 1.14825550, total size 15859711 bytes
**Submission metadata (JSON):**
{
"name": "Wavelet-Lite PR549 Parallel Muon",
"val_bpb": 1.1482555,
"bytes_total": 15859711,
"blurb": "PR549-derived Parallel Muon frontier stack plus a tiny causal wavelet-lite activation mixer, with BIGRAM_VOCAB_SIZE=1024 and no TTT in the final budgeted run. Exact saved int6 artifact roundtrip: val_bpb=1.1483 at 15.86 MB total.",
"author": "Omar Habra",
"github_id": "bro4all",
"date": "2026-03-24"
}