# Wavelet-Lite PR549 Parallel Muon

**val_bpb: 1.1483** | **15.86 MB** | **8x H100 80GB** | **90.24 ms/step**

This is a **non-record 10min/16MB submission**. It does not beat the current SOTA, but it is under the official cap and materially different from the merged wavelet- and routing-adjacent entries already in the repo.

## Idea

This submission takes the strong PR `#549` Parallel Muon frontier stack and adds one science-flavored architectural change: a tiny causal wavelet-lite mixer inside each residual block.

- The first `16` post-attention activation channels are split into low/high Haar-style bands using the current token and a one-token lagged copy.
- A learned low-band drift scale perturbs only the coarse band before the transform is folded back.
- To stay comfortably under the 16 MB cap, the run uses `BIGRAM_VOCAB_SIZE=1024` and disables TTT in the final budgeted training recipe.

This is a derivative frontier stack, but the added mechanism is architectural rather than a retune.
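The mixer described above can be sketched as a small PyTorch module. This is a minimal illustration of the mechanism, not the submission's code: the class and parameter names (`WaveletLiteMixer`, `low_drift`) are invented here, and the exact placement inside the residual block is assumed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WaveletLiteMixer(nn.Module):
    """Illustrative causal Haar-style mixer over the first `dim` channels.

    Splits the current token and a one-token lagged copy into low/high
    bands, perturbs only the coarse band with a learned drift scale,
    and folds the transform back into the residual stream.
    """
    def __init__(self, dim: int = 16, init: float = 0.25):
        super().__init__()
        self.dim = dim
        # learned drift scale, applied only to the coarse (low) band
        self.low_drift = nn.Parameter(torch.full((dim,), init))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, channels); only the first `dim` channels are mixed
        head, tail = x[..., :self.dim], x[..., self.dim:]
        # one-token lag, zero-padded at the sequence start (causal)
        lag = F.pad(head, (0, 0, 1, 0))[:, :-1, :]
        low = (head + lag) * 0.5    # coarse band
        high = (head - lag) * 0.5   # detail band
        low = low * (1.0 + self.low_drift)  # perturb only the coarse band
        # inverse Haar-style fold: current token = low + high
        return torch.cat([low + high, tail], dim=-1)
```

With the drift initialized to zero the module reduces to the identity, which makes the `WAVELET_INIT=0.25` knob in the run command a direct measure of how hard the coarse band is perturbed at initialization.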

## Why this is not duplicate work

Nearest prior PRs and exact differences:

1. [PR #549](https://github.com/openai/parameter-golf/pull/549) `LeakyReLU² + Legal Score-First TTT + Parallel Muon`
This is the closest parent stack. The present submission adds a new causal wavelet mixer inside the model, removes TTT from the final run, and trims the bigram table to fit the byte budget. It is not a rename or pure hyperparameter sweep of `#549`.
2. [PR #211](https://github.com/openai/parameter-golf/pull/211) `WaveletWeightedWidenet`
That work is a wavelet-branded widen/compress/VQ design. This submission does not widen the network or introduce VQ compression; it changes activation-space token mixing inside the residual stream.
3. [PR #632](https://github.com/openai/parameter-golf/pull/632) `Attention-Residuals`
That work changes depthwise residual routing over layer history. This submission leaves depth routing alone and injects local multiresolution token mixing inside each block.
4. [PR #507](https://github.com/openai/parameter-golf/pull/507) `11L U-Net + Catalytic + SwiGLU + SW64`
That work is built around U-Net-style skip structure. This submission adds no U-Net transport or skip gating.
5. [PR #530](https://github.com/openai/parameter-golf/pull/530) `Basis Block Interpolation`
That work reuses/interpolates depth blocks. This submission does not interpolate parameters across layers; it adds a fixed-form token transform inside each block.

## Final result

Final 8xH100 run:

- Commit synced to pod: `9c0eba6`
- `step:6648/9000 val_bpb=1.1409`
- `DIAGNOSTIC post_ema val_bpb=1.1400`
- `step_avg=90.24 ms/step`
- `peak memory allocated=22015 MiB`

Recovered final artifact:

- `final_model.int6.ptz`: `15,768,240` bytes
- Code size: `91,471` bytes
- Total submission size: `15,859,711` bytes
- Exact saved-artifact roundtrip: `val_bpb=1.14825550`

Why this counts as a solid entry:

- It is under the artifact cap by `140,289` bytes.
- It beats the three best merged local results on the fetched `upstream/main` tree (`1.15015359`, `1.1556`, `1.15744040`).
- Against `upstream/main` fetched on **March 25, 2026**, it would rank **7th** among merged non-null `track_10min_16mb` submission scores.
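The byte accounting above can be checked directly (taking the 16 MB cap as 16,000,000 decimal bytes, which is the reading consistent with the headroom figure):

```python
# Byte accounting for the final artifact, from the figures reported above.
ARTIFACT_BYTES = 15_768_240  # final_model.int6.ptz
CODE_BYTES = 91_471          # code size
CAP_BYTES = 16_000_000       # 16 MB cap, decimal megabytes assumed

total = ARTIFACT_BYTES + CODE_BYTES
headroom = CAP_BYTES - total
print(total, headroom)  # 15859711 140289
```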

## Throughput and artifact notes

- Full training used the canonical cached `sp1024` setup on `8x H100` with `MAX_WALLCLOCK_SECONDS=600`.
- The decisive systems trick was not model-side: logs and artifacts stayed on the MO-1 volume, but data and tokenizer were copied onto local `/workspace` NVMe before training.
- The training pod exported the full-precision checkpoint; the int6 artifact and exact roundtrip eval were then recovered from the saved checkpoint on a short-lived 1xH100 helper pod.
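The NVMe staging step can be sketched as follows. The `/workspace` destinations match the run command below, but the volume mount point (`/mnt/mo1`) and the helper function are assumptions for illustration only:

```python
# Hypothetical staging sketch: copy dataset and tokenizer from the network
# volume to local NVMe, while logs and artifacts keep writing to the volume.
import shutil
from pathlib import Path

def stage(src_root: Path, dst_root: Path, rel: str) -> Path:
    """Copy one file or directory tree from the volume to local disk."""
    src, dst = src_root / rel, dst_root / rel
    dst.parent.mkdir(parents=True, exist_ok=True)
    if src.is_dir():
        shutil.copytree(src, dst, dirs_exist_ok=True)
    else:
        shutil.copy2(src, dst)
    return dst

# VOLUME = Path("/mnt/mo1/parameter-golf/data")   # assumed mount point
# LOCAL = Path("/workspace/parameter-golf/data")  # matches DATA_PATH below
# stage(VOLUME, LOCAL, "datasets/fineweb10B_sp1024")
# stage(VOLUME, LOCAL, "tokenizers/fineweb_1024_bpe.model")
```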
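For intuition on the int6 roundtrip, a generic symmetric per-tensor quantize/dequantize pass might look like the following. The submission's actual quantization scheme is not shown in this writeup, so this is a sketch of the general technique only:

```python
# Generic int6 roundtrip sketch: symmetric, per-tensor, signed 6-bit
# range [-31, 31]. Not the submission's actual quantization code.
import torch

def int6_roundtrip(w: torch.Tensor) -> torch.Tensor:
    scale = w.abs().max().clamp(min=1e-12) / 31.0
    q = torch.round(w / scale).clamp(-31, 31)  # storable in 6 bits
    return q * scale
```

An "exact roundtrip eval" in this sense means evaluating the dequantized weights rather than the full-precision checkpoint, so the reported `val_bpb=1.14825550` is what the saved artifact itself achieves.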

## What failed along the way

- The first PR549 wavelet attempt (`8deb532`) hit `post_ema val_bpb=1.1439` but exceeded the size cap at `16,052,223` bytes (over by `52,223`) and also crashed in the quantized eval path.
- An ephemeral under-cap rerun proved the speed path (`~90.40 ms/step`), but the pod vanished before producing a durable artifact.
- A volume-only rerun was durable but too slow (`126.64 ms/step`) because it trained directly from the network volume.

## Files included here

- `train_gpt.py`: exact training script
- `final_model.int6.ptz`: final saved artifact
- `logs/train.log`: 8xH100 training log
- `logs/roundtrip_eval.log`: exact 1xH100 int6 roundtrip eval log
- `results.tsv`: local experiment ledger snapshot

## Run command

```bash
RUN_ID=pr549_wavelet_bigram1024_nottt_8xh100_nvme_seed1337 \
DATA_PATH=/workspace/parameter-golf/data/datasets/fineweb10B_sp1024/ \
TOKENIZER_PATH=/workspace/parameter-golf/data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
SEED=1337 \
NUM_LAYERS=11 \
BIGRAM_VOCAB_SIZE=1024 \
XSA_LAST_N=4 \
SWA_ENABLED=1 \
SWA_EVERY=50 \
ROPE_DIMS=16 \
LN_SCALE=1 \
LATE_QAT_THRESHOLD=0.15 \
VE_ENABLED=1 \
VE_DIM=128 \
VE_LAYERS=9,10 \
TTT_ENABLED=0 \
MUON_WD=0.04 \
ADAM_WD=0.04 \
MATRIX_LR=0.025 \
SCALAR_LR=0.025 \
TIED_EMBED_LR=0.035 \
MUON_MOMENTUM=0.99 \
MUON_MOMENTUM_WARMUP_START=0.92 \
MUON_MOMENTUM_WARMUP_STEPS=1500 \
WARMDOWN_ITERS=3500 \
ITERATIONS=9000 \
MAX_WALLCLOCK_SECONDS=600 \
EVAL_STRIDE=64 \
QAT_ENABLED=1 \
GATED_ATTENTION=1 \
VALUE_RESIDUAL=1 \
WAVELET_ENABLED=1 \
WAVELET_DIM=16 \
WAVELET_INIT=0.25 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```
*(Binary file `final_model.int6.ptz` not shown.)*

**`logs/roundtrip_eval.log`:**
{'stage': 'load_val_tokens', 'val_seq_len': 2048}
{'stage': 'build_luts'}
{'stage': 'load_states'}
{'stage': 'build_model'}
{'artifact_bytes': 15768240, 'code_bytes': 91471, 'total_bytes': 15859711}
final_int6_roundtrip val_loss:1.9388 val_bpb:1.1483 eval_time:48740ms
final_int6_roundtrip_exact val_loss:1.93878131 val_bpb:1.14825550
**`logs/train.log`:**
W0325 04:36:33.515000 376 torch/distributed/run.py:803]
W0325 04:36:33.515000 376 torch/distributed/run.py:803] *****************************************
W0325 04:36:33.515000 376 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0325 04:36:33.515000 376 torch/distributed/run.py:803] *****************************************
logs/pr549_wavelet_bigram1024_nottt_8xh100_nvme_seed1337.txt
val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=/workspace/parameter-golf/data/tokenizers/fineweb_1024_bpe.model
train_loader:dataset:fineweb10B_sp1024 train_shards:80
val_loader:shards pattern=/workspace/parameter-golf/data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
model_params:26908026
mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0
XSA:last_4 active_layers:[7, 8, 9, 10]
wavelet:enabled:True dim:16 init:0.250
world_size:8 grad_accum_steps:1
sdp_backends:cudnn=False flash=True mem_efficient=False math=False
attention_mode:gqa num_heads:8 num_kv_heads:4
tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025
train_batch_tokens:786432 train_seq_len:2048 iterations:9000 warmup_steps:20 max_wallclock_seconds:600.000
seed:1337
warmup_step:1/20
warmup_step:2/20
warmup_step:3/20
warmup_step:4/20
warmup_step:5/20
warmup_step:6/20
warmup_step:7/20
warmup_step:8/20
warmup_step:9/20
warmup_step:10/20
warmup_step:11/20
warmup_step:12/20
warmup_step:13/20
warmup_step:14/20
warmup_step:15/20
warmup_step:16/20
warmup_step:17/20
warmup_step:18/20
warmup_step:19/20
warmup_step:20/20
step:0/9000 val_loss:6.9282 val_bpb:4.1033 train_time:0ms step_avg:0.02ms
step:1/9000 train_loss:6.9303 train_time:143ms step_avg:142.64ms
step:2/9000 train_loss:8.4861 train_time:252ms step_avg:125.85ms
step:3/9000 train_loss:7.6009 train_time:375ms step_avg:125.14ms
step:4/9000 train_loss:7.2399 train_time:502ms step_avg:125.41ms
step:5/9000 train_loss:7.2152 train_time:627ms step_avg:125.43ms
step:6/9000 train_loss:7.1867 train_time:740ms step_avg:123.26ms
step:7/9000 train_loss:7.2223 train_time:847ms step_avg:120.99ms
step:8/9000 train_loss:7.1872 train_time:949ms step_avg:118.61ms
step:9/9000 train_loss:6.7762 train_time:1059ms step_avg:117.61ms
step:10/9000 train_loss:6.3780 train_time:1169ms step_avg:116.89ms
step:500/9000 train_loss:2.3737 train_time:45085ms step_avg:90.17ms
step:1000/9000 train_loss:2.2574 train_time:90025ms step_avg:90.02ms
step:1500/9000 train_loss:2.2091 train_time:134961ms step_avg:89.97ms
step:2000/9000 train_loss:2.0543 train_time:179978ms step_avg:89.99ms
step:2500/9000 train_loss:2.1546 train_time:225071ms step_avg:90.03ms
step:3000/9000 train_loss:2.1506 train_time:270155ms step_avg:90.05ms
step:3500/9000 train_loss:2.1658 train_time:315238ms step_avg:90.07ms
step:4000/9000 train_loss:1.9544 train_time:360297ms step_avg:90.07ms
step:4000/9000 val_loss:2.0443 val_bpb:1.2108 train_time:360398ms step_avg:90.10ms
step:4500/9000 train_loss:2.1037 train_time:405386ms step_avg:90.09ms
step:5000/9000 train_loss:2.0855 train_time:450435ms step_avg:90.09ms
step:5500/9000 train_loss:2.0002 train_time:495527ms step_avg:90.10ms
swa:start step:6000
step:6000/9000 train_loss:1.9221 train_time:540557ms step_avg:90.09ms
step:6500/9000 train_loss:2.0588 train_time:586379ms step_avg:90.21ms
step:6648/9000 val_loss:1.9263 val_bpb:1.1409 train_time:599929ms step_avg:90.24ms
stopping_early: wallclock_cap train_time:599929ms step:6648/9000
peak memory allocated: 22015 MiB reserved: 22758 MiB
ema:applying EMA weights
DIAGNOSTIC post_ema val_loss:1.9249 val_bpb:1.1400 eval_time:2099ms
Serialized model: 106091301 bytes
Code size: 91471 bytes
**`results.tsv`:**
commit val_bpb artifact_mb ms_per_step status description
047f65f 1.328871 13.840514 435.58 keep untouched baseline on 1x H100 Runpod stage A (full cached sp1024, wallclock-capped at step 1378)
eacd422 0.000000 0.0 0.0 crash tiny hyper-connections first run crashed at warmup with DDP unused-parameter error
69f8121 1.547943 9.491089 1488.98 discard tiny hyper-connections rerun: valid artifact but 3.42x slower and much worse BPB than untouched baseline
6f0dbdd 1.331192 13.384367 433.22 keep wavelet-lite full 1xH100 run: near-baseline quality with slightly smaller artifact and comparable throughput
39f9171 1.334701 13.470763 455.21 discard late-wavelet decoder-half schedule: slightly worse BPB than all-layer wavelet with no throughput win
01c8861 1.320934 13.563465 407.70 keep wavelet-lite dim64 full 1xH100 run: beats untouched baseline while staying under 16MB
bcdf894 1.207600 0.0 110.77 discard pr414 frontier wavelet on 8xH100: killed early after step4000 val 1.2076, far off top-ten pace
8deb532 1.143900 16.052223 94.54 discard pr549 frontier wavelet on 8xH100: post-EMA 1.1439 at step6347, but total size was 16052223 bytes (> cap by 52223) and quantized eval crashed from missing wavelet args in eval_model
f14f193 0.000000 0.0 0.0 crash pr549 wavelet bigram1024 no-TTT on ephemeral 8xH100: healthy pace to step3000 at 90.40ms/step, then pod vanished from Runpod control plane before first val checkpoint or final export
f14f193 0.000000 0.0 0.0 crash pr549 wavelet bigram1024 no-TTT on MO-1 8xH100 volume mount: log persisted, but step500 was only 126.64ms/step and the pod vanished before any mid-run val or final export
9c0eba6 1.148256 15.859711 90.24 keep pr549 wavelet bigram1024 no-TTT on MO-1 8xH100 with local NVMe data: post-EMA 1.1400 at step6648, exact int6 roundtrip 1.14825550, total size 15859711 bytes
**Submission metadata (JSON):**
{
"name": "Wavelet-Lite PR549 Parallel Muon",
"val_bpb": 1.1482555,
"bytes_total": 15859711,
"blurb": "PR549-derived Parallel Muon frontier stack plus a tiny causal wavelet-lite activation mixer, with BIGRAM_VOCAB_SIZE=1024 and no TTT in the final budgeted run. Exact saved int6 artifact roundtrip: val_bpb=1.1483 at 15.86 MB total.",
"author": "Omar Habra",
"github_id": "bro4all",
"date": "2026-03-24"
}