diff --git a/records/track_10min_16mb/2026-03-24_Negative_Results_Hardware_Alignment/README.md b/records/track_10min_16mb/2026-03-24_Negative_Results_Hardware_Alignment/README.md new file mode 100644 index 000000000..a365e23ed --- /dev/null +++ b/records/track_10min_16mb/2026-03-24_Negative_Results_Hardware_Alignment/README.md @@ -0,0 +1,64 @@ +# Non-Record: Negative Results — Hardware Alignment & Quantization on 8×H100 + +30+ experiments attempting to improve the 11L d=512 transformer beyond PR #593 (1.1171 BPB). Every kernel optimization, quantization trick, and architectural change that did NOT help. + +**Base:** PR #593 — 1.1171 BPB, 83ms/step, 7189 steps, Parallel Muon + Full GPTQ, 8×H100 SXM + +--- + +## Kernel-Level Optimization (All Dead) + +| Approach | Result | Why It Failed | +|----------|--------|---------------| +| CUTLASS SM90 TMA+WGMMA GEMM | 2.5× slower than cuBLAS | cuBLAS heuristics beat default CUTLASS for 98304×512×1536. Built a working kernel — correct results, wrong speed. | +| Fused Triton GEMM + LeakyReLU² | 1.82× faster fwd, **2.7× slower** fwd+bwd | `torch.autograd.Function` bypasses Inductor. Backward runs in eager mode, 2-3× slower than Inductor's auto-generated Triton backward. | +| `torch.library.triton_op` for GEMM | Compile error | FakeTensor can't provide `data_ptr()` — GEMM kernels incompatible with triton_op tracing. | +| Custom CUDA C++ fused activation | 6% slower | PyTorch's `vectorized_elementwise_kernel` is already highly optimized for pointwise ops. | +| Fused norm+residual (Triton) | Ties torch.compile exactly | 0.136ms ours vs 0.136ms Inductor-generated. torch.compile already fuses this pattern. | +| FP8 training (TransformerEngine) | No speedup (90 vs 89ms) | At d=512, attention GEMMs are already memory-bound (AI=170-255). FP8 doubles peak FLOPS but also doubles the ridge point, making MORE ops memory-bound. | +| QKV fusion (8Q/4KV GQA) | 3-17% slower | Fused (512→1024) GEMM is slightly faster, but splitting output into non-contiguous Q(512)/K(256)/V(256) tensors costs more than the GEMM savings. | + +**Conclusion:** torch.compile (PyTorch 2.9.1) already fuses CE+softcap+tanh, LeakyReLU²+residual, RMSNorm+backward, and all pointwise chains. cuBLAS is at the hardware limit for K=512 (~48% roofline, pipeline depth limitation). The 82ms step is 95%+ optimized. + +## torch.compile Gotchas + +| Issue | Impact | Mechanism | +|-------|--------|-----------| +| Late QAT recompilation | OOM with larger models | Flipping `_qat_enabled` mid-training changes the forward graph → torch.compile recompiles → memory spike exceeds 80GB | +| `torch.autograd.Function` | 2-3× slower backward | Custom Functions bypass Inductor entirely. Backward runs uncompiled eager Python ops. | +| H100 memory compression | 25-50% inflated benchmarks | Synthetic data (cudaMemset, BlockFillRandom, zeros) compresses in HBM hardware. Only `torch.randn` gives real numbers. | + +## Quantization Experiments (Diminishing Returns) + +| Approach | BPB | Delta | Why It Failed | +|----------|-----|-------|---------------| +| SpinQuant (Hadamard rotation before GPTQ) | 1.1151 | −0.0002 | GPTQ's actorder + Cholesky already handles outliers. Rotation adds little on top. Artifact slightly larger (rotated weights compress worse). | +| Mixed-precision int5/int8 per-layer | 1.1209 | +0.006 | int5 (31 levels) is too coarse. Boundary layers at int8 can't compensate for middle layers losing half their precision. | +| Soft-Round QAT (differentiable rounding) | 1.1151 | −0.0002 | `soft_round(x,T) = x + 0.5*tanh(T*(x-round(x)-0.5))/tanh(T/2)`. Identical to standard STE — the ~500 QAT steps aren't enough for the temperature annealing to have effect. | +| Selective ±1 pruning at 28-37% | 1.1198-1.1204 | +0.004-0.005 | Too aggressive. Only <10% pruning is loss-neutral. The #609 approach works because their base artifact is smaller (BigramHash 2048). | + +## Architecture & Training (All Negative) + +| Approach | BPB | Delta | Why It Failed | +|----------|-----|-------|---------------| +| XSA on all 11 layers (vs last 4) | worse at 100s | +0.014 | 2.9ms/step overhead from 7 additional XSA ops. In our Parallel Muon stack, the slower step time costs more than XSA gains. Works in #609's stack but not ours. | +| Value Residual Learning | 1.1179 | +0.0008 | VRL conflicts with VE128 in our stack — both inject identity information into deep attention layers. Redundant. | +| Gated Attention | 1.1197 | +0.0026 | 4% slower step time (86.7 vs 83ms). Per-head sigmoid gates add overhead that isn't compensated by quality improvement. | +| Weight decay 0.08 (vs 0.04) | 1.1235 | +0.008 | Better at 100s (2.191 vs 2.207 val_loss), WORSE at 600s. Over-regularization prevents learning fine-grained patterns during warmdown. **Early loss does not predict final post-quant BPB.** | +| Batch size 1M tokens | 1.1197 | +0.003 | Fewer steps (5,526 vs 7,189) hurt more than better gradients help. At this scale, step count dominates. | +| Train bigger d=576 + int5 | 1.1233 | +0.006 | 110ms/step = 24% fewer steps. Scaling law gain (~0.019 from more params) can't compensate for 1,700 fewer training steps. | +| Shard ordering (hard→easy) | 1.1162 | +0.0009 | Per-shard loss spread is only 0.024 (0.3% relative). Ordering disrupts natural data diversity, net negative. | +| Legal TTT (22 experiments) | 1.1177 best | +0.0006 | Score-first constraint means model adapts too late — early tokens get no benefit. 400-1600s eval time for zero or negative gain. | +| Hessian all-reduce across GPUs | 1.1169 | −0.0002 | 256 calibration batches per GPU already provide sufficient Hessian statistics. | + +## Meta-Lessons + +1. **The step is 95%+ optimized.** torch.compile handles all fusion, cuBLAS is at hardware limit, FA3 already in use. No kernel-level headroom. + +2. **H100 is massively overprovisioned** for this model. 21.5GB of 80GB GPU used. 99% of NVLink idle. The hardware constraints don't bind — the 16MB artifact limit does. + +3. **The competition is bits-per-parameter, not FLOPS-per-second.** The quantization gap (0.022 BPB) is 10× larger than any kernel optimization. Reducing it is the only path. + +4. **Stale processes from nohup+torchrun** accumulate silently, causing 2-3× performance degradation and false experimental results. Always verify `nvidia-smi` shows 0 MiB before experiments. + +5. **Early training loss direction doesn't predict final BPB.** WD=0.08 looks better at 100s but worse at 600s after warmdown + EMA + GPTQ. Fast A/B tests can filter out clearly bad ideas but cannot confirm good ones.