
[WIP] Non-record: Local Ablation Pipeline — EMA + Int6 + Partial RoPE (GTX 1650)#682

Open
gthgomez wants to merge 1 commit into openai:main from gthgomez:submission/local-ablation-gtx1650

Conversation


@gthgomez gthgomez commented Mar 25, 2026

Summary

This is a non-record local validation submission intended to document implementation and ablation results on constrained hardware. It is not a leaderboard attempt.

Track: track_non_record_16mb — dev hardware, not competition-scale
Author: Jonathan Gomez (gthgomez)
Hardware: NVIDIA GTX 1650 (4 GB VRAM, SM 7.5, Turing, Windows 11)

Folder: records/track_non_record_16mb/2026-03-25_LocalAblation_GTX1650_EMA_Int6_PartialRoPE/

What this includes

Features ported from leaderboard entries #1 (1.1233 bpb) and #2 (1.1248 bpb) and validated via 200-step ablation runs on local hardware:

  • GTX 1650 compatibility patches — NO_COMPILE, math SDP fallback, MAX_VAL_SEQS cap. All patches are env-var-gated and inert on H100 hardware.
  • EMA (EMA_DECAY env var) — 0.997 is the intended competition-scale setting based on top public entries; 0.97 was used locally to confirm the implementation is correct (0.167 bpb improvement over the live model in a 200-step test).
  • Int6 clip-search quantizer — 5-percentile per-row search, values in [-31, 31], inline A/B comparison at export. Local result: 6.7 MB vs 11.0 MB int8 at +0.005 bpb cost. The reduction appears to come from lower dynamic range and increased weight regularity, which improves entropy coding efficiency under zlib. All measurements use the same export and compression path.
  • Partial RoPE (ROPE_DIMS=16) — rotate first 16/64 head dims, passthrough the rest.
  • LN Scale (LN_SCALE=1) — 1/sqrt(layer_idx+1) applied to attn+mlp norms per block.
  • Muon decoupled weight decay (MUON_WD=0.04) + AdamW (ADAM_WD=0.04) for tok/scalar optimizers.
  • MLP_MULT float support — enables MLP_MULT=3.0 (entry #1 config).
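
The EMA bullet above amounts to keeping a decayed shadow copy of the weights and evaluating with it. A minimal sketch; the flat-dict interface and parameter names are illustrative, not this submission's actual API:

```python
def ema_update(ema_params, live_params, decay=0.97):
    """In-place EMA tracking: shadow <- decay * shadow + (1 - decay) * live.

    ema_params / live_params: dicts mapping parameter name -> value
    (scalars here for illustration; tensors in a real training loop).
    decay=0.97 mirrors the local setting; 0.997 is the competition-scale value.
    """
    for name, w in live_params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * w
    return ema_params
```

At eval time the shadow weights are loaded in place of the live ones, which is where the EMA-bpb column in the results table comes from.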
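
The int6 clip-search quantizer can be sketched as a per-row search over candidate clip thresholds, keeping the one with the lowest round-trip error. The five-point percentile grid and the MSE criterion below are assumptions for illustration, not the submission's exact search:

```python
import numpy as np

def quantize_int6_row(row, percentiles=(98.0, 99.0, 99.5, 99.9, 100.0)):
    """Per-row clip search: try 5 candidate clip points (illustrative grid),
    quantize to the 6-bit signed range [-31, 31], keep the lowest-MSE one.

    Returns (codes, scale) so the row reconstructs as codes * scale.
    Assumes the row is not all zeros.
    """
    best = None
    for p in percentiles:
        clip = np.percentile(np.abs(row), p)
        if clip == 0.0:
            continue
        scale = clip / 31.0
        codes = np.clip(np.round(row / scale), -31, 31)
        err = np.mean((codes * scale - row) ** 2)
        if best is None or err < best[0]:
            best = (err, codes.astype(np.int8), scale)
    return best[1], best[2]
```

Clipping outliers narrows the dynamic range and makes the code distribution more regular, which is consistent with the observation above that the int6 export compresses better under zlib.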
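
Partial RoPE as in the ROPE_DIMS=16 bullet rotates only the first 16 of the 64 head dims and passes the rest through unchanged. A sketch, assuming the standard RoPE pairing and base frequency (the submission's exact layout may differ):

```python
import numpy as np

def partial_rope(x, rope_dims=16, base=10000.0):
    """x: (seq, head_dim) array. Rotate the first rope_dims dims by
    position-dependent angles; the remaining dims pass through untouched."""
    seq, _ = x.shape
    half = rope_dims // 2
    # Geometric frequency ladder over the rotated dims (standard RoPE form).
    inv_freq = base ** (-np.arange(half) / half)
    angles = np.arange(seq)[:, None] * inv_freq[None, :]   # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rope_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, x[:, rope_dims:]], axis=1)
```

Since rotation is norm-preserving and position 0 has zero angle, the first row and the last 48 dims of every row come back unchanged, which makes the passthrough easy to unit-test.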
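
Decoupled weight decay, as in the MUON_WD/ADAM_WD bullet, shrinks the weights directly rather than folding the decay into the gradient (AdamW-style). A minimal sketch; the flat-dict interface is illustrative:

```python
def decoupled_weight_decay(params, lr, wd=0.04):
    """Apply w <- w - lr * wd * w per parameter, separate from the
    optimizer's gradient step. wd=0.04 mirrors the MUON_WD/ADAM_WD
    settings above; params maps name -> value (scalars for illustration)."""
    for name in params:
        params[name] -= lr * wd * params[name]
    return params
```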

Local ablation results (200 steps, 9L, GTX 1650)

| Run | Live bpb | EMA bpb | int8 size | int6 size |
|---|---|---|---|---|
| Baseline | 2.6964 | — | ~11.1 MB | ~7.0 MB |
| EMA_DECAY=0.97 | 2.6333 | 2.4661 | 11.1 MB | — |
| Partial RoPE + LN Scale | 2.6845 | — | 11.2 MB | 7.0 MB |
| All features combined | 2.6845 | 2.5273 | 11.0 MB | 6.7 MB |

What is NOT in this submission

  • Full competition-scale training (11L, MLP_MULT=3.0, seq_len=2048, 7000 steps) — pending 8×H100 access
  • XSA (Exclusive Self-Attention)
  • VE (Value Embedding) — not yet implemented in this script
  • Sliding-window eval (stride-64)

This draft will be updated with competition-scale results once compute is available, or superseded by a ranked submission.

gthgomez force-pushed the submission/local-ablation-gtx1650 branch from d0dcb4c to 6b14fe3 on March 25, 2026 at 05:41
records/track_non_record_16mb/2026-03-25_LocalAblation_GTX1650_EMA_Int6_PartialRoPE/

Dev-hardware (GTX 1650, SM 7.5, 4 GB VRAM, Windows 11) pipeline porting
proven techniques from leaderboard entries openai#1 and openai#2 via 200-step local
ablation runs. Features implemented and validated:

- NO_COMPILE + math SDP fallback + MAX_VAL_SEQS (GTX 1650 compat, inert on H100)
- EMA (decay sweep: 0.997 for competition-scale, 0.97 validated locally)
- int6 clip-search quantizer + in-process A/B comparison
- Partial RoPE (ROPE_DIMS=16) + LN Scale 1/sqrt(layer+1)
- Muon decoupled weight decay (MUON_WD) + AdamW for tok/scalar
- MLP_MULT float support (enables MLP_MULT=3.0)

Best local result: val_bpb 2.5273 (int8 roundtrip, combined config, 200 steps)
Not a leaderboard attempt. Pending: full 11L competition run on 8xH100.
gthgomez force-pushed the submission/local-ablation-gtx1650 branch from 6b14fe3 to 8f16ebf on March 25, 2026 at 05:44
gthgomez marked this pull request as ready for review on March 25, 2026 at 05:48