feat: U-Net Skip Connections + Drop-Muon optimization #94
batuhanozkose wants to merge 2 commits into Open-Superintelligence-Lab:main
This is not an apples-to-apples comparison, and that is a big deal when it comes to loss metrics. Cosine decay is the correct thing to do for training, but we want a sane constant for the small testing. This is one of the problems with evaluating this sort of thing: there are a lot of knobs that we make decisions on, and they are very messy. For some changes to be viable, these parameters need to be fuzzed. This is also extremely sensitive to the LR warmup and decay behaviors. We want to keep the warmup and decay behaviors as simple as possible, as these other hyperparameters (and adamw_LR) might need to be changed through fuzzing. If you look in the discussions, I ran a very long fuzzing run that yielded much stronger numbers for this, btw.

The actual forward pass changes look good.

These changes to the Muon optimizer are a potentially massive footgun of optimization pressure build-up and divergence. This is specific to what you're doing. You need to train this out to 100M tokens and show that it is stable. If it were me, I would run 1k steps and measure floating-point divergence over time to empirically prove that it won't destabilize training 1B tokens in.

This is still changing num_works in train_llm.py. This is bad unless 2 was just unstable.

You are still making optimizations to torch compile while making all these other changes. This is not an apples-to-apples comparison. This is 5 different things you're touching at once and trying to say it gave improvement. You may find that some of it is hurting performance; it has to be tested one at a time and PRed one at a time.
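To make the schedule point concrete, here is a minimal sketch of the two learning-rate behaviors being contrasted: linear warmup followed by either a constant rate or cosine decay. The function name `lr_at` and all of its parameters are hypothetical and are not taken from train_llm.py; the cosine branch decays to zero, which is one common choice and an assumption here.

```python
import math

def lr_at(step, base_lr, warmup_steps, total_steps, schedule="constant"):
    """Learning rate at a 0-indexed step, with linear warmup.

    schedule="constant": hold base_lr after warmup (simple, for short tests).
    schedule="cosine":   decay from base_lr to 0 over the remaining steps.
    All names here are illustrative, not the repository's API.
    """
    # Linear warmup: ramps up to base_lr at the last warmup step.
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    if schedule == "constant":
        return base_lr
    if schedule == "cosine":
        # Fraction of the post-warmup budget already consumed, clamped to 1.
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        progress = min(progress, 1.0)
        return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
    raise ValueError(f"unknown schedule: {schedule}")
```

Running the baseline and the candidate with the same `schedule`, `warmup_steps`, and `total_steps` removes the schedule as a confounder, so a difference in loss can be attributed to the change under test.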
What I changed
Added U-Net skip connections and Drop-Muon optimization.
Changes:
Step-by-step progression:
Each change was validated before moving on to the next. The PR shows the final state, but I have a full experiment log.
Benchmark Results (RTX 4090, 8M tokens, 7 eval milestones)