Releases: HomebrewML/HeavyBall
SCION + Split Optimizer
- SplitOptimizer, following Andreas Kirsch's research on continual learning (https://x.com/BlackHC/status/2001961535120568542)
- Approximate SCION
- Numpy 2.0.0 support
Configurable Division
By default, HeavyBall's division differs from the industry standard, potentially giving meaningfully different results for otherwise identical optimizer hyperparameters.
You can now set `heavyball.utils.default_division_backend` to one of:
- `eps_clamp`, HeavyBall's default (`x / y.clamp(min=eps)`)
- `eps_add`, the standard used by PyTorch, Optax, and others (`x / (y + eps)`)
- `atan2`, following Adam-Atan2 (`atan2(x / scale, y) * scale`) - may require `heavyball.utils.atan2_scale` to be adjusted to clamp to a different range of target values
- `nan_to_0`, resulting in `(x / y).nan_to_num(0, 0, 0)`
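In element-wise terms, the four backends compute the following. The snippet also shows switching the global backend; the exact values accepted by `default_division_backend` (plain strings here) are an assumption based on the names above:

```python
import torch
import heavyball.utils

x, y = torch.randn(4), torch.rand(4)
eps, scale = 1e-8, 1.0

out_eps_clamp = x / y.clamp(min=eps)           # eps_clamp: HeavyBall's default
out_eps_add   = x / (y + eps)                  # eps_add: PyTorch/Optax convention
out_atan2     = torch.atan2(x / scale, y) * scale  # atan2: Adam-Atan2 style, bounded output
out_nan_to_0  = (x / y).nan_to_num(0, 0, 0)    # nan_to_0: map NaN/±Inf to zero

# Switch the global backend (value format is an assumption, see the docs):
heavyball.utils.default_division_backend = "eps_add"
```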
NAdam + AdEMAMix
This release focuses on adding new optimizers:
- NAdam (following @tom-jod's research)
- AdEMAMix
- `SOAPNAdam` - SOAP with NAdam in the eigenbasis
- `SOAPAdEMAMix` - SOAP with AdEMAMix in the eigenbasis
Note that this changes the previous SOAP infrastructure. SOAP variants manually created for the previous version will not work out of the box, but can be trivially converted.
Bugfixes, Memory reduction, Save/Restore
- torch autocasts psgd's internal step from int64->fp64, which caused a mismatch in states before/after loading
- an unbounded lru_cache, used to speed up parameter accesses, may have kept parameters around indefinitely
- with psgd, caution=True and foreach=False, caution was only applied on the first parameter
- psgd quad with bf16 parameters tried to multiply a bf16 matrix with an fp32 matrix
- psgd's `preconditioner_update_probability` was ignored if set to 0, resulting in the default schedule being used
- not all optimizers were exposed in `__all__`
- psgd's scheduler was not stepped properly, causing the scheduler to remain at 100% precond update probability
- `msign`/`thinky_polar_express` (a zeroth-power backend) always returned fp32 tensors, where it should've been adaptive to the input dtype
- soap's initial eigh did not put the tensors back into their original dtype
- the built-in ema exited after the first empty parameter group, potentially skipping updates
- the built-in ema was updated once for each parameter group, causing different effective ema betas for different param group counts
- the finite-differences HVP did not divide by the epsilon scaling (see the sketch after this list)
- `pointwise_lr_adaptation` called `lr_adaptation` instead of `pointwise_lr_adaptation`
- fused_hook may have processed 1-tensor models incorrectly
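For context on the epsilon-scaling fix above: a finite-differences Hessian-vector product approximates Hv as (∇f(x + εv) − ∇f(x)) / ε, so omitting the final division by ε inflates the result. A minimal standalone sketch of the technique (not HeavyBall's internal code):

```python
import torch

def finite_difference_hvp(f, x, v, eps=1e-3):
    # Hv ≈ (grad f(x + eps * v) - grad f(x)) / eps; the final division by eps
    # is the scaling that the fix above restores.
    x0 = x.detach().requires_grad_(True)
    g0 = torch.autograd.grad(f(x0), x0)[0]
    x1 = (x.detach() + eps * v).requires_grad_(True)
    g1 = torch.autograd.grad(f(x1), x1)[0]
    return (g1 - g0) / eps

# f(x) = 0.5 * ||x||^2 has Hessian I, so the HVP should be close to v itself.
x, v = torch.randn(5), torch.randn(5)
print(finite_difference_hvp(lambda t: 0.5 * t.square().sum(), x, v))
```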
Stable Muon
Stability, Stability, Stability (and bug-fixes)
- Higher numerical stability in Newton-Schulz orthogonalization (affects: Muon)
- Higher accuracy in SVD computation (affects: PSGD)
- Advanced checkpointing (affects: old checkpoints, SOAP and PSGD)
- Reworked chainable backend, allowing more freedom in function composition (affects: custom optimizers)
For the full release notes and migration instructions, see here
- Benchmark
- Optimizers
- AdamC implemented (#65, by @Ryu1845 & #77, by @drexalt)
- Fixed all clipping algorithms with non-default clipping thresholds, added clipping tests (#73 & #74, by @alexjwilliams)
- More accurate NS5 iterations, backporting Muon research (#69, by @xTimeCrystal)
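Both the stability work and the NS5 backport concern the quintic Newton-Schulz iteration that Muon-style optimizers use to orthogonalize updates. A generic sketch using the coefficients from the widely used public Muon implementation (HeavyBall's exact constants and dtype handling may differ):

```python
import torch

def ns5_orthogonalize(g: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    # Push a matrix toward its nearest semi-orthogonal matrix via the quintic
    # Newton-Schulz iteration X <- a*X + (b*A + c*A@A) @ X with A = X @ X^T.
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (g.norm() + eps)           # scale so the iteration converges
    transposed = x.shape[0] > x.shape[1]
    if transposed:                     # iterate on the smaller side for cheaper matmuls
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    return x.T if transposed else x

u = ns5_orthogonalize(torch.randn(128, 512))
print(u @ u.T)  # approximately the identity
```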
Fixed SOAP, HVP PSGD
Bugfixes:
- @francois-rozet fixed a severe convergence regression in SOAP. It's now faster and converges better than before (#42)
- ADOPT now correctly matches the paper, significantly improving its convergence
- FP64 storage and/or computation now works for more optimizers
Improvements:
- NewtonPSGD now supports exact HVP calculation instead of the previous approximation. (Handles BatchNorm better but doesn't support all architectures.)
"smart_one_diag"is a next-to-no-downsidesmemory_save_modefor PSGD. It reduces memory and compute cost compared tomemory_save_mode=Noneand improves convergence compared tomemory_save_mode="one_diag"*
*Instead of preconditioning all dimensions (memory_save_mode=None) or preconditioning all but the largest dimension (memory_save_mode="one_diag") we remove the largest dimension iff it's larger than the second largest. So, a Linear(128, 1024) will now create one 128x128 preconditioner (instead of 128x128 + 1024x1024, 8x as large as the parameters), while a Linear(128, 128) can still benefit from preconditioning both sides.
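A usage sketch, assuming a PSGD Kron optimizer class such as `ForeachPSGDKron` exposing the `memory_save_mode` keyword (check the docs for the exact class and signature you use):

```python
import torch
import torch.nn as nn
import heavyball

model = nn.Sequential(nn.Linear(128, 1024), nn.ReLU(), nn.Linear(1024, 128))

# smart_one_diag: full preconditioners everywhere except the strictly largest
# dimension of each weight, which falls back to a diagonal preconditioner.
opt = heavyball.ForeachPSGDKron(model.parameters(), lr=1e-3,
                                memory_save_mode="smart_one_diag")

x = torch.randn(16, 128)
loss = model(x).square().mean()
loss.backward()
opt.step()
opt.zero_grad()
```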
OrthoGrad & PSGD improvements
- General
  - `precond_schedule` matches its docs (@francois-rozet, #31)
  - unified `warmup_steps` API (@francois-rozet, #32)
  - add `eps` arg to `scale_by_adam` (#33)
  - allow external management of LR (for `foreach=True` optimizers)
- OrthoGrad, a "grokking-first" optimizer that works (see the sketch after this list)
- PSGD
  - no more OOM in `torch.linalg.solve`
  - speed up cache by skipping it when it wouldn't give speedups
  - add newton-PSGD ("hvp-PSGD") using finite-difference approximation
  - caution momentum, not update (-> improved convergence; closer to paper's intention)
- Benchmarks
  - `grokking` benchmark, using modular addition and wide MLPs
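For the OrthoGrad item above: OrthoGrad projects each parameter's gradient onto the subspace orthogonal to the parameter itself and then restores the original gradient norm. A minimal standalone sketch of that projection (not HeavyBall's implementation):

```python
import torch

def orthograd_projection(w: torch.Tensor, g: torch.Tensor, eps: float = 1e-30) -> torch.Tensor:
    # Remove the component of the gradient parallel to the weights, then rescale
    # the result back to the original gradient norm.
    w_flat, g_flat = w.flatten(), g.flatten()
    proj = (w_flat @ g_flat) / (w_flat @ w_flat + eps)
    g_orth = g_flat - proj * w_flat
    g_orth = g_orth * (g_flat.norm() / (g_orth.norm() + eps))
    return g_orth.view_as(g)
```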
Fix PSGD, spring cleaning
- Previously, only the first parameter of PSGD was trained; this is now fixed
- All PSGDs were `PurePSGD` - now `momentum_into_precond_update` and `exp_avg_input` have their expected effect again
- preliminary support for external changes of `group['lr']`
