Releases: HomebrewML/HeavyBall

SCION + Split Optimizer

20 Dec 12:08

Configurable Division

15 Oct 21:19
182bea0

By default, HeavyBall's division differs from the industry standard, potentially giving meaningfully different results for otherwise identical optimizer hyperparameters.

You can now set heavyball.utils.default_division_backend to one of

  • eps_clamp, HeavyBall's default (x / y.clamp(min=eps))
  • eps_add, the standard used by PyTorch, Optax, and others (x / (y + eps))
  • atan2, following Adam-Atan2 (atan2(x / scale, y) * scale); may require adjusting heavyball.utils.atan2_scale to clamp to a different range of target values
  • nan_to_0, resulting in (x/y).nan_to_num(0, 0, 0)
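
A minimal sketch of switching the backend, assuming it is selected by assigning one of the names above to the module-level attribute as a string; check heavyball.utils for the exact accepted values:

```python
import heavyball.utils

# Match the PyTorch/Optax convention x / (y + eps) instead of the default clamp.
heavyball.utils.default_division_backend = "eps_add"

# The atan2 backend exposes an extra knob (name taken from the notes above);
# the value here is purely illustrative.
# heavyball.utils.atan2_scale = 16.0
```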

NAdam + AdEMAMix

12 Oct 21:25
599189c

This release focuses on adding new optimizers:

  • NAdam (following @tom-jod's research)
  • AdEMAMix
  • SOAPNAdam - SOAP with NAdam in the eigenbasis
  • SOAPAdEMAMix - SOAP with AdEMAMix in the eigenbasis

Note that this changes the previous SOAP infrastructure. SOAP variants manually created for the previous version will not work out of the box, but can be trivially converted.
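
A usage sketch, assuming the new optimizers are exposed on the top-level heavyball namespace with the usual params/lr constructor (class names taken from the list above; verify against the package's __all__):

```python
import torch
import heavyball

model = torch.nn.Linear(128, 128)

# SOAP variant that accumulates AdEMAMix statistics in the rotated eigenbasis.
opt = heavyball.SOAPAdEMAMix(model.parameters(), lr=1e-3)

for _ in range(10):
    loss = model(torch.randn(32, 128)).square().mean()
    loss.backward()
    opt.step()
    opt.zero_grad()
```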

Bugfixes, Memory reduction, Save/Restore

11 Oct 22:58
4e12207

  • torch autocasts psgd's internal step from int64->fp64, which caused a mismatch in states before/after loading (see the checkpointing sketch after this list)
  • an unbounded lru_cache, used to speed up parameter accesses, may have kept parameters around indefinitely
  • with psgd, caution=True and foreach=False, caution was only applied on the first parameter
  • psgd quad with bf16 parameters tried to multiply a bf16 matrix with an fp32 matrix
  • psgd's preconditioner_update_probability was ignored if set to 0 - resulting in the default schedule being used
  • not all optimizers were exposed in __all__
  • psgd's scheduler was not stepped properly, causing the scheduler to remain at 100% precond update probability
  • msign/thinky_polar_express (a zeroth-power backend) always returned fp32 tensors, where it should've been adaptive to the input dtype
  • soap's initial eigh did not put the tensors back into their original dtype
  • the built-in ema exited after the first empty parameter group, potentially skipping updates
  • the built-in ema was updated once for each parameter group, causing different effective ema betas for different param group counts
  • finite differences hvp did not divide by the epsilon scaling
  • pointwise_lr_adaptation called lr_adaptation - not pointwise_lr_adaptation
  • fused_hook may have processed 1-tensor models incorrectly
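
Several of the fixes above (the int64 step counter, the unbounded lru_cache) concern checkpoint round-trips. A sketch of that round-trip, assuming HeavyBall optimizers follow the standard torch.optim.Optimizer state_dict()/load_state_dict() contract and that ForeachSOAP is one of the exported classes:

```python
import torch
import heavyball

model = torch.nn.Linear(32, 32)
opt = heavyball.ForeachSOAP(model.parameters(), lr=1e-3)  # class name assumed; any HeavyBall optimizer works

# ... train for a while, then checkpoint both model and optimizer state
torch.save({"model": model.state_dict(), "opt": opt.state_dict()}, "ckpt.pt")

# later: rebuild identical model/optimizer objects and restore
ckpt = torch.load("ckpt.pt")
model.load_state_dict(ckpt["model"])
opt.load_state_dict(ckpt["opt"])
```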

Stable Muon

09 Oct 12:49
f7d7476

#78, #79 from @sozforex fix Muon's norm calculation and backport Polar Express' NewtonSchulz-10 iteration

Stability, Stability, Stability (and bug-fixes)

21 Sep 16:28

  • Higher numerical stability in Newton-Schulz orthogonalization (affects: Muon); a generic sketch follows this list
  • Higher accuracy in SVD computation (affects: PSGD)
  • Advanced checkpointing (affects: old checkpoints, SOAP and PSGD)
  • Reworked chainable backend, allowing more freedom in function composition (affects: custom optimizers)
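
For orientation, here is the textbook cubic Newton-Schulz iteration that this kind of orthogonalization builds on. Muon and HeavyBall use tuned higher-order polynomials (e.g. the Polar Express coefficients), so this is a generic sketch rather than the library's routine:

```python
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 10) -> torch.Tensor:
    """Cubic Newton-Schulz iteration towards the nearest (semi-)orthogonal matrix."""
    x = g / (g.norm() + 1e-7)          # normalize so all singular values lie in (0, 1]
    transposed = x.shape[0] > x.shape[1]
    if transposed:                     # iterate on the wider orientation: smaller x @ x.T
        x = x.T
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T) @ x   # X <- (3 X - X X^T X) / 2
    return x.T if transposed else x
```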

For the full release notes and migration instructions, see here


  • Benchmark
    • extended to real-world tasks (#76 & #75, by @tom-jod)
    • improved hyperparameter tuning and added the newly required dependencies to the requirements (#70, by @MirzaSamad20)
    • fixed dependencies and configurations (#60, by @tom-jod)
  • Optimizers

Fixed SOAP, HVP PSGD

08 Mar 16:49
388190f

Bugfixes:

  • @francois-rozet fixed a severe convergence regression in SOAP. It's now faster and converges better than before (#42)
  • ADOPT now correctly matches the paper, significantly improving its convergence
  • FP64 storage and/or computation now works for more optimizers

Improvements:

  • NewtonPSGD now supports exact HVP calculation instead of the previous approximation. (Handles BatchNorm better but doesn't support all architectures.)
  • "smart_one_diag" is a next-to-no-downsides memory_save_mode for PSGD. It reduces memory and compute cost compared to memory_save_mode=None and improves convergence compared to memory_save_mode="one_diag"*

*Instead of preconditioning all dimensions (memory_save_mode=None) or preconditioning all but the largest dimension (memory_save_mode="one_diag") we remove the largest dimension iff it's larger than the second largest. So, a Linear(128, 1024) will now create one 128x128 preconditioner (instead of 128x128 + 1024x1024, 8x as large as the parameters), while a Linear(128, 128) can still benefit from preconditioning both sides.
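
The selection rule reads cleanly as code; this is an illustrative re-implementation of the footnote above, not HeavyBall's internals (the function name and return convention are ours):

```python
def preconditioned_dims(shape, memory_save_mode):
    """Indices of the dimensions that receive a dense preconditioner."""
    dims = list(range(len(shape)))
    if memory_save_mode is None:
        return dims                                   # precondition every dimension
    largest = max(dims, key=lambda d: shape[d])
    if memory_save_mode == "one_diag":
        return [d for d in dims if d != largest]      # always drop the largest dimension
    if memory_save_mode == "smart_one_diag":
        sizes = sorted(shape, reverse=True)
        if len(sizes) > 1 and sizes[0] > sizes[1]:    # drop it only if strictly largest
            return [d for d in dims if d != largest]
        return dims
    raise ValueError(f"unknown memory_save_mode: {memory_save_mode}")

print(preconditioned_dims((1024, 128), "smart_one_diag"))  # [1]: one 128x128 preconditioner
print(preconditioned_dims((128, 128), "smart_one_diag"))   # [0, 1]: both sides preconditioned
```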

OrthoGrad & PSGD improvements

18 Jan 07:57
512ffd0

  • General
    • precond_schedule matches its docs (@francois-rozet, #31)
    • unified warmup_steps API (@francois-rozet, #32)
    • add eps arg to scale_by_adam (#33)
    • allow external management of LR (for foreach=True optimizers)
  • OrthoGrad, a "grokking-first" optimizer that works (a sketch of the projection follows this list)
  • PSGD
    • no more OOM in torch.linalg.solve
    • speed up cache by skipping it when it wouldn't give speedups
    • add newton-PSGD ("hvp-PSGD") using finite-difference approximation
    • caution momentum, not update (-> improved convergence; closer to paper's intention)
  • Benchmarks
    • grokking benchmark, using modular addition and wide MLPs
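
The projection underlying OrthoGrad-style updates (remove the gradient component parallel to the weights, then restore the gradient norm) is easy to state. This is a generic sketch of that idea, not HeavyBall's implementation:

```python
import torch

def orthogonal_grad(param: torch.Tensor, grad: torch.Tensor, eps: float = 1e-30) -> torch.Tensor:
    """Remove the component of `grad` parallel to `param`, then rescale to the original norm."""
    w, g = param.flatten(), grad.flatten()
    g_orth = g - (torch.dot(w, g) / (torch.dot(w, w) + eps)) * w
    g_orth = g_orth * (g.norm() / (g_orth.norm() + eps))   # keep the effective step size
    return g_orth.view_as(grad)
```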

Fix PSGD, spring cleaning

01 Jan 15:50
0519edb

  • Previously, only the first parameter of PSGD was trained; this is now fixed
  • All PSGD variants effectively behaved like PurePSGD; momentum_into_precond_update and exp_avg_input now have their expected effect again
  • preliminary support for external changes of group['lr']
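
A sketch of what external changes of group['lr'] enable, using only the standard torch.optim param_group interface (the optimizer class name is assumed; substitute any HeavyBall optimizer):

```python
import torch
import heavyball

model = torch.nn.Linear(16, 16)
opt = heavyball.ForeachPSGDKron(model.parameters(), lr=1e-3)  # class name assumed

for step in range(100):
    # hand-rolled linear warmup over the first 10 steps, driven from outside the optimizer
    for group in opt.param_groups:
        group["lr"] = 1e-3 * min(1.0, (step + 1) / 10)
    loss = model(torch.randn(8, 16)).square().mean()
    loss.backward()
    opt.step()
    opt.zero_grad()
```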

v1.3.0

18 Dec 17:54
9a20be2

  • fixes: in 1.2.x (not 1.1.x), all optimizers silently fell back to SGD; AdamW now runs AdamW again
  • heavyball.utils.disable_caution_scaling implements the behavior documented here
  • SOAP converges well again