Implementations of deep learning optimizers using NumPy, including SGD, Adam, AdaGrad, NAG, RMSProp, and Momentum.
This repository contains implementations of popular optimization algorithms using NumPy.
Each optimizer is explained intuitively with the corresponding mathematical formulas.
I compared my NumPy implementations of Adam, AdaGrad, NAG, and RMSProp against PyTorch on two datasets:
- California Housing (Regression): NumPy ran roughly 4–11× faster than PyTorch, with comparable final losses.
- Fashion-MNIST (Classification): PyTorch was roughly 2× faster than NumPy across all four optimizers, while accuracy and loss trends matched closely (NAG was the exception, with the NumPy run reaching noticeably lower accuracy).
Takeaway: NumPy is more efficient for small regression problems, while PyTorch scales better on larger classification tasks.
| Optimizer | Task | NumPy Time | PyTorch Time | Speedup (PyTorch time ÷ NumPy time) | Accuracy / Final Loss (NumPy vs PyTorch) |
|---|---|---|---|---|---|
| Adam | Regression | 9.39s | 39.14s | 4.2× (NumPy faster) | Similar loss |
| Adam | Classification | 12.72s | 6.27s | 0.5× (PyTorch faster) | 98.3% vs 97.7% |
| AdaGrad | Regression | 3.21s | 35.61s | 11× (NumPy faster) | Similar loss |
| AdaGrad | Classification | 11.54s | 6.03s | 0.5× (PyTorch faster) | 96.0% vs 95.9% |
| NAG | Regression | 9.02s | 34.77s | 3.8× (NumPy faster) | Similar loss |
| NAG | Classification | 9.32s | 5.74s | 0.6× (PyTorch faster) | 89.0% vs 98.6% |
| RMSProp | Regression | 4.84s | 35.89s | 7.4× (NumPy faster) | Similar loss |
| RMSProp | Classification | 10.39s | 5.68s | 0.5× (PyTorch faster) | 97.8% vs 98.0% |
Update rule:
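For plain SGD, a minimal NumPy sketch of one step, assuming a parameter vector `theta`, the gradient `grad` of the loss on the current batch, and a learning rate `lr` (the names are illustrative and need not match the notebook's exact code):

```python
import numpy as np

def sgd_step(theta, grad, lr=0.01):
    """Vanilla SGD: theta <- theta - lr * grad (illustrative names, not the notebook's API)."""
    return theta - lr * grad

# One update on an example gradient
theta = np.zeros(3)
theta = sgd_step(theta, grad=np.array([0.5, -1.0, 2.0]), lr=0.1)
```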
Explanation:
Updates parameters by stepping in the opposite direction of the gradient computed on a mini-batch. Simple but sensitive to the choice of learning rate.
Important Note:
SGD, full-batch gradient descent, and mini-batch gradient descent are conceptually the same algorithm; they differ only in batch size:
- SGD: batch size = 1 (one update per training sample)
- Full-batch GD: batch size = number of training samples (one update per full pass over the dataset)
- Mini-batch GD: batch size somewhere in between (e.g., 2, 32, or 64)
Each choice trades off gradient noise against convergence speed, as the sketch below makes concrete.
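Here `batch_size` alone selects the regime: 1 gives SGD, `len(X)` gives full-batch GD, and anything in between gives mini-batch GD. The linear-model MSE gradient and the variable names are only for illustration:

```python
import numpy as np

def run_epoch(theta, X, y, lr=0.01, batch_size=32):
    """One pass over the data; batch_size controls how often theta is updated."""
    idx = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        # Gradient of the mean squared error for a linear model X @ theta
        grad = 2 * Xb.T @ (Xb @ theta - yb) / len(Xb)
        theta = theta - lr * grad
    return theta

X, y = np.random.randn(100, 3), np.random.randn(100)
theta = run_epoch(np.zeros(3), X, y, batch_size=1)   # batch_size = 1       -> SGD
theta = run_epoch(theta, X, y, batch_size=len(X))    # batch_size = len(X)  -> full-batch GD
```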
Further Reading:
🔗 CS231n Optimization Overview (Stanford)
Update rule:
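A minimal NumPy sketch of the Momentum step, assuming a velocity buffer `v` and momentum coefficient `beta` (often 0.9). Formulations differ slightly on whether the learning rate scales the gradient or the velocity, so treat this as one common variant rather than the notebook's exact code:

```python
import numpy as np

def momentum_step(theta, grad, v, lr=0.01, beta=0.9):
    """v <- beta * v + grad;  theta <- theta - lr * v  (illustrative names)."""
    v = beta * v + grad
    theta = theta - lr * v
    return theta, v

theta, v = np.zeros(3), np.zeros(3)
theta, v = momentum_step(theta, grad=np.array([0.5, -1.0, 2.0]), v=v, lr=0.1)
```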
Explanation:
Adds velocity to the update, accumulating past gradients to smooth out oscillations and speed up convergence.
Further Reading:
🔗 Distill.pub — Visualizing Momentum
Update rule:
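One common formulation of Nesterov momentum in a minimal sketch: the gradient is evaluated at the look-ahead point `theta - lr * beta * v` instead of at `theta`. The `grad_fn` callback and the coefficient names are assumptions for illustration only:

```python
import numpy as np

def nag_step(theta, grad_fn, v, lr=0.01, beta=0.9):
    """Evaluate the gradient at the look-ahead position, then apply a momentum update."""
    lookahead = theta - lr * beta * v   # where momentum alone would take us
    g = grad_fn(lookahead)              # gradient at the projected future position
    v = beta * v + g
    theta = theta - lr * v
    return theta, v

# Example with a toy quadratic loss 0.5 * ||theta||^2, whose gradient is theta itself
theta, v = np.ones(3), np.zeros(3)
theta, v = nag_step(theta, grad_fn=lambda t: t, v=v, lr=0.1)
```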
Explanation:
Improves upon Momentum by calculating the gradient at the projected future position: it "looks ahead" and slows down earlier, reducing overshooting.
Further Reading:
🔗 Gradient Descent With Nesterov Momentum From Scratch — Machine Learning Mastery
Update rule:
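A minimal NumPy sketch of the AdaGrad step: the accumulator `G` of squared gradients only grows, so each parameter's effective step size shrinks over time (`eps` guards against division by zero; the names are illustrative):

```python
import numpy as np

def adagrad_step(theta, grad, G, lr=0.01, eps=1e-8):
    """G <- G + grad**2;  theta <- theta - lr * grad / (sqrt(G) + eps)."""
    G = G + grad ** 2
    theta = theta - lr * grad / (np.sqrt(G) + eps)
    return theta, G

theta, G = np.zeros(3), np.zeros(3)
theta, G = adagrad_step(theta, grad=np.array([0.5, -1.0, 2.0]), G=G, lr=0.1)
```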
Explanation:
Adapts the learning rate per parameter based on the cumulative sum of squared gradients, giving relatively larger steps to infrequently updated parameters. Because the sum only grows, the effective learning rates shrink over time.
Further Reading:
🔗 An overview of gradient descent optimization algorithms — Ruder
Update rule:
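A minimal NumPy sketch of the RMSProp step, assuming decay factor `beta` (often 0.9); the exponentially weighted average replaces AdaGrad's ever-growing sum, and the names are illustrative:

```python
import numpy as np

def rmsprop_step(theta, grad, avg_sq, lr=0.001, beta=0.9, eps=1e-8):
    """avg_sq <- beta * avg_sq + (1 - beta) * grad**2;  theta <- theta - lr * grad / (sqrt(avg_sq) + eps)."""
    avg_sq = beta * avg_sq + (1 - beta) * grad ** 2
    theta = theta - lr * grad / (np.sqrt(avg_sq) + eps)
    return theta, avg_sq

theta, avg_sq = np.zeros(3), np.zeros(3)
theta, avg_sq = rmsprop_step(theta, grad=np.array([0.5, -1.0, 2.0]), avg_sq=avg_sq)
```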
Explanation:
Uses an exponentially weighted moving average of squared gradients to fix AdaGrad's rapid decay of learning rates, stabilizing training.
Further Reading:
🔗 Geoff Hinton's Coursera Lecture — RMSProp
Update rule:
Bias correction:
Final update:
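Taken together, a minimal NumPy sketch of the three steps above (moment estimates, bias correction, and the final parameter update), using the defaults suggested in the Adam paper (beta1 = 0.9, beta2 = 0.999, eps = 1e-8); `t` is the 1-based step count and all names are illustrative:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """First/second moment estimates, bias correction, then the parameter update (illustrative names)."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)              # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
theta, m, v = adam_step(theta, grad=np.array([0.5, -1.0, 2.0]), m=m, v=v, t=1)
```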
Explanation:
Combines Momentum and RMSProp by keeping track of both the first moment (mean) and second moment (variance) of gradients. It adapts learning rates per parameter and smooths updates, making it very effective for many tasks.
Further Reading:
🔗 Adam Paper — "Adam: A Method for Stochastic Optimization"
- Each optimizer is implemented in a clear and commented notebook cell.
- Experiment with hyperparameters such as the learning rate and momentum factor for best results; a simple sweep is sketched below.
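For example, a quick learning-rate sweep with plain gradient descent on a toy quadratic loss (the loss, values, and ranges are only illustrative):

```python
import numpy as np

# Toy quadratic loss 0.5 * ||theta||^2; its gradient is simply theta
for lr in (0.001, 0.01, 0.1, 0.5):
    theta = np.ones(3)
    for _ in range(100):
        theta = theta - lr * theta
    print(f"lr={lr}: final loss = {0.5 * np.sum(theta ** 2):.6f}")
```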