optimizers-from-scratch

Implementations of deep learning optimizers using NumPy, including SGD, Adam, Adagrad, NAG, RMSProp, and Momentum.

Optimizers From Scratch in NumPy

This repository contains implementations of popular optimization algorithms using NumPy.
Each optimizer is explained intuitively with the corresponding mathematical formulas.


Benchmark Results

I compared my NumPy implementations of Adam, AdaGrad, NAG, and RMSProp against PyTorch on two datasets:

  • California Housing (Regression): NumPy ran roughly 4–11× faster (see the table below) with comparable final losses.
  • Fashion-MNIST (Classification): NumPy was competitive with PyTorch, sometimes faster (Adam, AdaGrad) and sometimes slightly slower (RMSProp, NAG). Accuracy and loss trends matched closely.

Takeaway: NumPy is more efficient for small regression problems, while PyTorch scales better on larger classification tasks.

Summary Table

| Optimizer | Task | NumPy Time | PyTorch Time | NumPy Speedup (faster framework) | Accuracy / Loss (NumPy vs PyTorch) |
|-----------|------|-----------|--------------|----------------------------------|-------------------------------------|
| Adam | Regression | 9.39s | 39.14s | 4.2× (NumPy) | Similar loss |
| Adam | Classification | 12.72s | 6.27s | 0.5× (PyTorch) | 98.3% vs 97.7% |
| AdaGrad | Regression | 3.21s | 35.61s | 11× (NumPy) | Similar loss |
| AdaGrad | Classification | 11.54s | 6.03s | 0.5× (PyTorch) | 96.0% vs 95.9% |
| NAG | Regression | 9.02s | 34.77s | 3.8× (NumPy) | Similar loss |
| NAG | Classification | 9.32s | 5.74s | 0.6× (PyTorch) | 89.0% vs 98.6% |
| RMSProp | Regression | 4.84s | 35.89s | 7.4× (NumPy) | Similar loss |
| RMSProp | Classification | 10.39s | 5.68s | 0.5× (PyTorch) | 97.8% vs 98.0% |

Optimizers

1. Stochastic Gradient Descent (SGD)

Update rule:

$$ \theta_{t+1} = \theta_t - \eta \cdot \nabla_{\theta} J(\theta_t) $$

Explanation:
Updates parameters by stepping in the opposite direction of the gradient computed on a mini-batch. Simple but sensitive to the choice of learning rate.

Important Note:
SGD, full-batch gradient descent, and mini-batch gradient descent are conceptually the same algorithm but differ in batch size:

  1. SGD: batch size = 1 (one update per sample)
  2. Full-batch GD: batch size = number of training samples (one update per full pass over the dataset)
  3. Mini-batch GD: batch size = a chosen mini-batch size (e.g., 2, 32, 64)

Each setting trades off gradient noise against convergence speed.
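
To make the update rule concrete, here is a minimal NumPy sketch of a single SGD step on a toy quadratic loss. The function name and the toy loss are illustrative, not code from the notebooks:

```python
import numpy as np

def sgd_update(theta, grad, lr=0.01):
    """One SGD step: move against the gradient, scaled by the learning rate."""
    return theta - lr * grad

# Toy loss J(θ) = 0.5 * ||θ||², whose gradient is simply θ.
theta = np.array([1.0, -2.0])
grad = theta                      # ∇J(θ) at the current parameters
theta = sgd_update(theta, grad, lr=0.1)
print(theta)                      # [ 0.9 -1.8]
```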

Further Reading:
🔗 CS231n Optimization Overview (Stanford)


2. Momentum

Update rule:

$$ v_t = \gamma v_{t-1} + \eta \cdot \nabla_{\theta} J(\theta_t) $$

$$ \theta_{t+1} = \theta_t - v_t $$

Explanation:
Adds velocity to the update, accumulating past gradients to smooth out oscillations and speed up convergence.
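
A minimal NumPy sketch of this rule, again on an illustrative quadratic loss (the function name and example are assumptions for this sketch, not the notebook code):

```python
import numpy as np

def momentum_update(theta, grad, velocity, lr=0.01, gamma=0.9):
    """One Momentum step: accumulate velocity, then step by it."""
    velocity = gamma * velocity + lr * grad   # v_t = γ v_{t-1} + η ∇J(θ_t)
    theta = theta - velocity                  # θ_{t+1} = θ_t - v_t
    return theta, velocity

# A few steps on the toy loss J(θ) = 0.5 * ||θ||², so ∇J(θ) = θ.
theta = np.array([1.0, -2.0])
velocity = np.zeros_like(theta)
for _ in range(3):
    theta, velocity = momentum_update(theta, theta, velocity, lr=0.1)
print(theta)
```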

Further Reading:
🔗 Distill.pub — Visualizing Momentum


3. Nesterov Accelerated Gradient (NAG)

Update rule:

$$ v_t = \gamma v_{t-1} + \eta \cdot \nabla_{\theta} J(\theta_t - \gamma v_{t-1}) $$

$$ \theta_{t+1} = \theta_t - v_t $$

Explanation:
Improves upon Momentum by calculating the gradient at the projected future position: it "looks ahead" and slows down earlier, reducing overshooting.
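
The only change from Momentum is where the gradient is evaluated. A minimal sketch, assuming a gradient function for a toy quadratic loss (names are illustrative):

```python
import numpy as np

def nag_update(theta, grad_fn, velocity, lr=0.01, gamma=0.9):
    """One NAG step: evaluate the gradient at the look-ahead point θ - γ v."""
    lookahead = theta - gamma * velocity               # projected future position
    velocity = gamma * velocity + lr * grad_fn(lookahead)
    theta = theta - velocity
    return theta, velocity

def grad_fn(t):
    return t   # ∇J(θ) = θ for the toy loss J(θ) = 0.5 * ||θ||²

theta = np.array([1.0, -2.0])
velocity = np.zeros_like(theta)
for _ in range(3):
    theta, velocity = nag_update(theta, grad_fn, velocity, lr=0.1)
print(theta)
```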

Further Reading:
🔗 Gradient Descent With Nesterov Momentum From Scratch — Machine Learning Mastery


4. AdaGrad

Update rule:

$$ r_t = r_{t-1} + \nabla_{\theta} J(\theta_t)^2 $$

$$ \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{r_t} + \epsilon} \cdot \nabla_{\theta} J(\theta_t) $$

Explanation:
Adapts the learning rate per parameter based on the cumulative sum of squared gradients, giving relatively larger steps to parameters with infrequent (sparse) gradients. Because the accumulator only grows, the effective learning rates shrink over time.
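
A minimal NumPy sketch of one AdaGrad step (illustrative names, toy gradient):

```python
import numpy as np

def adagrad_update(theta, grad, r, lr=0.01, eps=1e-8):
    """One AdaGrad step: scale the step per parameter by accumulated squared gradients."""
    r = r + grad ** 2                                 # r_t = r_{t-1} + (∇J)²
    theta = theta - lr / (np.sqrt(r) + eps) * grad    # per-parameter learning rate
    return theta, r

theta = np.array([1.0, -2.0])     # toy loss J(θ) = 0.5 * ||θ||², so grad = θ
r = np.zeros_like(theta)
theta, r = adagrad_update(theta, theta, r, lr=0.1)
print(theta, r)
```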

Further Reading:
🔗 An overview of gradient descent optimization algorithms — Ruder


5. RMSProp

Update rule:

$$ r_t = \rho r_{t-1} + (1 - \rho) \cdot \nabla_{\theta} J(\theta_t)^2 $$

$$ \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{r_t} + \epsilon} \cdot \nabla_{\theta} J(\theta_t) $$

Explanation:
Uses an exponentially weighted moving average of squared gradients to fix AdaGrad's rapid decay of learning rates, stabilizing training.
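
The sketch below mirrors AdaGrad but swaps the running sum for an exponential moving average (again, names and the toy gradient are illustrative):

```python
import numpy as np

def rmsprop_update(theta, grad, r, lr=0.001, rho=0.9, eps=1e-8):
    """One RMSProp step: exponentially weighted average of squared gradients."""
    r = rho * r + (1 - rho) * grad ** 2               # r_t = ρ r_{t-1} + (1-ρ)(∇J)²
    theta = theta - lr / (np.sqrt(r) + eps) * grad
    return theta, r

theta = np.array([1.0, -2.0])     # toy loss J(θ) = 0.5 * ||θ||², so grad = θ
r = np.zeros_like(theta)
theta, r = rmsprop_update(theta, theta, r, lr=0.01)
print(theta, r)
```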

Further Reading:
🔗 Geoff Hinton's Coursera Lecture — RMSProp


6. Adam (Adaptive Moment Estimation)

Update rule:

$$ m_t = \beta_1 m_{t-1} + (1 - \beta_1) \cdot \nabla_{\theta} J(\theta_t) $$

$$ v_t = \beta_2 v_{t-1} + (1 - \beta_2) \cdot \nabla_{\theta} J(\theta_t)^2 $$

Bias correction:

$$ \hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} $$

Final update:

$$ \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \cdot \hat{m}_t $$

Explanation:
Combines Momentum and RMSProp by tracking both the first moment (mean) and the second raw moment (uncentered variance) of the gradients. It adapts the learning rate per parameter and smooths updates, making it effective across a wide range of tasks.
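
A minimal NumPy sketch that puts the three pieces (moments, bias correction, update) together; the function name, defaults, and toy example are illustrative:

```python
import numpy as np

def adam_update(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step (t starts at 1 so the bias correction is well defined)."""
    m = beta1 * m + (1 - beta1) * grad          # first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr / (np.sqrt(v_hat) + eps) * m_hat
    return theta, m, v

theta = np.array([1.0, -2.0])     # toy loss J(θ) = 0.5 * ||θ||², so grad = θ
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 4):
    theta, m, v = adam_update(theta, theta, m, v, t, lr=0.1)
print(theta)
```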

Further Reading:
🔗 Adam Paper — "Adam: A Method for Stochastic Optimization"


Usage

  • Each optimizer is implemented in a clear, commented notebook cell.
  • Experiment with hyperparameters such as the learning rate and momentum factor for best results; a toy end-to-end example is sketched below.
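
The notebooks define everything inline; as a rough, hypothetical illustration of how the update rule from section 1 drives an end-to-end fit, here is a full-batch gradient-descent run on a made-up linear-regression problem (all names and data below are invented for this sketch):

```python
import numpy as np

# Hypothetical setup: fit y ≈ X @ w by minimising the mean squared error.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=100)

w = np.zeros(3)
lr = 0.1
for epoch in range(200):
    grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of the MSE with respect to w
    w = w - lr * grad                       # the update from section 1: θ ← θ - η ∇J(θ)
print(w)                                    # roughly [2.0, -1.0, 0.5]
```

Swapping in any of the other update functions above (Momentum, NAG, AdaGrad, RMSProp, Adam) only changes the two lines inside the loop plus the state they carry.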
