# Implementing Adagrad Optimizer

## Introduction
Adagrad (Adaptive Gradient Algorithm) is an optimization algorithm that adapts the learning rate to each parameter, performing larger updates for infrequently updated parameters and smaller updates for frequently updated ones. This makes it particularly well suited to sparse data.

## Learning Objectives
- Understand how the Adagrad optimizer works
- Learn to implement adaptive learning rates
- Gain practical experience with gradient-based optimization

## Theory
Adagrad adapts the learning rate for each parameter based on the historical gradients. The key equations are:

$G_t = G_{t-1} + g_t^2$ (Accumulated squared gradients)

$\theta_t = \theta_{t-1} - \dfrac{\alpha}{\sqrt{G_t} + \epsilon} \cdot g_t$ (Parameter update)

Where:
- $G_t$ is the sum of squared gradients up to time step $t$
- $\alpha$ is the initial learning rate
- $\epsilon$ is a small constant for numerical stability
- $g_t$ is the gradient at time step $t$

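
As a quick check of these equations, the snippet below computes a single update by hand; the concrete values chosen for $\theta_{t-1}$, $G_{t-1}$, $g_t$, $\alpha$, and $\epsilon$ are purely illustrative.

```python
import numpy as np

# One Adagrad step with illustrative values
alpha, epsilon = 0.01, 1e-8            # learning rate and stability constant
theta_prev, G_prev, g = 1.0, 1.0, 0.1  # previous parameter, accumulator, gradient

G = G_prev + g ** 2                                      # G_t = 1.0 + 0.01 = 1.01
theta = theta_prev - alpha / (np.sqrt(G) + epsilon) * g  # parameter update
print(G, theta)                                          # 1.01, approximately 0.999005
```
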
Read more at:

1. Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12, 2121–2159. [PDF](https://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf)
2. Ruder, S. (2017). An overview of gradient descent optimization algorithms. [arXiv:1609.04747](https://arxiv.org/pdf/1609.04747)

## Problem Statement
Implement the Adagrad optimizer update step function. Your function should take the current parameter value, gradient, and accumulated squared gradients as inputs, and return the updated parameter value and new accumulated squared gradients.

### Input Format
The function should accept:
- parameter: Current parameter value
- grad: Current gradient
- G: Accumulated squared gradients
- learning_rate: Learning rate (default=0.01)
- epsilon: Small constant for numerical stability (default=1e-8)

### Output Format
Return a tuple: (updated_parameter, updated_G)

## Example
```python
# Example usage:
parameter = 1.0
grad = 0.1
G = 1.0

new_param, new_G = adagrad_optimizer(parameter, grad, G)
```
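
One way to satisfy this interface is sketched below. It assumes NumPy (as suggested in the Tips) and handles both scalar and array inputs; treat it as an illustrative sketch rather than the official reference solution.

```python
import numpy as np

def adagrad_optimizer(parameter, grad, G, learning_rate=0.01, epsilon=1e-8):
    """Perform one Adagrad update step.

    parameter, grad, and G may be Python scalars or NumPy arrays of the
    same shape. Returns (updated_parameter, updated_G).
    """
    # Accumulate squared gradients: G_t = G_{t-1} + g_t^2
    G = G + np.square(grad)
    # Scale the step size by the root of the accumulated squared gradients
    parameter = parameter - learning_rate * grad / (np.sqrt(G) + epsilon)
    return parameter, G
```

With the example inputs above (parameter = 1.0, grad = 0.1, G = 1.0), this sketch returns new_G = 1.01 and new_param ≈ 0.999005.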

## Tips
- Initialize G as zeros
- Use numpy for numerical operations
- Test with both scalar and array inputs (see the array snippet below)

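
Assuming the adagrad_optimizer sketch above, an array-valued check might look like this; the parameter and gradient values are arbitrary.

```python
import numpy as np

params = np.array([1.0, -2.0, 0.5])
grads = np.array([0.1, -0.3, 0.0])
G = np.zeros_like(params)            # accumulator starts at zeros

params, G = adagrad_optimizer(params, grads, G)
print(G)                             # [0.01 0.09 0.  ]
print(params)                        # approximately [0.99, -1.99, 0.5]
```
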
---