# Implementing Adamax Optimizer

## Introduction
Adamax is a variant of the Adam optimizer that uses the infinity norm (max) instead of the L2 norm for the second moment estimate. This can make it more robust and lead to better convergence in some scenarios, particularly when dealing with sparse gradients.

## Learning Objectives
- Understand how Adamax optimization works
- Learn to implement Adamax-based gradient updates
- Understand the effect of the infinity norm on optimization

## Theory
Adamax maintains a moving average of gradients (first moment) and uses the infinity norm for the second moment estimate. The key equations are:

First moment estimate (same as Adam):

$m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t$

The second moment estimate in Adam uses the $l_2$ norm:

$v_t = \beta_2 v_{t-1} + (1-\beta_2)|g_t|^2$

This can be generalized to the $l_p$ norm, but norms with large $p$ become numerically unstable. Adamax instead uses the $l_\infty$ norm (infinity norm); in the limit $p \to \infty$ the update converges to the simple recursion:

$u_t = \max(\beta_2 \cdot u_{t-1}, |g_t|)$

Unlike Adam, Adamax doesn't require bias correction for $u_t$ because the max operation makes it less susceptible to bias towards zero.

Bias correction:
$\hat{m}_t = \dfrac{m_t}{1-\beta_1^t}$

Parameter update:
$\theta_t = \theta_{t-1} - \dfrac{\eta}{u_t} \hat{m}_t$

In practice, a small constant $\epsilon$ is added to the denominator ($u_t + \epsilon$) for numerical stability; this is the `epsilon` argument in the function described below.

Where:
- $m_t$ is the first moment estimate at time t
- $u_t$ is the infinity norm estimate at time t
- $\beta_1$ is the first moment coefficient (typically 0.9)
- $\beta_2$ is the second moment coefficient (typically 0.999)
- $\eta$ is the learning rate
- $g_t$ is the gradient at time t
- $\theta_t$ is the parameter value at time t
- $\epsilon$ is a small constant for numerical stability (typically 1e-8)

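To build intuition for the difference, the short sketch below (a toy illustration with made-up gradient values) compares Adam's second moment $v_t$ with Adamax's infinity-norm estimate $u_t$ when one unusually large gradient appears:

```python
beta2 = 0.999
grads = [0.1, 0.1, 5.0, 0.1, 0.1]  # one unusually large gradient in the middle

v, u = 0.0, 0.0
for g in grads:
    v = beta2 * v + (1 - beta2) * g ** 2  # Adam: decayed average of squared gradients
    u = max(beta2 * u, abs(g))            # Adamax: decayed running max of |g|
    print(f"|g|={abs(g):.1f}  sqrt(v)={v ** 0.5:.4f}  u={u:.4f}")
```

Note how $u_t$ jumps to the large gradient immediately and then decays slowly, while $\sqrt{v_t}$ moves gradually and starts strongly biased towards zero; this is why Adam needs a bias correction for $v_t$ but Adamax needs none for $u_t$.
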
Read more at:

1. Kingma, D. and Ba, J. (2015). Adam: A Method for Stochastic Optimization. [arXiv:1412.6980](https://arxiv.org/abs/1412.6980)
2. Ruder, S. (2017). An overview of gradient descent optimization algorithms. [arXiv:1609.04747](https://arxiv.org/pdf/1609.04747)

## Problem Statement
Implement the Adamax optimizer update step function. Your function should take the current parameter value, gradient, and moment estimates as inputs, and return the updated parameter value and new moment estimates.
| 52 | + |
| 53 | +### Input Format |
| 54 | +The function should accept: |
| 55 | +- parameter: Current parameter value |
| 56 | +- grad: Current gradient |
| 57 | +- m: First moment estimate |
| 58 | +- u: Infinity norm estimate |
| 59 | +- t: Current timestep |
| 60 | +- learning_rate: Learning rate (default=0.002) |
| 61 | +- beta1: First moment decay rate (default=0.9) |
| 62 | +- beta2: Second moment decay rate (default=0.999) |
| 63 | +- epsilon: Small constant for numerical stability (default=1e-8) |

### Output Format
Return tuple: (updated_parameter, updated_m, updated_u)
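
Putting the interface and the equations together, one possible implementation looks like the sketch below (a minimal NumPy version, not the only valid approach; adding `epsilon` to the denominator is one common convention):

```python
import numpy as np

def adamax_optimizer(parameter, grad, m, u, t,
                     learning_rate=0.002, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adamax update step; returns (updated_parameter, updated_m, updated_u)."""
    # First moment estimate (same as Adam)
    m = beta1 * m + (1 - beta1) * grad

    # Infinity-norm estimate: exponentially decayed running max of |grad|
    u = np.maximum(beta2 * u, np.abs(grad))

    # Bias-correct the first moment only; u needs no correction
    m_hat = m / (1 - beta1 ** t)

    # Parameter update; epsilon keeps the division stable when u is near zero
    parameter = parameter - learning_rate * m_hat / (u + epsilon)

    return parameter, m, u
```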

## Example
```python
# Example usage:
parameter = 1.0
grad = 0.1
m = 0.0
u = 0.0
t = 1

new_param, new_m, new_u = adamax_optimizer(parameter, grad, m, u, t)
```
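
Assuming the defaults above ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\eta = 0.002$, $\epsilon = 10^{-8}$) and an implementation that follows the equations in the Theory section, the first step can be checked by hand: $m_1 = 0.9 \cdot 0 + 0.1 \cdot 0.1 = 0.01$, $u_1 = \max(0.999 \cdot 0, |0.1|) = 0.1$, $\hat{m}_1 = 0.01 / (1 - 0.9) = 0.1$, and $\theta_1 = 1.0 - 0.002 \cdot 0.1 / (0.1 + 10^{-8}) \approx 0.998$. So `new_param` should come out at roughly 0.998, `new_m` at 0.01, and `new_u` at 0.1.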

## Tips
- Initialize m and u as zeros
- Keep track of the timestep t for bias correction
- Use numpy for numerical operations
- Test with both scalar and array inputs (see the sketch after this list)
- Remember to apply bias correction to the first moment estimate
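
As a quick check of array support (assuming an `adamax_optimizer` like the NumPy sketch above; the values here are just for illustration):

```python
import numpy as np

# Each element should carry its own m/u state; a zero gradient should leave
# its parameter unchanged (epsilon prevents division by zero).
params = np.array([1.0, -2.0, 0.5])
grads = np.array([0.1, -0.3, 0.0])
m = np.zeros_like(params)
u = np.zeros_like(params)

new_params, new_m, new_u = adamax_optimizer(params, grads, m, u, t=1)
print(new_params, new_m, new_u)
assert new_params.shape == params.shape
```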

---