
Commit 66930fc

Merge pull request #458 from mavleo96/adamax
Adamax Optimizer
2 parents 6716607 + 4f90bb2 commit 66930fc

File tree: 7 files changed, +194 -0 lines changed
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
Implement the Adamax optimizer update step function. Your function should take the current parameter value, gradient, and moving averages as inputs, and return the updated parameter value and new moving averages. The function should also handle scalar and array inputs and include bias correction for the moving averages.
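For reference, the expected call signature matches the starter code included later in this commit; a minimal stub looks like this (the body is left for the solver to fill in):

```python
import numpy as np

def adamax_optimizer(parameter, grad, m, u, t,
                     learning_rate=0.002, beta1=0.9, beta2=0.999, epsilon=1e-8):
    # Your code here: update m and u, bias-correct m, then update the parameter
    return np.round(parameter, 5), np.round(m, 5), np.round(u, 5)
```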
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
{
    "input": "parameter = 1.0, grad = 0.1, m = 0.0, u = 0.0, t = 1",
    "output": "(0.998, 0.01, 0.1)",
    "reasoning": "With the defaults beta1 = 0.9, beta2 = 0.999, and learning_rate = 0.002, one Adamax step gives m = 0.9*0.0 + 0.1*0.1 = 0.01 and u = max(0.999*0.0, |0.1|) = 0.1. The bias-corrected first moment is m_hat = 0.01 / (1 - 0.9^1) = 0.1, so the parameter update is 1.0 - 0.002 * 0.1 / 0.1 = 0.998 and the function returns (0.998, 0.01, 0.1)."
}
Lines changed: 87 additions & 0 deletions
@@ -0,0 +1,87 @@
# Implementing Adamax Optimizer

## Introduction
Adamax is a variant of the Adam optimizer that uses the infinity norm (max) instead of the L2 norm for the second moment estimate. This makes it more robust in some cases and can lead to better convergence in certain scenarios, particularly when dealing with sparse gradients.

## Learning Objectives
- Understand how Adamax optimization works
- Learn to implement Adamax-based gradient updates
- Understand the effect of the infinity norm on optimization

## Theory
Adamax maintains a moving average of gradients (first moment) and uses the infinity norm for the second moment estimate. The key equations are:

First moment estimate (same as Adam):

$m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t$

The second moment estimate in Adam uses the $l_2$ norm:

$v_t = \beta_2 v_{t-1} + (1-\beta_2)|g_t|^2$

This can be generalized to the $l_p$ norm, but norms for large $p$ are numerically unstable. However, Adamax uses the $l_\infty$ (infinity) norm, which converges to the stable recursion:

$u_t = \max(\beta_2 \cdot u_{t-1}, |g_t|)$

Bias correction:

$\hat{m}_t = \dfrac{m_t}{1-\beta_1^t}$

Parameter update:

$\theta_t = \theta_{t-1} - \dfrac{\eta}{u_t} \hat{m}_t$

Where:
- $m_t$ is the first moment estimate at time $t$
- $u_t$ is the infinity norm estimate at time $t$
- $\beta_1$ is the first moment coefficient (typically 0.9)
- $\beta_2$ is the second moment coefficient (typically 0.999)
- $\eta$ is the learning rate
- $g_t$ is the gradient at time $t$

Note: Unlike Adam, Adamax doesn't require bias correction for $u_t$ because the max operation makes it less susceptible to bias towards zero. In practice, a small $\epsilon$ is added to $u_t$ in the update to avoid division by zero.
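To make these equations concrete, here is a minimal sketch of a single Adamax step using the example values from this problem (parameter = 1.0, grad = 0.1, m = u = 0, t = 1) and the default hyperparameters; the helper name `adamax_step` is just for illustration, not the required interface:

```python
import numpy as np

def adamax_step(theta, g, m, u, t, lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g          # first moment: moving average of gradients
    u = np.maximum(beta2 * u, np.abs(g))     # infinity-norm estimate, no bias correction
    m_hat = m / (1 - beta1 ** t)             # bias-correct the first moment only
    theta = theta - lr * m_hat / (u + eps)   # eps guards against division by zero
    return theta, m, u

# One step: m = 0.01, u = 0.1, m_hat = 0.1, update = 0.002 * 0.1 / 0.1 = 0.002
print(adamax_step(1.0, 0.1, 0.0, 0.0, 1))    # approximately (0.998, 0.01, 0.1)
```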

Read more at:

1. Kingma, D. and Ba, J. (2015). Adam: A Method for Stochastic Optimization. [arXiv:1412.6980](https://arxiv.org/abs/1412.6980)
2. Ruder, S. (2017). An overview of gradient descent optimization algorithms. [arXiv:1609.04747](https://arxiv.org/pdf/1609.04747)

## Problem Statement
Implement the Adamax optimizer update step function. Your function should take the current parameter value, gradient, and moment estimates as inputs, and return the updated parameter value and new moment estimates.

### Input Format
The function should accept:
- parameter: Current parameter value
- grad: Current gradient
- m: First moment estimate
- u: Infinity norm estimate
- t: Current timestep
- learning_rate: Learning rate (default=0.002)
- beta1: First moment decay rate (default=0.9)
- beta2: Second moment decay rate (default=0.999)
- epsilon: Small constant for numerical stability (default=1e-8)

### Output Format
Return a tuple: (updated_parameter, updated_m, updated_u)

## Example
```python
# Example usage:
parameter = 1.0
grad = 0.1
m = 0.0
u = 0.0
t = 1

new_param, new_m, new_u = adamax_optimizer(parameter, grad, m, u, t)
```

## Tips
- Initialize m and u as zeros
- Keep track of timestep t for bias correction
- Use numpy for numerical operations
- Test with both scalar and array inputs
- Remember to apply bias correction to the first moment estimate

---
Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
{
    "id": "148",
    "title": "Adamax Optimizer",
    "difficulty": "easy",
    "category": "Deep Learning",
    "video": "",
    "likes": "0",
    "dislikes": "0",
    "contributor": [
        {
            "profile_link": "https://github.com/mavleo96",
            "name": "Vijayabharathi Murugan"
        }
    ],
    "tinygrad_difficulty": null,
    "pytorch_difficulty": null
}
Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
import numpy as np

def adamax_optimizer(parameter, grad, m, u, t, learning_rate=0.002, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """
    Update parameters using the Adamax optimizer.

    Adamax is a variant of Adam based on the infinity norm: it replaces the
    exponential moving average of squared gradients with an exponentially
    decaying maximum of past gradient magnitudes.

    Args:
        parameter: Current parameter value
        grad: Current gradient
        m: First moment estimate
        u: Infinity norm estimate
        t: Current timestep
        learning_rate: Learning rate (default=0.002)
        beta1: First moment decay rate (default=0.9)
        beta2: Infinity norm decay rate (default=0.999)
        epsilon: Small constant for numerical stability (default=1e-8)

    Returns:
        tuple: (updated_parameter, updated_m, updated_u)
    """
    assert learning_rate > 0, "Learning rate must be positive"
    assert 0 <= beta1 < 1, "Beta1 must be between 0 and 1"
    assert 0 <= beta2 < 1, "Beta2 must be between 0 and 1"
    assert epsilon > 0, "Epsilon must be positive"
    assert np.all(np.asarray(u) >= 0), "u must be non-negative"

    # Update biased first moment estimate
    m = beta1 * m + (1 - beta1) * grad

    # Update infinity norm estimate (no bias correction needed)
    u = np.maximum(beta2 * u, np.abs(grad))

    # Compute bias-corrected first moment estimate
    m_hat = m / (1 - beta1**t)

    # Update parameters
    update = learning_rate * m_hat / (u + epsilon)
    parameter = parameter - update

    return np.round(parameter, 5), np.round(m, 5), np.round(u, 5)
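
As a quick sanity check (this snippet is illustrative and not part of the committed files), calling the solution above with the array-valued inputs from the test file reproduces the expected rounded output:

```python
import numpy as np

params = np.array([1.0, 2.0])
grads = np.array([0.1, 0.2])
m = np.array([1.0, 1.0])
u = np.array([1.0, 1.0])

# m -> [0.91, 0.92], u -> max(0.999 * 1, |g|) = [0.999, 0.999],
# m_hat = m / (1 - 0.9), step = 0.002 * m_hat / (u + 1e-8)
print(adamax_optimizer(params, grads, m, u, 1))
# (array([0.98178, 1.98158]), array([0.91, 0.92]), array([0.999, 0.999]))
```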
Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
import numpy as np

def adamax_optimizer(parameter, grad, m, u, t, learning_rate=0.002, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """
    Update parameters using the Adamax optimizer.

    Adamax is a variant of Adam based on the infinity norm: it replaces the
    exponential moving average of squared gradients with an exponentially
    decaying maximum of past gradient magnitudes.

    Args:
        parameter: Current parameter value
        grad: Current gradient
        m: First moment estimate
        u: Infinity norm estimate
        t: Current timestep
        learning_rate: Learning rate (default=0.002)
        beta1: First moment decay rate (default=0.9)
        beta2: Infinity norm decay rate (default=0.999)
        epsilon: Small constant for numerical stability (default=1e-8)

    Returns:
        tuple: (updated_parameter, updated_m, updated_u)
    """
    # Your code here
    return np.round(parameter, 5), np.round(m, 5), np.round(u, 5)
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
[
    {
        "test": "print(adamax_optimizer(1., 0.1, 1., 1., 1, 0.002, 0.9, 0.999, 1e-8))",
        "expected_output": "(0.98178, 0.91, 0.999)"
    },
    {
        "test": "print(adamax_optimizer(np.array([1., 2.]), np.array([0.1, 0.2]), np.array([1., 1.]), np.array([1., 1.]), 1, 0.002, 0.9, 0.999, 1e-8))",
        "expected_output": "(array([0.98178, 1.98158]), array([0.91, 0.92]), array([0.999, 0.999]))"
    },
    {
        "test": "print(adamax_optimizer(np.array([1., 2.]), np.array([0.0, 0.0]), np.array([0.1, 0.1]), np.array([0., 0.]), 1, 0.002, 0.9, 0.999, 1e-8))",
        "expected_output": "(array([-179999., -179998.]), array([0.09, 0.09]), array([0., 0.]))"
    },
    {
        "test": "print(adamax_optimizer(1., 0.1, 1., 1., 1, 0.002, 0., 0., 1e-8))",
        "expected_output": "(0.998, 0.1, 0.1)"
    }
]
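
One note on the third test case, which may look surprising: with grad = 0 and u = 0, the infinity-norm estimate stays at zero, so the update divides the bias-corrected momentum (0.09 / (1 - 0.9) = 0.9) by epsilon alone, producing a step of 0.002 * 0.9 / 1e-8 = 180000 and parameters of roughly -179999 and -179998. A quick check of that arithmetic:

```python
# Third test case: grad = 0, m = 0.1, u = 0, t = 1, defaults otherwise
m = 0.9 * 0.1 + 0.1 * 0.0          # 0.09
u = max(0.999 * 0.0, abs(0.0))     # 0.0 -> denominator is just epsilon
m_hat = m / (1 - 0.9 ** 1)         # 0.9
step = 0.002 * m_hat / (u + 1e-8)  # 180000.0
print(1.0 - step, 2.0 - step)      # approximately -179999.0 -179998.0
```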
