# Implementing Adamax Optimizer

## Introduction
Adamax is a variant of the Adam optimizer that uses the infinity norm (max) instead of the L2 norm for the second moment estimate. This can make it more robust and lead to better convergence in some scenarios, particularly when dealing with sparse gradients.

## Learning Objectives
- Understand how Adamax optimization works
- Learn to implement Adamax-based gradient updates
- Understand the effect of the infinity norm on optimization

## Theory
Adamax maintains a moving average of gradients (first moment) and uses the infinity norm for the second moment estimate. The key equations are:

First moment estimate (same as Adam):

$m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t$

The second moment estimate in Adam uses the $l_2$ norm:

$v_t = \beta_2 v_{t-1} + (1-\beta_2)|g_t|^2$

This can be generalized to the $l_p$ norm, but norms with large $p$ become numerically unstable. Adamax instead uses the $l_\infty$ norm (infinity norm); in the limit $p \to \infty$ the update converges to the simple recursion:

$u_t = \max(\beta_2 \cdot u_{t-1}, |g_t|)$

Unlike Adam, Adamax doesn't require bias correction for $u_t$ because the max operation makes it less susceptible to bias towards zero.

Bias correction:
$\hat{m}_t = \dfrac{m_t}{1-\beta_1^t}$

Parameter update:
$\theta_t = \theta_{t-1} - \dfrac{\eta}{u_t} \hat{m}_t$

In practice, a small constant $\epsilon$ is added to the denominator ($u_t + \epsilon$) for numerical stability; this is the `epsilon` argument in the function described below.

Where:
- $m_t$ is the first moment estimate at time t
- $u_t$ is the infinity norm estimate at time t
- $\beta_1$ is the first moment coefficient (typically 0.9)
- $\beta_2$ is the second moment coefficient (typically 0.999)
- $\eta$ is the learning rate
- $g_t$ is the gradient at time t
- $\theta_t$ is the parameter value at time t
- $\epsilon$ is a small constant for numerical stability (typically 1e-8)

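To build intuition for the difference, the short sketch below (a toy illustration with made-up gradient values) compares Adam's second moment $v_t$ with Adamax's infinity-norm estimate $u_t$ when one unusually large gradient appears:

```python
beta2 = 0.999
grads = [0.1, 0.1, 5.0, 0.1, 0.1]  # one unusually large gradient in the middle

v, u = 0.0, 0.0
for g in grads:
    v = beta2 * v + (1 - beta2) * g ** 2  # Adam: decayed average of squared gradients
    u = max(beta2 * u, abs(g))            # Adamax: decayed running max of |g|
    print(f"|g|={abs(g):.1f}  sqrt(v)={v ** 0.5:.4f}  u={u:.4f}")
```

Note how $u_t$ jumps to the large gradient immediately and then decays slowly, while $\sqrt{v_t}$ moves gradually and starts strongly biased towards zero; this is why Adam needs a bias correction for $v_t$ but Adamax needs none for $u_t$.
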
Read more at:

1. Kingma, D. and Ba, J. (2015). Adam: A Method for Stochastic Optimization. [arXiv:1412.6980](https://arxiv.org/abs/1412.6980)
2. Ruder, S. (2017). An overview of gradient descent optimization algorithms. [arXiv:1609.04747](https://arxiv.org/pdf/1609.04747)

## Problem Statement
Implement the Adamax optimizer update step function. Your function should take the current parameter value, gradient, and moment estimates as inputs, and return the updated parameter value and new moment estimates.
| 52 | + |
| 53 | +### Input Format |
| 54 | +The function should accept: |
| 55 | +- parameter: Current parameter value |
| 56 | +- grad: Current gradient |
| 57 | +- m: First moment estimate |
| 58 | +- u: Infinity norm estimate |
| 59 | +- t: Current timestep |
| 60 | +- learning_rate: Learning rate (default=0.002) |
| 61 | +- beta1: First moment decay rate (default=0.9) |
| 62 | +- beta2: Second moment decay rate (default=0.999) |
| 63 | +- epsilon: Small constant for numerical stability (default=1e-8) |

### Output Format
Return tuple: (updated_parameter, updated_m, updated_u)
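
Putting the interface and the equations together, one possible implementation looks like the sketch below (a minimal NumPy version, not the only valid approach; adding `epsilon` to the denominator is one common convention):

```python
import numpy as np

def adamax_optimizer(parameter, grad, m, u, t,
                     learning_rate=0.002, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adamax update step; returns (updated_parameter, updated_m, updated_u)."""
    # First moment estimate (same as Adam)
    m = beta1 * m + (1 - beta1) * grad

    # Infinity-norm estimate: exponentially decayed running max of |grad|
    u = np.maximum(beta2 * u, np.abs(grad))

    # Bias-correct the first moment only; u needs no correction
    m_hat = m / (1 - beta1 ** t)

    # Parameter update; epsilon keeps the division stable when u is near zero
    parameter = parameter - learning_rate * m_hat / (u + epsilon)

    return parameter, m, u
```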

## Example
```python
# Example usage:
parameter = 1.0
grad = 0.1
m = 0.0
u = 0.0
t = 1

new_param, new_m, new_u = adamax_optimizer(parameter, grad, m, u, t)
```
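
Assuming the defaults above ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\eta = 0.002$, $\epsilon = 10^{-8}$) and an implementation that follows the equations in the Theory section, the first step can be checked by hand: $m_1 = 0.9 \cdot 0 + 0.1 \cdot 0.1 = 0.01$, $u_1 = \max(0.999 \cdot 0, |0.1|) = 0.1$, $\hat{m}_1 = 0.01 / (1 - 0.9) = 0.1$, and $\theta_1 = 1.0 - 0.002 \cdot 0.1 / (0.1 + 10^{-8}) \approx 0.998$. So `new_param` should come out at roughly 0.998, `new_m` at 0.01, and `new_u` at 0.1.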

## Tips
- Initialize m and u as zeros
- Keep track of the timestep t for bias correction
- Use numpy for numerical operations
- Test with both scalar and array inputs (see the sketch after this list)
- Remember to apply bias correction to the first moment estimate
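
As a quick check of array support (assuming an `adamax_optimizer` like the NumPy sketch above; the values here are just for illustration):

```python
import numpy as np

# Each element should carry its own m/u state; a zero gradient should leave
# its parameter unchanged (epsilon prevents division by zero).
params = np.array([1.0, -2.0, 0.5])
grads = np.array([0.1, -0.3, 0.0])
m = np.zeros_like(params)
u = np.zeros_like(params)

new_params, new_m, new_u = adamax_optimizer(params, grads, m, u, t=1)
print(new_params, new_m, new_u)
assert new_params.shape == params.shape
```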

---