
Commit 6716607

Merge pull request #460 from mavleo96/adadelta
Adadelta Optimizer
2 parents 8317003 + 54606cb commit 6716607

File tree

7 files changed: +181 additions, 0 deletions
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
Implement the Adadelta optimizer update step function. Your function should take the current parameter value, gradient, and moving averages as inputs, and return the updated parameter value and new moving averages. The function should handle both scalar and array inputs, and include proper input validation.
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
{
    "input": "parameter = 1.0, grad = 0.1, u = 1.0, v = 1.0, rho = 0.95, epsilon = 1e-6",
    "output": "(0.89743, 0.9505, 0.95053)",
    "explanation": "The Adadelta optimizer returns the updated parameter along with the updated running averages of squared gradients (u) and squared parameter updates (v). With parameter=1.0, grad=0.1, u=1.0, v=1.0, and rho=0.95, the updated parameter becomes 0.89743."
}
Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
# Implementing Adadelta Optimizer

## Introduction
Adadelta is an extension of Adagrad that addresses two key issues: the aggressive, monotonically decreasing learning rate and the need for manual learning-rate tuning. While Adagrad accumulates all past squared gradients, Adadelta restricts the influence of past gradients to a window of size w. Instead of explicitly storing w past gradients, it efficiently approximates this window with an exponential moving average with decay rate ρ, which avoids Adagrad's unbounded accumulation. Additionally, it automatically handles the units of the updates, eliminating the need for a manually set learning rate.

## Learning Objectives
- Understand how the Adadelta optimizer works
- Learn to implement adaptive learning rates with moving averages

## Theory
Adadelta uses two main ideas:
1. An exponential moving average of squared gradients to approximate a window of size w
2. Automatic unit correction through the ratio of accumulated parameter updates to accumulated gradients

The key equations are (here u denotes the accumulator of squared gradients and v the accumulator of squared parameter updates, matching the function signature used below):

$u_t = \rho u_{t-1} + (1-\rho)g_t^2$ (Exponential moving average of squared gradients)

The above approximates a window size of $w \approx \dfrac{1}{1-\rho}$

$\Delta\theta_t = -\dfrac{\sqrt{v_{t-1} + \epsilon}}{\sqrt{u_t + \epsilon}} \cdot g_t$ (Parameter update with unit correction)

$v_t = \rho v_{t-1} + (1-\rho)\Delta\theta_t^2$ (Exponential moving average of squared parameter updates)

Where:
- $u_t$ is the exponential moving average of squared gradients (decay rate ρ)
- $v_t$ is the exponential moving average of squared parameter updates (decay rate ρ)
- $\rho$ is the decay rate (typically 0.9 to 0.95) that controls the effective window size w ≈ 1/(1-ρ)
- $\epsilon$ is a small constant for numerical stability
- $g_t$ is the gradient at time step t

The ratio $\dfrac{\sqrt{v_{t-1} + \epsilon}}{\sqrt{u_t + \epsilon}}$ serves as an adaptive learning rate that automatically handles the units of the updates, making the algorithm more robust to different parameter scales. Unlike Adagrad, Adadelta does not require a manually set learning rate, which makes it especially useful when hyperparameter tuning is difficult. This automatic adaptation is achieved through the ratio of the root mean square (RMS) of past parameter updates to the RMS of recent gradients.
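
For example, a single update with $\theta_{t-1} = 1.0$, $g_t = 0.1$, $u_{t-1} = v_{t-1} = 1.0$, $\rho = 0.95$, and $\epsilon = 10^{-6}$ (the same values as the worked example below) proceeds as:

$u_t = 0.95 \cdot 1.0 + 0.05 \cdot 0.1^2 = 0.9505$

$\Delta\theta_t = -\dfrac{\sqrt{1.0 + 10^{-6}}}{\sqrt{0.9505 + 10^{-6}}} \cdot 0.1 \approx -0.10257$

$\theta_t = 1.0 - 0.10257 \approx 0.89743$

$v_t = 0.95 \cdot 1.0 + 0.05 \cdot (-0.10257)^2 \approx 0.95053$
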
Read more at:

1. Zeiler, M. D. (2012). ADADELTA: An Adaptive Learning Rate Method. [arXiv:1212.5701](https://arxiv.org/abs/1212.5701)
2. Ruder, S. (2017). An overview of gradient descent optimization algorithms. [arXiv:1609.04747](https://arxiv.org/pdf/1609.04747)

## Problem Statement
Implement the Adadelta optimizer update step function. Your function should take the current parameter value, the gradient, and the two accumulated statistics as inputs, and return the updated parameter value together with the new accumulated statistics.

### Input Format
The function should accept:
- parameter: Current parameter value
- grad: Current gradient
- u: Exponentially decaying average of squared gradients
- v: Exponentially decaying average of squared parameter updates
- rho: Decay rate (default=0.95)
- epsilon: Small constant for numerical stability (default=1e-6)

### Output Format
Return tuple: (updated_parameter, updated_u, updated_v)

## Example
```python
# Example usage:
parameter = 1.0
grad = 0.1
u = 1.0
v = 1.0

new_param, new_u, new_v = adadelta_optimizer(parameter, grad, u, v)
```
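
With the default `rho=0.95` and `epsilon=1e-6`, this call should return approximately `(0.89743, 0.9505, 0.95053)`, matching the worked example in the Theory section.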

## Tips
- Initialize u and v as zeros
- Use numpy for numerical operations
- Test with both scalar and array inputs (see the sketch after this list)
- The learning rate is determined automatically by the algorithm
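
Below is a minimal sketch of how the finished function might be exercised on an array parameter, assuming it follows the signature and defaults described above; the toy objective and printed values are illustrative only:

```python
import numpy as np

# Toy objective f(x) = sum(x^2), so the gradient is simply 2 * x.
parameter = np.array([1.0, -2.0])
u = np.zeros_like(parameter)  # running average of squared gradients
v = np.zeros_like(parameter)  # running average of squared parameter updates

for step in range(5):
    grad = 2 * parameter
    parameter, u, v = adadelta_optimizer(parameter, grad, u, v)
    print(step, parameter)
```

Because u and v start at zero, the first updates are tiny (on the order of $\sqrt{\epsilon}$), which is the expected warm-up behavior of Adadelta.
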
---
Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
{
    "id": "149",
    "title": "Adadelta Optimizer",
    "difficulty": "medium",
    "category": "Deep Learning",
    "video": "",
    "likes": "0",
    "dislikes": "0",
    "contributor": [
        {
            "profile_link": "https://github.com/mavleo96",
            "name": "Vijayabharathi Murugan"
        }
    ],
    "tinygrad_difficulty": null,
    "pytorch_difficulty": null
}
Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
import numpy as np

def adadelta_optimizer(parameter, grad, u, v, rho=0.95, epsilon=1e-6):
    """
    Update parameters using the AdaDelta optimizer.

    AdaDelta is an extension of AdaGrad that seeks to reduce its aggressive,
    monotonically decreasing learning rate.

    Args:
        parameter: Current parameter value
        grad: Current gradient
        u: Running average of squared gradients
        v: Running average of squared parameter updates
        rho: Decay rate for the moving averages (default=0.95)
        epsilon: Small constant for numerical stability (default=1e-6)

    Returns:
        tuple: (updated_parameter, updated_u, updated_v)
    """
    # Input validation (np.all handles both scalar and array inputs)
    assert 0 <= rho < 1, "Rho must be between 0 and 1"
    assert epsilon > 0, "Epsilon must be positive"
    assert np.all(u >= 0), "u must be non-negative"
    assert np.all(v >= 0), "v must be non-negative"

    # Update running average of squared gradients
    u = rho * u + (1 - rho) * grad**2

    # Compute RMS of gradients
    RMS_g = np.sqrt(u + epsilon)

    # Compute RMS of previous parameter updates
    RMS_dx = np.sqrt(v + epsilon)

    # Compute parameter update
    dx = -RMS_dx / RMS_g * grad

    # Update running average of squared parameter updates
    v = rho * v + (1 - rho) * dx**2

    # Update parameters
    parameter = parameter + dx

    return np.round(parameter, 5), np.round(u, 5), np.round(v, 5)
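
# Quick sanity check: with the inputs from the first documented test case,
# this call should print (0.49035, 0.9625, 0.96299).
if __name__ == "__main__":
    print(adadelta_optimizer(1., 0.5, 1., 1., 0.95, 1e-6))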
Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
import numpy as np

def adadelta_optimizer(parameter, grad, u, v, rho=0.95, epsilon=1e-6):
    """
    Update parameters using the AdaDelta optimizer.

    AdaDelta is an extension of AdaGrad that seeks to reduce its aggressive,
    monotonically decreasing learning rate.

    Args:
        parameter: Current parameter value
        grad: Current gradient
        u: Running average of squared gradients
        v: Running average of squared parameter updates
        rho: Decay rate for the moving averages (default=0.95)
        epsilon: Small constant for numerical stability (default=1e-6)

    Returns:
        tuple: (updated_parameter, updated_u, updated_v)
    """
    # Your code here
    return np.round(parameter, 5), np.round(u, 5), np.round(v, 5)
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
[
    {
        "test": "print(adadelta_optimizer(1., 0.5, 1., 1., 0.95, 1e-6))",
        "expected_output": "(0.49035, 0.9625, 0.96299)"
    },
    {
        "test": "print(adadelta_optimizer(np.array([1., 2.]), np.array([0.1, 0.2]), np.array([1., 1.]), np.array([1., 1.]), 0.95, 1e-6))",
        "expected_output": "(array([0.89743, 1.79502]), array([0.9505, 0.952]), array([0.95053, 0.9521]))"
    },
    {
        "test": "print(adadelta_optimizer(np.array([1., 2.]), np.array([0., 0.2]), np.array([0., 1.]), np.array([0., 1.]), 0.95, 1e-6))",
        "expected_output": "(array([1., 1.79502]), array([0., 0.952]), array([0., 0.9521]))"
    },
    {
        "test": "print(adadelta_optimizer(np.array([1., 1.]), np.array([1., 1.]), np.array([10000., 1.]), np.array([1., 1.]), 0.95, 1e-6))",
        "expected_output": "(array([0.98974, 0.]), array([9500.05, 1.]), array([0.95001, 1.]))"
    },
    {
        "test": "print(adadelta_optimizer(1., 0.5, 1., 1., 0., 1e-6))",
        "expected_output": "(0., 0.25, 1.0)"
    }
]
