
Commit 5662e66

Merge pull request #388 from turkunov/dyt
New Problem: Dynamic Tanh
2 parents f0ede5a + 137934c commit 5662e66

File tree

2 files changed: +107, -0 lines


Problems/128_dyt/learn.md

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
A recent study (https://arxiv.org/pdf/2503.10622) demonstrates that layer normalization, which is ubiquitous in Transformers, produces tanh-like, S-shaped input-output mappings. By replacing the normalization layer with a new layer called "Dynamic Tanh" (DyT for short), Transformers without normalization can match or exceed the performance of their normalized counterparts, mostly without hyperparameter tuning.

### Normalization layer
Consider a standard NLP task where an input $x$ has shape $(B,T,C)$: $B$ is the batch size, $T$ the number of tokens (sequence length) and $C$ the embedding dimension. The output of a normalization layer is generally computed as $norm(x)=\gamma\left(\frac{x-\mu}{\sqrt{\sigma^2+\varepsilon}}\right)+\beta$, where $\gamma$ and $\beta$ are learnable parameters of shape $(C,)$. The statistics are computed per channel $k$ as $\mu_k=\frac{1}{BT}\sum_{i=1}^{B}\sum_{j=1}^{T}x_{ijk}$ and $\sigma_k^2=\frac{1}{BT}\sum_{i,j}\left(x_{ijk}-\mu_k\right)^2$.
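
For concreteness, here is a minimal NumPy sketch of that formula, assuming the per-channel statistics above (the function name `normalize` and the epsilon default are illustrative, not part of the reference solution):

```python
import numpy as np

def normalize(x: np.ndarray, gamma: np.ndarray, beta: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    # x: (B, T, C); statistics are taken over the batch and token axes, per channel
    mu = x.mean(axis=(0, 1), keepdims=True)                  # shape (1, 1, C)
    var = ((x - mu) ** 2).mean(axis=(0, 1), keepdims=True)   # shape (1, 1, C)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta
```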

### Hyperbolic tangent (Tanh)
The tanh function is defined as a ratio: $\tanh(x)=\frac{\sinh(x)}{\cosh(x)}=\frac{\exp(x)-\exp(-x)}{\exp(x)+\exp(-x)}$. Essentially, it squashes any real input into the interval $(-1,1)$, saturating toward $\pm 1$ for large $|x|$.
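
A quick sanity check of this definition against NumPy's built-in tanh (purely illustrative, not part of the reference solution):

```python
import numpy as np

x = np.linspace(-3, 3, 7)
ratio = (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))  # definition above
assert np.allclose(ratio, np.tanh(x))                        # matches np.tanh
print(ratio.round(4))                                        # values squashed into (-1, 1)
```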

### Dynamic Tanh (DyT)
It turns out that LN (layer normalization) produces different portions of a $\tanh(kx)$ curve, where $k$ controls the curvature of the tanh curve in the center. The smaller $k$ is, the smoother the transition from $-1$ to $1$. Hence the study proposes a drop-in replacement for LN given an input tensor $x$:

$$
DyT(x)=\gamma \cdot \tanh(\alpha x)+\beta,
$$

where:
* $\alpha$ - a learnable parameter that scales the input differently depending on its range (tokens with **smaller variance** produce **steeper, less smooth curves**). The authors suggest a **default value** of $0.5$.
* $\gamma, \beta$ - learnable parameters of shape $(C,)$ that scale and shift the output. The authors suggest initializing these vectors with the following **default values**:
  * $\gamma$ as an all-ones vector
  * $\beta$ as an all-zeros vector

Despite not calculating any statistics, DyT preserves LN's "squashing" effect on extreme values in a non-linear fashion, while transforming the central part of the input almost linearly.
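
As a rough illustration of these defaults, here is a sketch of DyT as a small layer-like class (the class name and structure are assumptions; the reference solution below implements the same formula as a plain function):

```python
import numpy as np

class DyT:
    """Sketch of a DyT layer with the suggested default initialization."""

    def __init__(self, num_channels: int, alpha_init: float = 0.5):
        self.alpha = alpha_init                  # scalar, default 0.5
        self.gamma = np.ones(num_channels)       # all-ones vector of shape (C,)
        self.beta = np.zeros(num_channels)       # all-zeros vector of shape (C,)

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # DyT(x) = gamma * tanh(alpha * x) + beta, no statistics required
        return self.gamma * np.tanh(self.alpha * x) + self.beta
```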

Problems/128_dyt/solution.py

Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
import numpy as np


def dynamic_tanh(x: np.ndarray, alpha: float, gamma: np.ndarray, beta: np.ndarray) -> list:
    """
    Applies DyT to an array. Could serve as a replacement
    for layer normalization in Transformers.

    Parameters
    ----------
    x : np.ndarray
        Input tensor of shape (B,T,C)
    alpha : float
        Learnable scalar parameter of the DyT layer
    gamma : np.ndarray
        Learnable scaling parameter vector of shape (C,) of the DyT layer
    beta : np.ndarray
        Learnable shifting parameter vector of shape (C,) of the DyT layer

    Returns
    -------
    x : list
        Input x with DyT applied to it, rounded to 4 decimal places
        and converted to a nested list
    """

    def tanh(x: np.ndarray) -> np.ndarray:
        # tanh expressed through exponentials, as in its definition
        return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

    # DyT(x) = gamma * tanh(alpha * x) + beta
    x = tanh(alpha * x)
    return (x * gamma + beta).round(4).tolist()


def test_dynamic_tanh():
    alpha = .5

    # Test 1
    x = np.array([[[0.14115588, 0.00372817, 0.24126647, 0.22183601],
                   [0.36301332, 0.67681456, 0.3723281 , 0.62767559],
                   [0.94926205, 0.80230257, 0.19737574, 0.04460771],
                   [0.43777021, 0.95744001, 0.60795979, 0.58980314],
                   [0.27250625, 0.48053656, 0.11087151, 0.06228769]],
                  [[0.12620219, 0.63002473, 0.75673539, 0.60411435],
                   [0.3918192 , 0.39810709, 0.42186426, 0.79954607],
                   [0.67730682, 0.96539769, 0.13366266, 0.44462357],
                   [0.31556188, 0.86050486, 0.96060468, 0.43953706],
                   [0.80002165, 0.39582123, 0.35731605, 0.83600622]]])
    gamma, beta = np.ones(shape=(x.shape[2])), np.zeros(shape=(x.shape[2]))
    expected_x = [[[0.0705, 0.0019, 0.1201, 0.1105],
                   [0.1795, 0.3261, 0.184, 0.3039],
                   [0.4419, 0.3809, 0.0984, 0.0223],
                   [0.2155, 0.4452, 0.295, 0.2866],
                   [0.1354, 0.2357, 0.0554, 0.0311]],
                  [[0.063, 0.305, 0.3613, 0.2932],
                   [0.1934, 0.1965, 0.2079, 0.3798],
                   [0.3263, 0.4484, 0.0667, 0.2187],
                   [0.1565, 0.4055, 0.4465, 0.2163],
                   [0.38, 0.1954, 0.1768, 0.3952]]]
    output_x = dynamic_tanh(x, alpha, gamma, beta)
    assert expected_x == output_x, 'Test case 1 failed'

    # Test 2
    x = np.array([[[0.20793482, 0.16989285, 0.03898972],
                   [0.17912554, 0.10962205, 0.3870742],
                   [0.00107181, 0.35807922, 0.15861333]]])
    gamma, beta = np.ones(shape=(x.shape[2])), np.zeros(shape=(x.shape[2]))
    expected_x = [[[0.1036, 0.0847, 0.0195],
                   [0.0893, 0.0548, 0.1912],
                   [0.0005, 0.1772, 0.0791]]]
    output_x = dynamic_tanh(x, alpha, gamma, beta)
    assert expected_x == output_x, 'Test case 2 failed'

    # Test 3
    x = np.array([[[0.94378259]],[[0.97754654]],[[0.36168351]],[[0.51821078]],[[0.76961589]]])
    gamma, beta = np.ones(shape=(x.shape[2])), np.zeros(shape=(x.shape[2]))
    expected_x = [[[0.4397]],[[0.4532]],[[0.1789]],[[0.2535]],[[0.3669]]]
    output_x = dynamic_tanh(x, alpha, gamma, beta)
    assert expected_x == output_x, 'Test case 3 failed'

    print('All tests passed')


if __name__ == '__main__':
    test_dynamic_tanh()
