Commit 103e9a8

Merge branch 'Open-Deep-ML:main' into nesterov
2 parents 6e18f91 + 8317003

74 files changed: +2782 additions, 0 deletions
Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
# Learn Section

## Understanding Bhattacharyya Distance

**Bhattacharyya Distance (BD)** is a statistical measure of the **similarity** or **overlap** between two probability distributions **P(x)** and **Q(x)** defined on the same domain **x**.

This differs from **KL Divergence**, which measures the **loss of information** when one probability distribution is used to approximate another (the reference distribution).

### **Bhattacharyya Distance Formula**

The Bhattacharyya distance is defined as:

$$
BC(P, Q) = \sum_{x} \sqrt{P(x) \cdot Q(x)}
$$

$$
BD(P, Q) = -\ln\big(BC(P, Q)\big)
$$

where **BC(P, Q)** is the **Bhattacharyya coefficient**.

### **Key Properties**

1. **BD is always non-negative**:

$$ BD \geq 0 $$

2. **Symmetric in nature**:

$$ BD(P, Q) = BD(Q, P) $$

3. **Applications**:

- Risk assessment
- Stock predictions
- Feature scaling
- Classification problems

### **Example Calculation**

Consider two probability distributions **P(x)** and **Q(x)**:

$$
P(x) = [0.1, 0.2, 0.3, 0.4], \quad Q(x) = [0.4, 0.3, 0.2, 0.1]
$$

1. **Bhattacharyya Coefficient**:

$$
BC(P, Q) = \sum_{x} \sqrt{P(x) \cdot Q(x)} = \sqrt{0.04} + \sqrt{0.06} + \sqrt{0.06} + \sqrt{0.04} \approx 0.8899
$$

2. **Bhattacharyya Distance**:

$$
BD(P, Q) = -\ln\big(BC(P, Q)\big) = -\ln(0.8899) \approx 0.1166
$$

This illustrates how BD quantifies the **overlap** between two probability distributions.
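The coefficient can be checked term by term; a minimal NumPy sketch of the arithmetic above (variable names are illustrative, not part of the committed solution):

```python
import numpy as np

p = np.array([0.1, 0.2, 0.3, 0.4])
q = np.array([0.4, 0.3, 0.2, 0.1])

terms = np.sqrt(p * q)       # per-bin overlap: [0.2, 0.2449, 0.2449, 0.2]
bc = terms.sum()             # Bhattacharyya coefficient, ~0.8899
bd = -np.log(bc)             # Bhattacharyya distance, ~0.1166

print(terms.round(4), round(bc, 4), round(bd, 4))
```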
Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
import numpy as np

def bhattacharyya_distance(p: list[float], q: list[float]) -> float:

    # Distributions of different lengths (including an empty input) are treated as having no overlap
    if len(p) != len(q):
        return 0.0

    p, q = np.array(p), np.array(q)

    bc = np.sum(np.sqrt(p * q))  # Bhattacharyya coefficient
    bd = -np.log(bc)             # Bhattacharyya distance

    return round(bd, 4)

def test_bhattacharyya_distance() -> None:

    # Test Case 1
    p = [0.1, 0.2, 0.3, 0.4]
    q = [0.4, 0.3, 0.2, 0.1]
    assert bhattacharyya_distance(p, q) == 0.1166

    # Test Case 2
    p = [0.7, 0.2, 0.1]
    q = [0.4, 0.3, 0.3]
    assert bhattacharyya_distance(p, q) == 0.0541

    # Test Case 3: empty input
    p = []
    q = [0.5, 0.4, 0.1]
    assert bhattacharyya_distance(p, q) == 0.0

    # Test Case 4: mismatched lengths
    p = [0.6, 0.4]
    q = [0.1, 0.7, 0.2]
    assert bhattacharyya_distance(p, q) == 0.0

    # Test Case 5
    p = [0.6, 0.2, 0.1, 0.1]
    q = [0.1, 0.2, 0.3, 0.4]
    assert bhattacharyya_distance(p, q) == 0.2007

if __name__ == '__main__':
    test_bhattacharyya_distance()
    print('All Bhattacharyya Distance test cases passed')

Problems/128_dyt/learn.md

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
A recent study (https://arxiv.org/pdf/2503.10622) demonstrates that layer normalization, which is ubiquitous in Transformers, produces tanh-like S-shaped input-output mappings. By replacing normalization with a new layer called "Dynamic Tanh" (DyT for short), Transformers without normalization can match or exceed the performance of their normalized counterparts, mostly without hyperparameter tuning.

### Normalization layer

Consider a standard NLP task, where an input $x$ has a shape of $(B,T,C)$, where $B$ is the batch size, $T$ is the number of tokens (sequence length) and $C$ is the embedding dimension. The output of a normalization layer is generally computed as $norm(x)=\gamma\left(\frac{x-\mu}{\sqrt{\sigma^2+\varepsilon}}\right)+\beta$, where $\gamma$ and $\beta$ are learnable parameters of shape $(C,)$. The distribution statistics are calculated as follows: $\mu_k=\frac{1}{BT}\sum_{i}^{B}\sum_{j}^{T}x_{ijk}$; $\sigma_k^2=\frac{1}{BT}\sum_{i,j}\left(x_{ijk}-\mu_k\right)^2$.

### Hyperbolic tangent (Tanh)

The tanh function is defined as a ratio: $\tanh(x)=\frac{\sinh(x)}{\cosh(x)}=\frac{\exp(x)-\exp(-x)}{\exp(x)+\exp(-x)}$. Essentially, the function maps an arbitrary real domain to $[-1,1]$.

### Dynamic Tanh (DyT)

It turns out that LN (layer normalization) produces different parts of a $\tanh(kx)$ curve, where $k$ controls the curvature of the tanh in the center. The smaller the $k$, the smoother the transition from $-1$ to $1$. Hence the study proposes a drop-in replacement for LN given an input tensor $x$:

$$
DyT(x)=\gamma * \tanh(\alpha x)+\beta,
$$

where:
* $\alpha$ is a learnable parameter that allows scaling the input differently depending on its range (tokens producing **smaller variance** yield **less smooth curves**). The authors suggest a **default value** of $0.5$.
* $\gamma, \beta$ are learnable parameters that scale and shift the output based on the input. The authors suggest initializing these vectors with the following **default values**:
  * $\gamma$ as an all-ones vector
  * $\beta$ as an all-zeros vector

Despite not calculating statistics, DyT preserves the "squashing" effect of LN on extreme values in a non-linear fashion, while transforming the central part of the input almost linearly, as the sketch below illustrates.
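The squashing behaviour is easy to see numerically; a minimal sketch with the suggested default $\alpha = 0.5$ and identity $\gamma$, $\beta$ (the sample inputs are arbitrary):

```python
import numpy as np

alpha = 0.5
x = np.array([-10.0, -1.0, -0.1, 0.0, 0.1, 1.0, 10.0])

out = np.tanh(alpha * x)   # DyT with gamma = 1, beta = 0
print(out.round(4))
# approximately [-0.9999, -0.4621, -0.05, 0.0, 0.05, 0.4621, 0.9999]:
# small inputs are scaled nearly linearly by alpha, while extreme inputs saturate
# toward +/-1 -- the "squashing" effect DyT keeps without computing any statistics
```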

Problems/128_dyt/solution.py

Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
import numpy as np


def dynamic_tanh(x: np.ndarray, alpha: float, gamma: np.ndarray, beta: np.ndarray) -> list:
    """
    Applies DyT to an array. Could serve as a replacement
    for layer normalization in Transformers.

    Parameters
    ----------
    x : np.ndarray
        Input tensor of shape (B, T, C)
    alpha : float
        Learnable scalar parameter of the DyT layer
    gamma : np.ndarray
        Learnable scaling parameter vector of shape (C,) of the DyT layer
    beta : np.ndarray
        Learnable shifting parameter vector of shape (C,) of the DyT layer

    Returns
    -------
    x : list
        Input x with DyT applied to it, rounded to 4 decimal places
    """

    def tanh(x: np.ndarray) -> np.ndarray:
        return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

    x = tanh(alpha * x)
    return (x * gamma + beta).round(4).tolist()


def test_dynamic_tanh():
    alpha = .5

    # Test 1
    x = np.array([[[0.14115588, 0.00372817, 0.24126647, 0.22183601],
                   [0.36301332, 0.67681456, 0.3723281, 0.62767559],
                   [0.94926205, 0.80230257, 0.19737574, 0.04460771],
                   [0.43777021, 0.95744001, 0.60795979, 0.58980314],
                   [0.27250625, 0.48053656, 0.11087151, 0.06228769]],
                  [[0.12620219, 0.63002473, 0.75673539, 0.60411435],
                   [0.3918192, 0.39810709, 0.42186426, 0.79954607],
                   [0.67730682, 0.96539769, 0.13366266, 0.44462357],
                   [0.31556188, 0.86050486, 0.96060468, 0.43953706],
                   [0.80002165, 0.39582123, 0.35731605, 0.83600622]]])
    gamma, beta = np.ones(shape=(x.shape[2])), np.zeros(shape=(x.shape[2]))
    expected_x = [[[0.0705, 0.0019, 0.1201, 0.1105],
                   [0.1795, 0.3261, 0.184, 0.3039],
                   [0.4419, 0.3809, 0.0984, 0.0223],
                   [0.2155, 0.4452, 0.295, 0.2866],
                   [0.1354, 0.2357, 0.0554, 0.0311]],
                  [[0.063, 0.305, 0.3613, 0.2932],
                   [0.1934, 0.1965, 0.2079, 0.3798],
                   [0.3263, 0.4484, 0.0667, 0.2187],
                   [0.1565, 0.4055, 0.4465, 0.2163],
                   [0.38, 0.1954, 0.1768, 0.3952]]]
    output_x = dynamic_tanh(x, alpha, gamma, beta)
    assert expected_x == output_x, 'Test case 1 failed'

    # Test 2
    x = np.array([[[0.20793482, 0.16989285, 0.03898972],
                   [0.17912554, 0.10962205, 0.3870742],
                   [0.00107181, 0.35807922, 0.15861333]]])
    gamma, beta = np.ones(shape=(x.shape[2])), np.zeros(shape=(x.shape[2]))
    expected_x = [[[0.1036, 0.0847, 0.0195],
                   [0.0893, 0.0548, 0.1912],
                   [0.0005, 0.1772, 0.0791]]]
    output_x = dynamic_tanh(x, alpha, gamma, beta)
    assert expected_x == output_x, 'Test case 2 failed'

    # Test 3
    x = np.array([[[0.94378259]], [[0.97754654]], [[0.36168351]], [[0.51821078]], [[0.76961589]]])
    gamma, beta = np.ones(shape=(x.shape[2])), np.zeros(shape=(x.shape[2]))
    expected_x = [[[0.4397]], [[0.4532]], [[0.1789]], [[0.2535]], [[0.3669]]]
    output_x = dynamic_tanh(x, alpha, gamma, beta)
    assert expected_x == output_x, 'Test case 3 failed'

    print('All tests passed')


if __name__ == '__main__':
    test_dynamic_tanh()
Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
## Multi-class Cross-Entropy Loss Implementation

Cross-entropy loss, also known as log loss, measures the performance of a classification model whose output is a probability value between 0 and 1. For multi-class classification tasks, we use the categorical cross-entropy loss.

### Mathematical Background

For a single sample with C classes, the categorical cross-entropy loss is defined as:

$L = -\sum_{c=1}^{C} y_c \log(p_c)$

where:

- $y_c$ is a binary indicator (0 or 1) of whether class $c$ is the correct classification for the sample
- $p_c$ is the predicted probability that the sample belongs to class $c$
- $C$ is the number of classes

### Implementation Requirements

Your task is to implement a function that computes the average cross-entropy loss across multiple samples:

$L_{batch} = -\frac{1}{N}\sum_{n=1}^{N}\sum_{c=1}^{C} y_{n,c} \log(p_{n,c})$

where $N$ is the number of samples in the batch.

### Important Considerations

- Handle numerical stability by adding a small epsilon to avoid $\log(0)$
- Ensure predicted probabilities sum to 1 for each sample
- Return the average loss across all samples
- Handle invalid inputs appropriately

The function should take predicted probabilities and true labels as input and return the average cross-entropy loss.
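As a worked instance of the batch formula, assuming one-hot labels (the numbers below match the "typical predictions" case used in the tests that follow):

```python
import numpy as np

predicted_probs = np.array([[0.7, 0.2, 0.1],
                            [0.3, 0.6, 0.1]])   # each row sums to 1
true_labels = np.array([[1, 0, 0],
                        [0, 1, 0]])             # one-hot targets

# Only the log-probability of the true class survives the inner sum over classes
per_sample = -np.sum(true_labels * np.log(predicted_probs), axis=1)
# per_sample ~ [0.3567, 0.5108]  (= -log 0.7 and -log 0.6)
batch_loss = per_sample.mean()                  # ~ 0.4338
```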
Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
import numpy as np

def compute_cross_entropy_loss(predicted_probs: np.ndarray, true_labels: np.ndarray) -> float:

    # Clip probabilities for numerical stability (avoids log(0))
    epsilon = 1e-15
    predicted_probs = np.clip(predicted_probs, epsilon, 1 - epsilon)

    # Per-sample loss: sum over classes, then average over the batch
    log_probs = np.log(predicted_probs)
    loss = -np.sum(true_labels * log_probs, axis=1)
    return float(np.mean(loss))

def test_compute_cross_entropy_loss():
    # Test case 1: Perfect predictions
    pred1 = np.array([[1, 0, 0], [0, 1, 0]])
    true1 = np.array([[1, 0, 0], [0, 1, 0]])
    expected1 = 0.0
    assert np.isclose(compute_cross_entropy_loss(pred1, true1), expected1), "Test case 1 failed"

    # Test case 2: Completely wrong predictions
    pred2 = np.array([[0.1, 0.8, 0.1], [0.8, 0.1, 0.1]])
    true2 = np.array([[0, 0, 1], [0, 1, 0]])
    expected2 = -np.mean([np.log(0.1), np.log(0.1)])
    assert np.isclose(compute_cross_entropy_loss(pred2, true2), expected2), "Test case 2 failed"

    # Test case 3: Typical predictions
    pred3 = np.array([[0.7, 0.2, 0.1], [0.3, 0.6, 0.1]])
    true3 = np.array([[1, 0, 0], [0, 1, 0]])
    expected3 = -np.mean([np.log(0.7), np.log(0.6)])
    assert np.isclose(compute_cross_entropy_loss(pred3, true3), expected3), "Test case 3 failed"

if __name__ == "__main__":
    test_compute_cross_entropy_loss()
    print("All test cases passed!")
Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
## Implementing Early Stopping Criterion

Early stopping is a regularization technique that helps prevent overfitting in machine learning models. Your task is to implement the early stopping decision logic based on the validation loss history.

### Problem Description

Given a sequence of validation losses from model training, determine if training should be stopped based on the following criteria:

- Training should stop if the validation loss hasn't improved (decreased) for a specified number of epochs (patience)
- An improvement is only counted if the loss decreases by more than a minimum threshold (min_delta)
- The best model is the one with the lowest validation loss

### Example

Consider the following validation losses: [0.9, 0.8, 0.75, 0.77, 0.76, 0.77, 0.78]

- With patience=2 and min_delta=0.01:
  - Best loss is 0.75 at epoch 2
  - No improvement greater than 0.01 for the next 2 epochs
  - Training should stop at epoch 4 (traced step by step in the sketch after the requirements list)

### Function Requirements

- Return both the epoch to stop at and the best epoch
- If no stopping is needed, return the last epoch
- Epochs are 0-indexed
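Tracing the example above makes the patience bookkeeping concrete; a minimal sketch of that bookkeeping only (variable names are illustrative):

```python
val_losses = [0.9, 0.8, 0.75, 0.77, 0.76, 0.77, 0.78]
patience, min_delta = 2, 0.01

best_loss, best_epoch, bad_epochs = float("inf"), 0, 0
for epoch, loss in enumerate(val_losses):
    if loss < best_loss - min_delta:   # an improvement must exceed min_delta
        best_loss, best_epoch, bad_epochs = loss, epoch, 0
    else:
        bad_epochs += 1
    if bad_epochs >= patience:         # no qualifying improvement for `patience` epochs
        break

print(epoch, best_epoch)               # 4 2 -- stop at epoch 4, best model at epoch 2
```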
Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
import numpy as np
from typing import Tuple

def early_stopping(val_losses: list[float], patience: int, min_delta: float) -> Tuple[int, int]:

    best_loss = float('inf')
    best_epoch = 0
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses):
        # Check if current loss is better than best loss by at least min_delta
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1

        # Check if we should stop
        if epochs_without_improvement >= patience:
            return epoch, best_epoch

    # If we never hit the patience threshold, return the last epoch
    return len(val_losses) - 1, best_epoch

def test_early_stopping():

    losses1 = [0.9, 0.8, 0.75, 0.77, 0.76, 0.77, 0.78]
    stop_epoch1, best_epoch1 = early_stopping(losses1, patience=2, min_delta=0.01)
    assert stop_epoch1 == 4 and best_epoch1 == 2, "Test case 1 failed"

    losses2 = [0.9, 0.8, 0.7, 0.6, 0.5]
    stop_epoch2, best_epoch2 = early_stopping(losses2, patience=2, min_delta=0.01)
    assert stop_epoch2 == 4 and best_epoch2 == 4, "Test case 2 failed"

    losses3 = [0.9, 0.8, 0.79, 0.78, 0.77]
    stop_epoch3, best_epoch3 = early_stopping(losses3, patience=2, min_delta=0.1)
    assert stop_epoch3 == 4 and best_epoch3 == 2, "Test case 3 failed"

    losses4 = [0.5, 0.4]
    stop_epoch4, best_epoch4 = early_stopping(losses4, patience=3, min_delta=0.01)
    assert stop_epoch4 == 1 and best_epoch4 == 1, "Test case 4 failed"

    losses5 = [0.5, 0.4, 0.4, 0.4, 0.4]
    stop_epoch5, best_epoch5 = early_stopping(losses5, patience=2, min_delta=0.01)
    assert stop_epoch5 == 3 and best_epoch5 == 1, "Test case 5 failed"

if __name__ == "__main__":
    test_early_stopping()
    print("All test cases passed!")
