**Bhattacharyya Distance (BD)** is a concept in statistics used to measure the **similarity** or **overlap** between two probability distributions **P(x)** and **Q(x)** on the same domain **x**.
This differs from **KL Divergence**, which measures the **information lost** when one probability distribution is used to approximate another (the reference distribution) and which, unlike BD, is not symmetric.
### **Bhattacharyya Distance Formula**
The Bhattacharyya distance is defined as:

$$
BC(P, Q) = \sum_{x} \sqrt{P(x) \cdot Q(x)}
$$

$$
BD(P, Q) = -\ln\big(BC(P, Q)\big)
$$

where **BC(P, Q)** is the **Bhattacharyya coefficient**.
### **Key Properties**
1. **BD is always non-negative**:
   $$ BD(P, Q) \geq 0 $$
2. **Symmetric in nature**:
   $$ BD(P, Q) = BD(Q, P) $$
3. **Applications**:
   - Risk assessment
   - Stock predictions
   - Feature scaling
   - Classification problems
### **Example Calculation**
Consider two probability distributions **P(x)** and **Q(x)** defined over the same discrete domain.
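A minimal sketch of the calculation in Python (NumPy is an assumption, and the two distributions below are illustrative values):

```python
import numpy as np

def bhattacharyya(p, q):
    """Bhattacharyya coefficient and distance between two discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    bc = np.sum(np.sqrt(p * q))  # Bhattacharyya coefficient
    bd = -np.log(bc)             # Bhattacharyya distance
    return bc, bd

# Illustrative distributions over a four-element domain
P = [0.1, 0.2, 0.3, 0.4]
Q = [0.4, 0.3, 0.2, 0.1]
bc, bd = bhattacharyya(P, Q)
print(f"BC = {bc:.4f}, BD = {bd:.4f}")  # BC ≈ 0.8899, BD ≈ 0.1166
```

The more the two distributions overlap, the larger BC becomes (up to 1 for identical distributions) and the smaller BD becomes (down to 0).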
A new study (https://arxiv.org/pdf/2503.10622) demonstrates that layer normalization, which is ubiquitous in Transformers, produces Tanh-like, S-shaped input-output mappings. By replacing normalization layers with a new layer called "Dynamic Tanh" (DyT for short), Transformers without normalization can match or exceed the performance of their normalized counterparts, mostly without hyperparameter tuning.
### Normalization layer
Consider a standard NLP task where the input $x$ has shape $(B, T, C)$, with $B$ the batch size, $T$ the number of tokens (sequence length), and $C$ the embedding dimension. The output of a normalization layer is generally computed as $norm(x)=\gamma\left(\frac{x-\mu}{\sqrt{\sigma^2+\varepsilon}}\right)+\beta$, where $\gamma$ and $\beta$ are learnable parameters of shape $(C,)$. For layer normalization, the statistics are computed per token over the embedding dimension: $\mu_{ij}=\frac{1}{C}\sum_{k=1}^{C}x_{ijk}$ and $\sigma_{ij}^2=\frac{1}{C}\sum_{k=1}^{C}\left(x_{ijk}-\mu_{ij}\right)^2$.
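A minimal sketch of this computation (assuming PyTorch; `layer_norm` here is an illustrative helper, not `torch.nn.LayerNorm`):

```python
import torch

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer normalization over the embedding dimension of a (B, T, C) tensor."""
    mu = x.mean(dim=-1, keepdim=True)                  # per-token mean, shape (B, T, 1)
    var = x.var(dim=-1, unbiased=False, keepdim=True)  # per-token variance, shape (B, T, 1)
    x_hat = (x - mu) / torch.sqrt(var + eps)
    return gamma * x_hat + beta                        # gamma, beta have shape (C,)

x = torch.randn(2, 4, 8)                               # B=2, T=4, C=8
out = layer_norm(x, torch.ones(8), torch.zeros(8))
print(out.shape)                                       # torch.Size([2, 4, 8])
```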
### Hyperbolic tangent (Tanh)
The tanh function is defined as a ratio: $tanh(x)=\frac{sinh(x)}{cosh(x)}=\frac{exp(x)-exp(-x)}{exp(x)+exp(-x)}$. Essentially, it squashes any real-valued input into the range $[-1,1]$.
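A quick illustration of this squashing behaviour (assuming PyTorch):

```python
import torch

x = torch.tensor([-5.0, -1.0, 0.0, 1.0, 5.0])
print(torch.tanh(x))  # tensor([-0.9999, -0.7616,  0.0000,  0.7616,  0.9999])
```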
### Dynamic Tanh (DyT)
It turns out that LN (layer normalization) produces input-output mappings that trace out different parts of a $tanh(kx)$ curve, where $k$ controls the curvature of the tanh curve in the center: the smaller $k$ is, the smoother the transition from $-1$ to $1$. Hence the study proposes a drop-in replacement for LN given an input tensor $x$:

$$
DyT(x)=\gamma \cdot tanh(\alpha x)+\beta,
$$
where:

* $\alpha$ - a learnable scalar parameter that lets the layer scale inputs differently depending on their range (inputs with **smaller variance** end up on **steeper, less smooth** parts of the curve). The authors suggest a **default initial value** of $0.5$.
* $\gamma, \beta$ - learnable per-channel vectors that scale and shift the output. The authors suggest initializing these vectors with the following **default values**:
  * $\gamma$ as an all-ones vector
  * $\beta$ as an all-zeros vector
Despite not computing any activation statistics, DyT preserves LN's non-linear "squashing" effect on extreme values while transforming the central part of the input almost linearly.
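Putting the pieces together, here is a minimal sketch of what a DyT layer could look like (assuming PyTorch; the class name and `alpha_init` argument are illustrative, not the paper's reference implementation):

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Element-wise drop-in replacement for a normalization layer, per the formula above."""
    def __init__(self, num_features: int, alpha_init: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1,), alpha_init))  # learnable scalar, default 0.5
        self.gamma = nn.Parameter(torch.ones(num_features))      # all-ones init
        self.beta = nn.Parameter(torch.zeros(num_features))      # all-zeros init

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # No batch/token statistics: purely element-wise squashing plus an affine transform
        return self.gamma * torch.tanh(self.alpha * x) + self.beta

x = torch.randn(2, 4, 8)  # (B, T, C)
print(DyT(8)(x).shape)    # torch.Size([2, 4, 8])
```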
Cross-entropy loss, also known as log loss, measures the performance of a classification model whose output is a probability value between 0 and 1. For multi-class classification tasks, we use the categorical cross-entropy loss.
### Mathematical Background
For a single sample with $C$ classes, the categorical cross-entropy loss is defined as:

$$
L = -\sum_{c=1}^{C} y_c \log(p_c)
$$

where:

- $y_c$ is a binary indicator (0 or 1) of whether class $c$ is the correct label for the sample
- $p_c$ is the predicted probability that the sample belongs to class $c$
- $C$ is the number of classes
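As a quick worked example (values chosen purely for illustration): with $C = 3$, a one-hot label $y = [0, 1, 0]$, and predicted probabilities $p = [0.1, 0.8, 0.1]$, only the true class contributes to the sum, so $L = -\log(0.8) \approx 0.223$.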
### Implementation Requirements
Your task is to implement a function that computes the average cross-entropy loss across multiple samples:
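As a sketch of what such a function could look like (NumPy, one-hot labels, an $(N, C)$ probability matrix, and the function name are all assumptions here):

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average categorical cross-entropy over N samples.

    y_true: (N, C) one-hot labels; y_pred: (N, C) predicted class probabilities.
    """
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0)  # guard against log(0)
    per_sample = -np.sum(np.asarray(y_true) * np.log(y_pred), axis=1)
    return float(np.mean(per_sample))

y_true = [[1, 0, 0], [0, 1, 0]]
y_pred = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
print(round(categorical_cross_entropy(y_true, y_pred), 4))  # 0.2899
```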
Early stopping is a regularization technique that helps prevent overfitting in machine learning models. Your task is to implement the early stopping decision logic based on the validation loss history.
### Problem Description
Given a sequence of validation losses from model training, determine if training should be stopped based on the following criteria:
- Training should stop if the validation loss hasn't improved (decreased) for a specified number of epochs (patience)
- An improvement is only counted if the loss decreases by more than a minimum threshold (min_delta)
- The best model is the one with the lowest validation loss
### Example
Consider the following validation losses: [0.9, 0.8, 0.75, 0.77, 0.76, 0.77, 0.78]
- With patience=2 and min_delta=0.01:
  - Best loss is 0.75 at epoch 2
  - No improvement > 0.01 for the next 2 epochs
  - Should stop at epoch 4
### Function Requirements
- Return both the epoch to stop at and the best epoch
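A sketch of the decision logic under the criteria above (epochs are 0-indexed and the function name `early_stopping` is illustrative):

```python
def early_stopping(val_losses, patience=2, min_delta=0.01):
    """Return (stop_epoch, best_epoch) for a list of validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    counter = 0
    for epoch, loss in enumerate(val_losses):
        if best_loss - loss > min_delta:      # improvement larger than min_delta
            best_loss, best_epoch = loss, epoch
            counter = 0
        else:
            counter += 1                      # another epoch without improvement
            if counter >= patience:
                return epoch, best_epoch      # patience exhausted: stop here
    return len(val_losses) - 1, best_epoch    # never triggered: trained to the end

print(early_stopping([0.9, 0.8, 0.75, 0.77, 0.76, 0.77, 0.78]))  # (4, 2)
```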