**Bhattacharyya Distance (BD)** is a concept in statistics used to measure the **similarity** or **overlap** between two probability distributions **P(x)** and **Q(x)** defined on the same domain **x**.

This differs from **KL Divergence**, which measures the **loss of information** when one probability distribution is used to approximate another (the reference distribution).
### **Bhattacharyya Distance Formula**

The Bhattacharyya distance is defined as:

$$
BC(P, Q) = \sum_{x} \sqrt{P(x) \cdot Q(x)}
$$

$$
BD(P, Q) = -\ln(BC(P, Q))
$$

where **BC(P, Q)** is the **Bhattacharyya coefficient**.
### **Key Properties**

1. **BD is always non-negative**:
   $$ BD(P, Q) \geq 0 $$
2. **Symmetric in nature**:
   $$ BD(P, Q) = BD(Q, P) $$
3. **Applications**:
   - Risk assessment
   - Stock predictions
   - Feature scaling
   - Classification problems
### **Example Calculation**

Consider two probability distributions **P(x)** and **Q(x)** over the same discrete domain. The computation follows the same two steps as the formulas above; a small code sketch is given below.
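A minimal Python sketch of the calculation, using made-up example distributions (the specific values here are illustrative only):

```python
import numpy as np

# Hypothetical discrete distributions over the same 3-outcome domain.
P = np.array([0.6, 0.3, 0.1])
Q = np.array([0.4, 0.4, 0.2])

# Bhattacharyya coefficient: sum over the domain of sqrt(P(x) * Q(x)).
bc = np.sum(np.sqrt(P * Q))

# Bhattacharyya distance: negative log of the coefficient.
bd = -np.log(bc)

print(f"BC = {bc:.4f}")  # approaches 1 as the distributions overlap more
print(f"BD = {bd:.4f}")  # 0 when P == Q, grows as the overlap shrinks
```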
A new study (https://arxiv.org/pdf/2503.10622) demonstrates that layer normalization, which is ubiquitous in Transformers, produces tanh-like S-shaped input-output mappings. By replacing normalization with a new layer called "Dynamic Tanh" (DyT for short), Transformers without normalization can match or exceed the performance of their normalized counterparts, mostly without hyperparameter tuning.
### Normalization layer

Consider a standard NLP task, where an input $x$ has a shape of $(B,T,C)$, where $B$ is the batch size, $T$ the number of tokens (sequence length) and $C$ the embedding dimension. The output of a normalization layer is generally computed as $norm(x)=\gamma\left(\frac{x-\mu}{\sqrt{\sigma^2+\varepsilon}}\right)+\beta$, where $\gamma$ and $\beta$ are learnable parameters of shape $(C,)$. Normalization layers differ in how the statistics are computed; layer normalization (LN), used in Transformers, computes them per token over the embedding dimension: $\mu_{ij}=\frac{1}{C}\sum_{k=1}^{C}x_{ijk}$; $\sigma_{ij}^2=\frac{1}{C}\sum_{k=1}^{C}\left(x_{ijk}-\mu_{ij}\right)^2$.
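A minimal PyTorch sketch of that computation (shapes and the epsilon value are assumed for illustration):

```python
import torch

def layer_norm(x: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # x: (B, T, C); gamma, beta: (C,). Statistics are computed per token over C.
    mu = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    return gamma * (x - mu) / torch.sqrt(var + eps) + beta
```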
### Hyperbolic tangent (Tanh)

The tanh function is defined as a ratio: $tanh(x)=\frac{sinh(x)}{cosh(x)}=\frac{exp(x)-exp(-x)}{exp(x)+exp(-x)}$. Essentially, the function squashes an arbitrary real input into the range $(-1,1)$.
### Dynamic Tanh (DyT)

It turns out that LN (layer normalization) produces mappings that trace different parts of a $tanh(kx)$ curve, where $k$ controls the curvature of tanh in the center. The smaller the $k$, the smoother the transition from $-1$ to $1$. Hence the study proposes a drop-in replacement for LN given an input tensor $x$:

$$
DyT(x)=\gamma*tanh(\alpha x)+\beta,
$$
where:

* $\alpha$ - learnable scalar that scales the input differently based on its range (tokens with **smaller variance** yield **steeper, less smooth curves**). Authors suggest a **default value** of $0.5$.
* $\gamma, \beta$ - learnable parameters that scale and shift the output. Authors suggest initializing these vectors with the following **default values**:
    * $\gamma$ as an all-ones vector
    * $\beta$ as an all-zeros vector
Despite not calculating statistics, DyT preserves the "squashing" effect of LN on extreme values in a non-linear fashion, while almost linearly transforming central parts of the input.
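A minimal PyTorch sketch of a DyT layer following the formula above (not the authors' official implementation; the module and argument names are illustrative):

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Drop-in replacement for a normalization layer: gamma * tanh(alpha * x) + beta."""

    def __init__(self, num_features: int, alpha_init: float = 0.5):
        super().__init__()
        # Scalar alpha controls the steepness of the tanh curve.
        self.alpha = nn.Parameter(torch.full((1,), alpha_init))
        # Per-channel affine parameters, initialized to ones and zeros as suggested.
        self.gamma = nn.Parameter(torch.ones(num_features))
        self.beta = nn.Parameter(torch.zeros(num_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C); unlike LN, no statistics are computed.
        return self.gamma * torch.tanh(self.alpha * x) + self.beta
```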
<p>
In reinforcement learning, <strong>policy evaluation</strong> is the process of computing the state-value function for a given policy. In a gridworld environment, this involves iteratively updating the value of each state based on the expected return from following the policy.
</p>
<h3>Key Concepts</h3>
<ul>
<li><strong>State-Value Function (V):</strong> The expected return when starting from a state and following a policy.</li>
<li><strong>Policy:</strong> A mapping from states to probabilities for each available action.</li>
</ul>

<h3>Algorithm Overview</h3>
<ol>
<li><strong>Initialization:</strong> Start with an initial guess (e.g., zeros) for the state-value function \( V(s) \).</li>
<li><strong>Iterative Update:</strong> Update the state value for each non-terminal state using the Bellman equation until the maximum change is less than a set threshold.</li>
<li><strong>Terminal States:</strong> For this task, terminal states (the four corners) remain unchanged.</li>
</ol>

<p>
This method provides a foundation for assessing the quality of states under a given policy, which is crucial for many reinforcement learning techniques.
</p>
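<p>
The iterative update above applies the Bellman expectation equation, which in its standard form (with \( \pi \) the policy, \( P \) the transition probabilities, \( R \) the reward and \( \gamma \) the discount factor) reads:
</p>

\[
V(s) = \sum_{a} \pi(a|s) \sum_{s'} P(s'|s,a) \big[ R(s,a,s') + \gamma V(s') \big]
\]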
In reinforcement learning, **policy evaluation** is the process of computing the state-value function for a given policy. For a gridworld environment, this involves iteratively updating the value of each state based on the expected return from following the policy.

## Key Concepts
- **State-Value Function (V):**
  The expected return when starting from a state and following a given policy.

- **Policy:**
  A mapping from states to probabilities of selecting each available action.
The state value is updated with the Bellman expectation equation:

$$
V(s) = \sum_{a} \pi(a|s) \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V(s') \right]
$$

where:

- $\pi(a|s)$ is the probability of taking action $a$ in state $s$,
- $P(s'|s,a)$ is the probability of transitioning to state $s'$,
- $R(s,a,s')$ is the reward for that transition,
- $\gamma$ is the discount factor.
## Algorithm Overview

1. **Initialization:**
   Start with an initial guess (commonly zeros) for the state-value function $V(s)$.

2. **Iterative Update:**
   For each non-terminal state, update the state value using the Bellman expectation equation. Continue updating until the maximum change in value (delta) is less than a given threshold.

3. **Terminal States:**
   For this example, the four corners of the grid are considered terminal, so their values remain unchanged.
This evaluation method is essential for understanding how "good" each state is under a specific policy, and it forms the basis for more advanced reinforcement learning algorithms.
Implement a function that evaluates the state-value function for a 5x5 gridworld under a given policy. In this gridworld, the agent can move in four directions: up, down, left, and right. Each move incurs a constant reward of -1, and terminal states (the four corners) remain unchanged. The policy is provided as a dictionary mapping each state (tuple: (row, col)) to a dictionary of action probabilities.
For each non-terminal state, compute the expected value over all possible actions using the policy. Update the state value iteratively using the Bellman expectation equation until the maximum change across states is below the threshold, ensuring that terminal states remain fixed.
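A minimal Python sketch of such a function, assuming that off-grid moves leave the agent in place, that terminal values stay at 0, and that actions are named `'up'`, `'down'`, `'left'`, `'right'` (these conventions are illustrative, not specified above):

```python
def evaluate_policy(policy, grid_size=5, gamma=1.0, threshold=1e-4):
    """Iterative policy evaluation for a gridworld with a -1 reward per move.

    policy: dict mapping (row, col) -> {action: probability}.
    Returns a dict mapping each state to its estimated value.
    """
    # Terminal states: the four corners; their values stay fixed at 0.
    corners = {(0, 0), (0, grid_size - 1),
               (grid_size - 1, 0), (grid_size - 1, grid_size - 1)}
    moves = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}

    V = {(r, c): 0.0 for r in range(grid_size) for c in range(grid_size)}

    while True:
        delta = 0.0
        for s in V:
            if s in corners:
                continue  # terminal states remain unchanged
            new_v = 0.0
            for action, prob in policy[s].items():
                dr, dc = moves[action]
                nr, nc = s[0] + dr, s[1] + dc
                # Assumed convention: moves off the grid keep the agent in place.
                if not (0 <= nr < grid_size and 0 <= nc < grid_size):
                    nr, nc = s
                # Bellman expectation backup with deterministic transitions.
                new_v += prob * (-1 + gamma * V[(nr, nc)])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < threshold:
            break
    return V


# Example usage: equiprobable random policy over all states.
uniform_policy = {(r, c): {a: 0.25 for a in ('up', 'down', 'left', 'right')}
                  for r in range(5) for c in range(5)}
values = evaluate_policy(uniform_policy)
```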