**Bhattacharyya Distance (BD)** is a concept in statistics used to measure the **similarity** or **overlap** between two probability distributions **P(x)** and **Q(x)** defined on the same domain **x**.

This differs from **KL Divergence**, which measures the **loss of information** when one probability distribution is used to approximate another (the reference distribution).
### **Bhattacharyya Distance Formula**

The Bhattacharyya distance is defined as:

$$
BC(P, Q) = \sum_{x} \sqrt{P(x) \cdot Q(x)}
$$

$$
BD(P, Q) = -\ln(BC(P, Q))
$$

where **BC(P, Q)** is the **Bhattacharyya coefficient**.
### **Key Properties**

1. **BD is always non-negative**:
   $$ BD(P, Q) \geq 0 $$
2. **Symmetric in nature**:
   $$ BD(P, Q) = BD(Q, P) $$
3. **Applications**:
   - Risk assessment
   - Stock predictions
   - Feature scaling
   - Classification problems
### **Example Calculation**

Consider two probability distributions **P(x)** and **Q(x)** over the same discrete domain. The computation follows the same two steps as the formulas above; a small code sketch is given below.
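A minimal Python sketch of the calculation, using made-up example distributions (the specific values here are illustrative only):

```python
import numpy as np

# Hypothetical discrete distributions over the same 3-outcome domain.
P = np.array([0.6, 0.3, 0.1])
Q = np.array([0.4, 0.4, 0.2])

# Bhattacharyya coefficient: sum over the domain of sqrt(P(x) * Q(x)).
bc = np.sum(np.sqrt(P * Q))

# Bhattacharyya distance: negative log of the coefficient.
bd = -np.log(bc)

print(f"BC = {bc:.4f}")  # approaches 1 as the distributions overlap more
print(f"BD = {bd:.4f}")  # 0 when P == Q, grows as the overlap shrinks
```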
A new study (https://arxiv.org/pdf/2503.10622) demonstrates that layer normalization, which is ubiquitous in Transformers, produces tanh-like S-shaped input-output mappings. By replacing normalization with a new layer called "Dynamic Tanh" (DyT for short), Transformers without normalization can match or exceed the performance of their normalized counterparts, mostly without hyperparameter tuning.
### Normalization layer

Consider a standard NLP task, where an input $x$ has a shape of $(B,T,C)$, where $B$ is the batch size, $T$ the number of tokens (sequence length) and $C$ the embedding dimension. The output of a normalization layer is generally computed as $norm(x)=\gamma\left(\frac{x-\mu}{\sqrt{\sigma^2+\varepsilon}}\right)+\beta$, where $\gamma$ and $\beta$ are learnable parameters of shape $(C,)$. Normalization layers differ in how the statistics are computed; layer normalization (LN), used in Transformers, computes them per token over the embedding dimension: $\mu_{ij}=\frac{1}{C}\sum_{k=1}^{C}x_{ijk}$; $\sigma_{ij}^2=\frac{1}{C}\sum_{k=1}^{C}\left(x_{ijk}-\mu_{ij}\right)^2$.
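A minimal PyTorch sketch of that computation (shapes and the epsilon value are assumed for illustration):

```python
import torch

def layer_norm(x: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # x: (B, T, C); gamma, beta: (C,). Statistics are computed per token over C.
    mu = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    return gamma * (x - mu) / torch.sqrt(var + eps) + beta
```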
### Hyperbolic tangent (Tanh)

The tanh function is defined as a ratio: $tanh(x)=\frac{sinh(x)}{cosh(x)}=\frac{exp(x)-exp(-x)}{exp(x)+exp(-x)}$. Essentially, the function squashes an arbitrary real input into the range $(-1,1)$.
### Dynamic Tanh (DyT)

It turns out that LN (layer normalization) produces mappings that trace different parts of a $tanh(kx)$ curve, where $k$ controls the curvature of tanh in the center. The smaller the $k$, the smoother the transition from $-1$ to $1$. Hence the study proposes a drop-in replacement for LN given an input tensor $x$:

$$
DyT(x)=\gamma*tanh(\alpha x)+\beta,
$$
where:

* $\alpha$ - learnable scalar that scales the input differently based on its range (tokens with **smaller variance** yield **steeper, less smooth curves**). Authors suggest a **default value** of $0.5$.
* $\gamma, \beta$ - learnable parameters that scale and shift the output. Authors suggest initializing these vectors with the following **default values**:
    * $\gamma$ as an all-ones vector
    * $\beta$ as an all-zeros vector
Despite not calculating statistics, DyT preserves the "squashing" effect of LN on extreme values in a non-linear fashion, while almost linearly transforming central parts of the input.
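A minimal PyTorch sketch of a DyT layer following the formula above (not the authors' official implementation; the module and argument names are illustrative):

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Drop-in replacement for a normalization layer: gamma * tanh(alpha * x) + beta."""

    def __init__(self, num_features: int, alpha_init: float = 0.5):
        super().__init__()
        # Scalar alpha controls the steepness of the tanh curve.
        self.alpha = nn.Parameter(torch.full((1,), alpha_init))
        # Per-channel affine parameters, initialized to ones and zeros as suggested.
        self.gamma = nn.Parameter(torch.ones(num_features))
        self.beta = nn.Parameter(torch.zeros(num_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C); unlike LN, no statistics are computed.
        return self.gamma * torch.tanh(self.alpha * x) + self.beta
```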
<p>
In reinforcement learning, <strong>policy evaluation</strong> is the process of computing the state-value function for a given policy. In a gridworld environment, this involves iteratively updating the value of each state based on the expected return from following the policy.
</p>
<h3>Key Concepts</h3>
<ul>
<li><strong>State-Value Function (V):</strong> The expected return when starting from a state and following a policy.</li>
<li><strong>Policy:</strong> A mapping from states to probabilities for each available action.</li>
</ul>

<h3>Algorithm Overview</h3>
<ol>
<li><strong>Initialization:</strong> Start with an initial guess (e.g., zeros) for the state-value function \( V(s) \).</li>
<li><strong>Iterative Update:</strong> Update the state value for each non-terminal state using the Bellman equation until the maximum change is less than a set threshold.</li>
<li><strong>Terminal States:</strong> For this task, terminal states (the four corners) remain unchanged.</li>
</ol>

<p>
This method provides a foundation for assessing the quality of states under a given policy, which is crucial for many reinforcement learning techniques.
</p>
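<p>
The iterative update above applies the Bellman expectation equation, which in its standard form (with \( \pi \) the policy, \( P \) the transition probabilities, \( R \) the reward and \( \gamma \) the discount factor) reads:
</p>

\[
V(s) = \sum_{a} \pi(a|s) \sum_{s'} P(s'|s,a) \big[ R(s,a,s') + \gamma V(s') \big]
\]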
In reinforcement learning, **policy evaluation** is the process of computing the state-value function for a given policy. For a gridworld environment, this involves iteratively updating the value of each state based on the expected return from following the policy.

## Key Concepts
- **State-Value Function (V):**
  The expected return when starting from a state and following a given policy.

- **Policy:**
  A mapping from states to probabilities of selecting each available action.
The state value is updated with the Bellman expectation equation:

$$
V(s) = \sum_{a} \pi(a|s) \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma V(s') \right]
$$

where:

- $\pi(a|s)$ is the probability of taking action $a$ in state $s$,
- $P(s'|s,a)$ is the probability of transitioning to state $s'$,
- $R(s,a,s')$ is the reward for that transition,
- $\gamma$ is the discount factor.
## Algorithm Overview

1. **Initialization:**
   Start with an initial guess (commonly zeros) for the state-value function $V(s)$.

2. **Iterative Update:**
   For each non-terminal state, update the state value using the Bellman expectation equation. Continue updating until the maximum change in value (delta) is less than a given threshold.

3. **Terminal States:**
   For this example, the four corners of the grid are considered terminal, so their values remain unchanged.
This evaluation method is essential for understanding how "good" each state is under a specific policy, and it forms the basis for more advanced reinforcement learning algorithms.
Implement a function that evaluates the state-value function for a 5x5 gridworld under a given policy. In this gridworld, the agent can move in four directions: up, down, left, and right. Each move incurs a constant reward of -1, and terminal states (the four corners) remain unchanged. The policy is provided as a dictionary mapping each state (tuple: (row, col)) to a dictionary of action probabilities.
For each non-terminal state, compute the expected value over all possible actions using the policy. Update the state value iteratively using the Bellman expectation equation until the maximum change across states is below the threshold, ensuring that terminal states remain fixed.
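A minimal Python sketch of such a function, assuming that off-grid moves leave the agent in place, that terminal values stay at 0, and that actions are named `'up'`, `'down'`, `'left'`, `'right'` (these conventions are illustrative, not specified above):

```python
def evaluate_policy(policy, grid_size=5, gamma=1.0, threshold=1e-4):
    """Iterative policy evaluation for a gridworld with a -1 reward per move.

    policy: dict mapping (row, col) -> {action: probability}.
    Returns a dict mapping each state to its estimated value.
    """
    # Terminal states: the four corners; their values stay fixed at 0.
    corners = {(0, 0), (0, grid_size - 1),
               (grid_size - 1, 0), (grid_size - 1, grid_size - 1)}
    moves = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}

    V = {(r, c): 0.0 for r in range(grid_size) for c in range(grid_size)}

    while True:
        delta = 0.0
        for s in V:
            if s in corners:
                continue  # terminal states remain unchanged
            new_v = 0.0
            for action, prob in policy[s].items():
                dr, dc = moves[action]
                nr, nc = s[0] + dr, s[1] + dc
                # Assumed convention: moves off the grid keep the agent in place.
                if not (0 <= nr < grid_size and 0 <= nc < grid_size):
                    nr, nc = s
                # Bellman expectation backup with deterministic transitions.
                new_v += prob * (-1 + gamma * V[(nr, nc)])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < threshold:
            break
    return V


# Example usage: equiprobable random policy over all states.
uniform_policy = {(r, c): {a: 0.25 for a in ('up', 'down', 'left', 'right')}
                  for r in range(5) for c in range(5)}
values = evaluate_policy(uniform_policy)
```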