**Bhattacharyya Distance (BD)** is a concept in statistics used to measure the **similarity** or **overlap** between two probability distributions **P(x)** and **Q(x)** on the same domain **x**.
This differs from **KL Divergence**, which measures the **information lost** when one probability distribution is used to approximate another (the reference distribution) and which, unlike BD, is not symmetric.
### **Bhattacharyya Distance Formula**
The Bhattacharyya distance is defined as:

$$
BC(P, Q) = \sum_{x} \sqrt{P(x) \cdot Q(x)}
$$

$$
BD(P, Q) = -\ln\big(BC(P, Q)\big)
$$

where **BC(P, Q)** is the **Bhattacharyya coefficient**.
### **Key Properties**
1. **BD is always non-negative**:
   $$ BD(P, Q) \geq 0 $$
2. **Symmetric in nature**:
   $$ BD(P, Q) = BD(Q, P) $$
3. **Applications**:
   - Risk assessment
   - Stock predictions
   - Feature scaling
   - Classification problems
### **Example Calculation**
Consider two probability distributions **P(x)** and **Q(x)** defined over the same discrete domain.
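A minimal sketch of the calculation in Python (NumPy is an assumption, and the two distributions below are illustrative values):

```python
import numpy as np

def bhattacharyya(p, q):
    """Bhattacharyya coefficient and distance between two discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    bc = np.sum(np.sqrt(p * q))  # Bhattacharyya coefficient
    bd = -np.log(bc)             # Bhattacharyya distance
    return bc, bd

# Illustrative distributions over a four-element domain
P = [0.1, 0.2, 0.3, 0.4]
Q = [0.4, 0.3, 0.2, 0.1]
bc, bd = bhattacharyya(P, Q)
print(f"BC = {bc:.4f}, BD = {bd:.4f}")  # BC ≈ 0.8899, BD ≈ 0.1166
```

The more the two distributions overlap, the larger BC becomes (up to 1 for identical distributions) and the smaller BD becomes (down to 0).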
A new study (https://arxiv.org/pdf/2503.10622) demonstrates that layer normalization, which is ubiquitous in Transformers, produces Tanh-like, S-shaped input-output mappings. By replacing normalization layers with a new layer called "Dynamic Tanh" (DyT for short), Transformers without normalization can match or exceed the performance of their normalized counterparts, mostly without hyperparameter tuning.
### Normalization layer
Consider a standard NLP task where the input $x$ has shape $(B, T, C)$, with $B$ the batch size, $T$ the number of tokens (sequence length), and $C$ the embedding dimension. The output of a normalization layer is generally computed as $norm(x)=\gamma\left(\frac{x-\mu}{\sqrt{\sigma^2+\varepsilon}}\right)+\beta$, where $\gamma$ and $\beta$ are learnable parameters of shape $(C,)$. For layer normalization, the statistics are computed per token over the embedding dimension: $\mu_{ij}=\frac{1}{C}\sum_{k=1}^{C}x_{ijk}$ and $\sigma_{ij}^2=\frac{1}{C}\sum_{k=1}^{C}\left(x_{ijk}-\mu_{ij}\right)^2$.
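A minimal sketch of this computation (assuming PyTorch; `layer_norm` here is an illustrative helper, not `torch.nn.LayerNorm`):

```python
import torch

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer normalization over the embedding dimension of a (B, T, C) tensor."""
    mu = x.mean(dim=-1, keepdim=True)                  # per-token mean, shape (B, T, 1)
    var = x.var(dim=-1, unbiased=False, keepdim=True)  # per-token variance, shape (B, T, 1)
    x_hat = (x - mu) / torch.sqrt(var + eps)
    return gamma * x_hat + beta                        # gamma, beta have shape (C,)

x = torch.randn(2, 4, 8)                               # B=2, T=4, C=8
out = layer_norm(x, torch.ones(8), torch.zeros(8))
print(out.shape)                                       # torch.Size([2, 4, 8])
```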
### Hyperbolic tangent (Tanh)
The tanh function is defined as a ratio: $tanh(x)=\frac{sinh(x)}{cosh(x)}=\frac{exp(x)-exp(-x)}{exp(x)+exp(-x)}$. Essentially, it squashes any real-valued input into the range $[-1,1]$.
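A quick illustration of this squashing behaviour (assuming PyTorch):

```python
import torch

x = torch.tensor([-5.0, -1.0, 0.0, 1.0, 5.0])
print(torch.tanh(x))  # tensor([-0.9999, -0.7616,  0.0000,  0.7616,  0.9999])
```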
### Dynamic Tanh (DyT)
It turns out that LN (layer normalization) produces input-output mappings that trace out different parts of a $tanh(kx)$ curve, where $k$ controls the curvature of the tanh curve in the center: the smaller $k$ is, the smoother the transition from $-1$ to $1$. Hence the study proposes a drop-in replacement for LN given an input tensor $x$:

$$
DyT(x)=\gamma \cdot tanh(\alpha x)+\beta,
$$
where:

* $\alpha$ - a learnable scalar parameter that lets the layer scale inputs differently depending on their range (inputs with **smaller variance** end up on **steeper, less smooth** parts of the curve). The authors suggest a **default initial value** of $0.5$.
* $\gamma, \beta$ - learnable per-channel vectors that scale and shift the output. The authors suggest initializing these vectors with the following **default values**:
  * $\gamma$ as an all-ones vector
  * $\beta$ as an all-zeros vector
Despite not computing any activation statistics, DyT preserves LN's non-linear "squashing" effect on extreme values while transforming the central part of the input almost linearly.
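Putting the pieces together, here is a minimal sketch of what a DyT layer could look like (assuming PyTorch; the class name and `alpha_init` argument are illustrative, not the paper's reference implementation):

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Element-wise drop-in replacement for a normalization layer, per the formula above."""
    def __init__(self, num_features: int, alpha_init: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1,), alpha_init))  # learnable scalar, default 0.5
        self.gamma = nn.Parameter(torch.ones(num_features))      # all-ones init
        self.beta = nn.Parameter(torch.zeros(num_features))      # all-zeros init

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # No batch/token statistics: purely element-wise squashing plus an affine transform
        return self.gamma * torch.tanh(self.alpha * x) + self.beta

x = torch.randn(2, 4, 8)  # (B, T, C)
print(DyT(8)(x).shape)    # torch.Size([2, 4, 8])
```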
Cross-entropy loss, also known as log loss, measures the performance of a classification model whose output is a probability value between 0 and 1. For multi-class classification tasks, we use the categorical cross-entropy loss.
### Mathematical Background
For a single sample with $C$ classes, the categorical cross-entropy loss is defined as:

$$
L = -\sum_{c=1}^{C} y_c \log(p_c)
$$

where:

- $y_c$ is a binary indicator (0 or 1) of whether class $c$ is the correct label for the sample
- $p_c$ is the predicted probability that the sample belongs to class $c$
- $C$ is the number of classes
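As a quick worked example (values chosen purely for illustration): with $C = 3$, a one-hot label $y = [0, 1, 0]$, and predicted probabilities $p = [0.1, 0.8, 0.1]$, only the true class contributes to the sum, so $L = -\log(0.8) \approx 0.223$.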
### Implementation Requirements
Your task is to implement a function that computes the average cross-entropy loss across multiple samples:
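As a sketch of what such a function could look like (NumPy, one-hot labels, an $(N, C)$ probability matrix, and the function name are all assumptions here):

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average categorical cross-entropy over N samples.

    y_true: (N, C) one-hot labels; y_pred: (N, C) predicted class probabilities.
    """
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0)  # guard against log(0)
    per_sample = -np.sum(np.asarray(y_true) * np.log(y_pred), axis=1)
    return float(np.mean(per_sample))

y_true = [[1, 0, 0], [0, 1, 0]]
y_pred = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
print(round(categorical_cross_entropy(y_true, y_pred), 4))  # 0.2899
```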
Early stopping is a regularization technique that helps prevent overfitting in machine learning models. Your task is to implement the early stopping decision logic based on the validation loss history.
### Problem Description
Given a sequence of validation losses from model training, determine if training should be stopped based on the following criteria:
- Training should stop if the validation loss hasn't improved (decreased) for a specified number of epochs (patience)
- An improvement is only counted if the loss decreases by more than a minimum threshold (min_delta)
- The best model is the one with the lowest validation loss
### Example
Consider the following validation losses: [0.9, 0.8, 0.75, 0.77, 0.76, 0.77, 0.78]
- With patience=2 and min_delta=0.01:
  - Best loss is 0.75 at epoch 2
  - No improvement > 0.01 for the next 2 epochs
  - Should stop at epoch 4
### Function Requirements
- Return both the epoch to stop at and the best epoch
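A sketch of the decision logic under the criteria above (epochs are 0-indexed and the function name `early_stopping` is illustrative):

```python
def early_stopping(val_losses, patience=2, min_delta=0.01):
    """Return (stop_epoch, best_epoch) for a list of validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    counter = 0
    for epoch, loss in enumerate(val_losses):
        if best_loss - loss > min_delta:      # improvement larger than min_delta
            best_loss, best_epoch = loss, epoch
            counter = 0
        else:
            counter += 1                      # another epoch without improvement
            if counter >= patience:
                return epoch, best_epoch      # patience exhausted: stop here
    return len(val_losses) - 1, best_epoch    # never triggered: trained to the end

print(early_stopping([0.9, 0.8, 0.75, 0.77, 0.76, 0.77, 0.78]))  # (4, 2)
```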