
Commit 2495768

Merge pull request #373 from nzomi/main
Add batch&group&instance&layer normalization
2 parents 2c079bf + 14be5e8 commit 2495768

File tree

8 files changed, +712 -0 lines changed

Lines changed: 96 additions & 0 deletions
## Understanding Batch Normalization

Batch Normalization (BN) is a widely used technique that helps to accelerate the training of deep neural networks and improve model performance. By normalizing the inputs to each layer so that they have a mean of zero and a variance of one, BN stabilizes the learning process, speeds up convergence, and introduces regularization, which can reduce the need for other forms of regularization like dropout.

### Concepts

Batch Normalization operates on the principle of reducing **internal covariate shift**, which occurs when the distribution of inputs to a layer changes during training as the model weights get updated. This can slow down training and make hyperparameter tuning more challenging. By normalizing the inputs, BN reduces this problem, allowing the model to train faster and more reliably.

The process of Batch Normalization consists of the following steps:
1. **Compute the Mean and Variance:** For each mini-batch, compute the mean and variance of the activations for each feature (dimension).
2. **Normalize the Inputs:** Normalize the activations using the computed mean and variance.
3. **Apply Scale and Shift:** After normalization, apply a learned scale (gamma) and shift (beta) to restore the model's ability to represent the data's original distribution.
4. **Training and Inference:** During training, the mean and variance are computed from the current mini-batch. During inference, a running average of the statistics from the training phase is used.
### Structure of Batch Normalization for BCHW Input

For an input tensor with the shape **BCHW**, where:
- **B**: batch size,
- **C**: number of channels,
- **H**: height,
- **W**: width,

the Batch Normalization statistics are computed over specific dimensions, as described below.
#### 1. Mean and Variance Calculation

- In **Batch Normalization**, we typically normalize the activations **across the batch** and **over the spatial dimensions (height and width)** for each **channel**. This means we calculate the mean and variance **per channel** (C) for the **batch and spatial dimensions** (H, W).

For each channel $c$, we compute the **mean** $\mu_c$ and **variance** $\sigma_c^2$ over the mini-batch and spatial dimensions:

$$
\mu_c = \frac{1}{B \cdot H \cdot W} \sum_{i=1}^{B} \sum_{h=1}^{H} \sum_{w=1}^{W} x_{i,c,h,w}
$$

$$
\sigma_c^2 = \frac{1}{B \cdot H \cdot W} \sum_{i=1}^{B} \sum_{h=1}^{H} \sum_{w=1}^{W} (x_{i,c,h,w} - \mu_c)^2
$$

Where:
- $x_{i,c,h,w}$ is the input activation at batch index $i$, channel $c$, height $h$, and width $w$.
- $B$ is the batch size.
- $H$ and $W$ are the spatial dimensions (height and width).
- $C$ is the number of channels.

The mean and variance are computed **over all spatial positions (H, W)** and **across all samples in the batch (B)** for each **channel (C)**.
#### 2. Normalization

Once the mean $\mu_c$ and variance $\sigma_c^2$ have been computed for each channel, the next step is to **normalize** the input. The normalization is done by subtracting the mean and dividing by the standard deviation (the square root of the variance plus a small constant $\epsilon$ for numerical stability):

$$
\hat{x}_{i,c,h,w} = \frac{x_{i,c,h,w} - \mu_c}{\sqrt{\sigma_c^2 + \epsilon}}
$$

Where:
- $\hat{x}_{i,c,h,w}$ is the normalized activation for the input at batch index $i$, channel $c$, height $h$, and width $w$.
- $\epsilon$ is a small constant to avoid division by zero (for numerical stability).
#### 3. Scale and Shift

After normalization, the next step is to apply a **scale** ($\gamma_c$) and **shift** ($\beta_c$) to the normalized activations for each channel. These learned parameters allow the model to adjust the output distribution of each feature, preserving the flexibility of the original activations.

$$
y_{i,c,h,w} = \gamma_c \hat{x}_{i,c,h,w} + \beta_c
$$

Where:
- $\gamma_c$ is the scaling factor for channel $c$.
- $\beta_c$ is the shifting factor for channel $c$.
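Taken together, steps 1–3 amount to a few array operations. Below is a minimal NumPy sketch, assuming an illustrative randomly generated BCHW tensor (the sizes and variable names are placeholders, not taken from the repository's code):

```python
import numpy as np

# Illustrative BCHW input (sizes chosen arbitrarily for demonstration)
B, C, H, W = 4, 3, 8, 8
x = np.random.randn(B, C, H, W)

eps = 1e-5
gamma = np.ones((1, C, 1, 1))   # learned per-channel scale
beta = np.zeros((1, C, 1, 1))   # learned per-channel shift

# Step 1: per-channel mean and variance over the batch and spatial axes (0, 2, 3)
mu = x.mean(axis=(0, 2, 3), keepdims=True)    # shape (1, C, 1, 1)
var = x.var(axis=(0, 2, 3), keepdims=True)    # shape (1, C, 1, 1)

# Step 2: normalize
x_hat = (x - mu) / np.sqrt(var + eps)

# Step 3: scale and shift
y = gamma * x_hat + beta                      # shape (B, C, H, W)
```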
#### 4. Training and Inference

- **During Training**: The mean and variance are computed for each mini-batch and used for normalization across the batch and spatial dimensions for each channel. Running estimates of these statistics (typically exponential moving averages) are updated alongside.
- **During Inference**: The model uses the running averages of the statistics (mean and variance) accumulated during training to ensure consistent behavior when deployed.
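As a rough sketch of how such running statistics might be maintained (the `RunningStats` class, momentum value, and update rule below are common conventions chosen for illustration, not the repository's implementation):

```python
import numpy as np

class RunningStats:
    """Minimal tracker of running per-channel statistics for inference-time normalization."""
    def __init__(self, num_channels: int, momentum: float = 0.9):
        self.momentum = momentum
        self.running_mean = np.zeros((1, num_channels, 1, 1))
        self.running_var = np.ones((1, num_channels, 1, 1))

    def update(self, x: np.ndarray) -> None:
        # Called once per mini-batch during training.
        batch_mean = x.mean(axis=(0, 2, 3), keepdims=True)
        batch_var = x.var(axis=(0, 2, 3), keepdims=True)
        self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * batch_mean
        self.running_var = self.momentum * self.running_var + (1 - self.momentum) * batch_var

    def normalize(self, x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
        # Used at inference: normalize with the accumulated statistics instead of batch statistics.
        return (x - self.running_mean) / np.sqrt(self.running_var + eps)
```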
### Key Points

- **Normalization Across Batch and Spatial Dimensions**: In Batch Normalization for **BCHW** input, the normalization is done **across the batch (B) and spatial dimensions (H, W)** for each **channel (C)**. This ensures that each feature channel has zero mean and unit variance, making the training process more stable.

- **Channel-wise Normalization**: Batch Normalization normalizes the activations independently for each **channel (C)** because different channels in convolutional layers often have different distributions and should be treated separately.

- **Numerical Stability**: The small constant $\epsilon$ is added to the variance to avoid numerical instability when dividing by the square root of variance, especially when the variance is very small.

- **Improved Gradient Flow**: By reducing internal covariate shift, Batch Normalization allows the gradients to flow more easily during backpropagation, helping the model train faster and converge more reliably.

- **Regularization Effect**: Batch Normalization introduces noise into the training process because it relies on the statistics of a mini-batch. This noise acts as a form of regularization, which can prevent overfitting and improve generalization.

### Why Normalize Over Batch and Spatial Dimensions?

- **Across Batch**: Normalizing across the batch helps to stabilize the input distribution across all samples in a mini-batch. This allows the model to avoid the problem of large fluctuations in the input distribution as weights are updated.

- **Across Spatial Dimensions**: In convolutional networks, the spatial dimensions (height and width) are highly correlated, and normalizing over these dimensions ensures that the activations are distributed consistently throughout the spatial field, helping to maintain a stable learning process.

- **Channel-wise Normalization**: Each channel can have its own distribution of values, and normalization per channel ensures that each feature map is scaled and shifted independently, allowing the model to learn representations that are not overly sensitive to specific channels' scaling.

By normalizing across the batch and spatial dimensions and applying a per-channel transformation, Batch Normalization helps reduce internal covariate shift and speeds up training, leading to faster convergence and better overall model performance.
Lines changed: 78 additions & 0 deletions
import numpy as np

def batch_normalization(X: np.ndarray, gamma: np.ndarray, beta: np.ndarray, epsilon: float = 1e-5) -> np.ndarray:
    """
    Perform Batch Normalization.

    Args:
        X: numpy array of shape (B, C, H, W), input data
        gamma: numpy array broadcastable to X (e.g. shape (1, C, 1, 1)), scale parameter
        beta: numpy array broadcastable to X (e.g. shape (1, C, 1, 1)), shift parameter
        epsilon: small constant to avoid division by zero

    Returns:
        norm_X: numpy array of shape (B, C, H, W), normalized output
    """
    # Compute mean and variance across the batch and spatial dimensions
    mean = np.mean(X, axis=(0, 2, 3), keepdims=True)      # Mean over (B, H, W), one value per channel
    variance = np.var(X, axis=(0, 2, 3), keepdims=True)   # Variance over (B, H, W), one value per channel

    # Normalize
    X_norm = (X - mean) / np.sqrt(variance + epsilon)

    # Scale and shift
    norm_X = gamma * X_norm + beta
    return norm_X

# Test cases for batch normalization
def test_batch_normalizations():
    # Test case: Batch Normalization
    B, C, H, W = 2, 2, 2, 2  # Batch size, Channels, Height, Width
    np.random.seed(42)
    X = np.random.randn(B, C, H, W)
    gamma = np.ones(C).reshape(1, C, 1, 1)
    beta = np.zeros(C).reshape(1, C, 1, 1)

    # Test batch normalization
    actual_output = batch_normalization(X, gamma, beta)
    expected_output = [[[[ 0.42859934, -0.51776438],
                         [ 0.65360963,  1.95820707]],
                        [[ 0.02353721,  0.02355215],
                         [ 1.67355207,  0.93490043]]],
                       [[[-1.01139563,  0.49692747],
                         [-1.00236882, -1.00581468]],
                        [[ 0.45676349, -1.50433085],
                         [-1.33293647, -0.27503802]]]]
    np.testing.assert_array_almost_equal(actual_output, expected_output, decimal=6, err_msg="Test case 1 failed")

    # Test different input
    np.random.seed(101)
    X = np.random.randn(B, C, H, W)
    actual_output = batch_normalization(X, gamma, beta)
    expected_output = [[[[ 1.81773164,  0.16104096],
                         [ 0.38406453,  0.06197112]],
                        [[ 1.00432932, -0.37139956],
                         [-1.12098938,  0.94031919]]],
                       [[[-1.94800122,  0.25029395],
                         [ 0.08188579, -0.80898678]],
                        [[ 0.34878049, -0.99452891],
                         [-1.24171594,  1.43520478]]]]
    np.testing.assert_array_almost_equal(actual_output, expected_output, decimal=6, err_msg="Test case 2 failed")

    # Test different params
    gamma = np.ones(C).reshape(1, C, 1, 1) * 0.5
    beta = np.ones(C).reshape(1, C, 1, 1)
    actual_output = batch_normalization(X, gamma, beta)
    expected_output = [[[[1.90886582, 1.08052048],
                         [1.19203227, 1.03098556]],
                        [[1.50216466, 0.81430022],
                         [0.43950531, 1.4701596 ]]],
                       [[[0.02599939, 1.12514697],
                         [1.04094289, 0.59550661]],
                        [[1.17439025, 0.50273554],
                         [0.37914203, 1.71760239]]]]
    np.testing.assert_array_almost_equal(actual_output, expected_output, decimal=6, err_msg="Test case 3 failed")

if __name__ == "__main__":
    test_batch_normalizations()
    print("All normalization tests passed.")
Lines changed: 109 additions & 0 deletions
## Understanding Group Normalization

Group Normalization (GN) is a normalization technique that divides the channels into groups and normalizes the activations within each group. Unlike Batch Normalization, which normalizes over the entire mini-batch, Group Normalization normalizes over groups of channels and is less dependent on the batch size. This makes it particularly useful for tasks with small batch sizes or when using architectures such as segmentation networks where spatial resolution is important.

### Concepts

Group Normalization operates on the principle of normalizing within smaller groups of channels. The process reduces **internal covariate shift** within these groups and helps stabilize training, especially in scenarios where the batch size is small or varies across tasks.

The process of Group Normalization consists of the following steps:

1. **Divide Channels into Groups:** Split the feature channels into several groups. The number of groups is determined by the **n_groups** parameter.
2. **Compute the Mean and Variance within Each Group:** For each sample, compute the mean and variance of the activations over the channels in the group and the spatial dimensions.
3. **Normalize the Inputs:** Normalize the activations of each group using the computed mean and variance.
4. **Apply Scale and Shift:** After normalization, apply a learned scale (gamma) and shift (beta) to restore the model's ability to represent the data's original distribution.
### Structure of Group Normalization for BCHW Input

For an input tensor with the shape **BCHW**, where:
- **B**: batch size,
- **C**: number of channels,
- **H**: height,
- **W**: width,

the Group Normalization statistics are computed over specific dimensions, as described below.
#### 1. Group Division

- The input feature dimension **C** (channels) is divided into several groups. The number of groups is determined by the **n_groups** parameter, and the size of each group is calculated as:

$$
\text{groupSize} = \frac{C}{n_{\text{groups}}}
$$

Where:
- **C** is the number of channels (assumed to be divisible by **n_groups**).
- **n_groups** is the number of groups into which the channels are divided.
- **groupSize** is the number of channels in each group.

The input tensor is then reshaped to group the channels into the specified groups, for example from (B, C, H, W) to (B, n_groups, groupSize, H, W), as in the sketch below.
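A minimal NumPy sketch of this split, assuming illustrative sizes (the variable names are placeholders, not taken from the repository's implementation):

```python
import numpy as np

B, C, H, W = 2, 6, 4, 4
n_groups = 3
group_size = C // n_groups                # 2 channels per group; assumes C % n_groups == 0

x = np.random.randn(B, C, H, W)
x_grouped = x.reshape(B, n_groups, group_size, H, W)
print(x_grouped.shape)                    # (2, 3, 2, 4, 4)
```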
#### 2. Mean and Variance Calculation within Groups

- For each sample $i$ and each group $g$, the **mean** $\mu_{i,g}$ and **variance** $\sigma_{i,g}^2$ are computed over the channels belonging to that group and over the spatial dimensions. Because the statistics are computed per sample rather than across the batch, they do not depend on the batch size.

$$
\mu_{i,g} = \frac{1}{\text{groupSize} \cdot H \cdot W} \sum_{c \in \text{group } g} \sum_{h=1}^{H} \sum_{w=1}^{W} x_{i,c,h,w}
$$

$$
\sigma_{i,g}^2 = \frac{1}{\text{groupSize} \cdot H \cdot W} \sum_{c \in \text{group } g} \sum_{h=1}^{H} \sum_{w=1}^{W} (x_{i,c,h,w} - \mu_{i,g})^2
$$

Where:
- $x_{i,c,h,w}$ is the activation at batch index $i$, channel $c$, height $h$, and width $w$.
- $i$ indexes the samples in the batch (of size $B$).
- $H$ and $W$ are the spatial dimensions (height and width).
- $\text{groupSize}$ is the number of channels in each group.
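In NumPy (a self-contained sketch with the same illustrative sizes as above), this per-sample, per-group reduction corresponds to averaging over the group's channels and the spatial axes:

```python
import numpy as np

B, C, H, W, n_groups = 2, 6, 4, 4, 3
x = np.random.randn(B, C, H, W)
xg = x.reshape(B, n_groups, C // n_groups, H, W)

# One mean/variance per (sample, group): reduce over channel-within-group, H, and W
mu = xg.mean(axis=(2, 3, 4), keepdims=True)   # shape (B, n_groups, 1, 1, 1)
var = xg.var(axis=(2, 3, 4), keepdims=True)   # shape (B, n_groups, 1, 1, 1)
```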
#### 3. Normalization

Once the mean $\mu_{i,g}$ and variance $\sigma_{i,g}^2$ have been computed for each sample and group, the next step is to **normalize** the input. The normalization is done by subtracting the mean and dividing by the standard deviation (the square root of the variance plus a small constant $\epsilon$ for numerical stability). For a channel $c$ belonging to group $g$:

$$
\hat{x}_{i,c,h,w} = \frac{x_{i,c,h,w} - \mu_{i,g}}{\sqrt{\sigma_{i,g}^2 + \epsilon}}
$$

Where:
- $\hat{x}_{i,c,h,w}$ is the normalized activation for the input at batch index $i$, channel $c$ (in group $g$), height $h$, and width $w$.
- $\epsilon$ is a small constant to avoid division by zero.
#### 4. Scale and Shift

After normalization, the next step is to apply a **scale** ($\gamma_c$) and **shift** ($\beta_c$) to the normalized activations. As in Batch Normalization, these parameters are typically learned per channel, allowing the model to adjust the output distribution of each feature map:

$$
y_{i,c,h,w} = \gamma_c \hat{x}_{i,c,h,w} + \beta_c
$$

Where:
- $\gamma_c$ is the scaling factor for channel $c$.
- $\beta_c$ is the shifting factor for channel $c$.
#### 5. Training and Inference

- **During Training**: The mean and variance are computed per sample, within each group, and used directly for normalization.
- **During Inference**: Group Normalization behaves exactly the same way. Because the statistics do not depend on the other samples in a batch, no running averages are needed, and behavior is consistent between training and deployment.
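Putting the steps together, here is a minimal NumPy sketch of the whole procedure; the function name, argument shapes, and sizes are illustrative assumptions rather than the repository's actual implementation:

```python
import numpy as np

def group_normalization(X: np.ndarray, gamma: np.ndarray, beta: np.ndarray,
                        n_groups: int, epsilon: float = 1e-5) -> np.ndarray:
    """Normalize a (B, C, H, W) input over groups of channels, independently per sample."""
    B, C, H, W = X.shape
    group_size = C // n_groups                        # assumes C is divisible by n_groups

    # Split channels into groups: (B, G, C//G, H, W)
    Xg = X.reshape(B, n_groups, group_size, H, W)

    # Per-sample, per-group statistics over (C//G, H, W)
    mean = Xg.mean(axis=(2, 3, 4), keepdims=True)
    var = Xg.var(axis=(2, 3, 4), keepdims=True)

    # Normalize and restore the original layout
    X_norm = ((Xg - mean) / np.sqrt(var + epsilon)).reshape(B, C, H, W)

    # Per-channel scale and shift (gamma and beta broadcastable to (B, C, H, W))
    return gamma * X_norm + beta

# Example usage with illustrative sizes
X = np.random.randn(2, 6, 4, 4)
gamma = np.ones((1, 6, 1, 1))
beta = np.zeros((1, 6, 1, 1))
out = group_normalization(X, gamma, beta, n_groups=3)
print(out.shape)   # (2, 6, 4, 4)
```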
### Key Points

- **Group-wise Normalization**: Group Normalization normalizes within smaller groups of channels instead of normalizing over the entire batch and all channels. This allows for more stable training in cases with small batch sizes.

- **Number of Groups**: The number of groups is a hyperparameter (**n_groups**) that can significantly affect the model’s performance. It is typically set to divide the total number of channels into groups of equal size.

- **Smaller Batch Sizes**: Group Normalization is less dependent on the batch size, making it ideal for situations where batch sizes are small (e.g., segmentation tasks).

- **Numerical Stability**: As with other normalization techniques, a small constant $\epsilon$ is added to the variance to avoid numerical instability when dividing by the square root of variance.

- **Improved Convergence**: Group Normalization can help improve the gradient flow throughout the network, making it easier to train deep networks with small batch sizes. It also helps speed up convergence and stabilize training.
- **Regularization Effect**: Group Normalization constrains the activation statistics, which can have a mild regularizing effect. However, because its statistics are computed per sample, it does not provide the batch-dependent noise regularization that Batch Normalization does.
### Why Normalize Over Groups?

- **Group-wise Normalization**: By dividing the channels into smaller groups, Group Normalization ensures that each group has a stable distribution of activations, making it effective even when batch sizes are small.

- **Less Dependency on Batch Size**: Unlike Batch Normalization, Group Normalization does not require large batch sizes to compute accurate statistics. This makes it well-suited for tasks such as image segmentation, where large batch sizes may not be feasible.

- **Channel-wise Learning**: Group Normalization allows each group to learn independently, preserving flexibility while also controlling the complexity of normalization over channels.

By normalizing over smaller groups, Group Normalization can reduce internal covariate shift and allow for faster and more stable training, even in situations where Batch Normalization may be less effective due to small batch sizes.
