
Commit 08f842f

Include question on SwiGLU Activation Function
1 parent ca5ad32 commit 08f842f

File tree

10 files changed: +215 −0 lines changed

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
## Problem

Implement a Python function that applies the **SwiGLU activation function** to a NumPy array.

Assume the input array has already been passed through a linear projection and has shape `(batch_size, 2d)`. Round each output to four decimal places and return the result as a NumPy array of shape `(batch_size, d)`.
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
{
    "input": "np.array([[1, -1, 1000, -1000]])",
    "output": "[[1000., 0.]]",
    "reasoning": "The input is of shape (1, 4), so it is split into x1 = [1, -1] and x2 = [1000, -1000]. The sigmoid of 1000 is approximately 1 and the sigmoid of -1000 is approximately 0 due to saturation. Thus, Swish(1000) ≈ 1000 × 1 = 1000 and Swish(-1000) ≈ -1000 × 0 = 0. Then, SwiGLU = x1 * Swish(x2) = [1 × 1000, -1 × 0] = [1000, 0]."
}
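A quick standalone sketch to verify the saturation arithmetic in the reasoning above (illustrative only; the overflow in `exp(1000)` is silenced since it harmlessly produces `inf`, which drives the sigmoid to exactly `0.0`):

```python
import numpy as np

x1 = np.array([1.0, -1.0])
x2 = np.array([1000.0, -1000.0])

# Swish(x2) = x2 * sigmoid(x2); exp(1000) overflows to inf, so sigmoid(-1000) -> 0.0
with np.errstate(over="ignore"):
    swish_x2 = x2 * (1 / (1 + np.exp(-x2)))

print(swish_x2)        # ~ [1000., -0.]  (Swish saturates to 1000 and 0)
print(x1 * swish_x2)   # ~ [1000.,  0.]  -> matches the expected output above
```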
Lines changed: 96 additions & 0 deletions
@@ -0,0 +1,96 @@
## Understanding the SwiGLU Activation Function

As the name suggests, the SwiGLU activation function is a combination of two activations: Swish (implemented as SiLU in PyTorch) and GLU (Gated Linear Unit). It is worth understanding both, because SwiGLU inherits properties from each: the smooth self-gating behavior of Swish and the decoupled gating structure of GLU.

### Swish Activation (Self-Gating)

**Swish**, introduced by Google Brain, is a smooth, self-gated activation function defined as:

$$
\text{Swish}(x) = x \cdot \sigma(x)
$$

where the sigmoid function is:

$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$

In Swish, the same input $x$ is used to:
- **Compute the gate**: $\sigma(x)$
- **Modulate itself**: $x \cdot \sigma(x)$

This is called **self-gating**: the input both **creates** and **passes through** the gate.
24+
**Note:** When written in a PyTorch forward loop, it looks something like -
25+
```bash
26+
import torch.nn.functional as F
27+
28+
def forward(self, x):
29+
x1 = self.fc1(x) # x1 = Wx + b where W, b are learnable params
30+
output = F.silu(x) # output = x1 * sigmoid(x1)
31+
return output # output = (Wx + b) * sigmoid(Wx + b)
32+
```
33+
This essentially means that the gate is learnable, and the model learns the best shape of the activation function.
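To make the shape of the curve concrete, here is a minimal standalone sketch (plain NumPy, no learnable parameters assumed) that evaluates Swish at a few points:

```python
import numpy as np

def swish(x):
    # Swish(x) = x * sigmoid(x)
    return x * (1 / (1 + np.exp(-x)))

# Smooth and slightly non-monotonic: a small negative dip for moderate negative inputs,
# roughly 0 for large negative inputs, and roughly x for large positive inputs.
print(np.round(swish(np.array([-5.0, -1.0, 0.0, 1.0, 2.0, 5.0])), 4))
# approximately [-0.0335 -0.2689  0.      0.7311  1.7616  4.9665]
```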

### Gated Linear Unit (GLU)

**GLU**, introduced in *Language Modeling with Gated Convolutional Networks* (Dauphin et al., 2017), is a gated activation mechanism defined as:

$$
\text{GLU}(x_1, x_2) = x_1 \cdot \sigma(x_2)
$$

Here:
- $x_1$ is the **input signal**.
- $x_2$ is used to **compute the gate** via the sigmoid function.

In practice, both $x_1$ and $x_2$ are obtained by **splitting the output of a single linear layer**:

```python
import torch

def forward(self, x):
    x_proj = self.fc1(x)
    x1, x2 = x_proj.chunk(2, dim=-1)   # x1 = Wx + b, x2 = Vx + c
    output = x1 * torch.sigmoid(x2)    # GLU = x1 · σ(x2)
    return output
```
So GLU can be rewritten as:

$$
\text{GLU}(x) = x_1 \cdot \sigma(x_2)
$$

where:

$$x_1 = W x + b$$
$$x_2 = V x + c$$

This is a learned, cross-gating mechanism: the model learns different parameters for the signal and the gate.
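To make the two parameter sets $W, b$ and $V, c$ explicit, here is a hypothetical sketch that uses two separate linear layers instead of chunking a single projection (the module and layer names are illustrative, not from the paper):

```python
import torch
import torch.nn as nn

class ExplicitGLU(nn.Module):
    """Illustrative GLU with separate projections: GLU(x) = (Wx + b) * sigmoid(Vx + c)."""

    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.signal_proj = nn.Linear(d_in, d_out)  # W, b -> signal x1
        self.gate_proj = nn.Linear(d_in, d_out)    # V, c -> gate input x2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1 = self.signal_proj(x)
        x2 = self.gate_proj(x)
        return x1 * torch.sigmoid(x2)

# Example: a batch of 3 vectors of size 8, gated down to size 4
print(ExplicitGLU(8, 4)(torch.randn(3, 8)).shape)  # torch.Size([3, 4])
```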

## SwiGLU

With Swish and GLU out of the way, it becomes very easy to understand **SwiGLU**. It is defined as:

$$
\text{SwiGLU}(x) = x_1 \cdot \text{Swish}(x_2)
$$

where:

- $x_1, x_2$ are typically obtained by splitting a linear projection of the input (inspired by GLU).
- $\text{Swish}(x_2) = x_2 \cdot \sigma(x_2)$ is the self-gated activation.

So, putting it together:

$$
\text{SwiGLU}(x) = x_1 \cdot (x_2 \cdot \sigma(x_2))
$$

This combines the **signal-gate decoupling** of GLU with the **smooth self-gating** of Swish, and is used in the feed-forward blocks of large-scale models such as Google's PaLM and Meta's LLaMA.
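For context, here is a minimal sketch of how a SwiGLU feed-forward block is commonly wired in LLaMA-style transformers; the three-projection layout and the names `gate_proj`, `up_proj`, and `down_proj` follow common open-source convention and are assumptions here, not part of this problem:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Illustrative SwiGLU FFN: down_proj( up_proj(x) * Swish(gate_proj(x)) )."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_hidden, bias=False)  # x2: gate path
        self.up_proj = nn.Linear(d_model, d_hidden, bias=False)    # x1: signal path
        self.down_proj = nn.Linear(d_hidden, d_model, bias=False)  # back to model width

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU(x) = x1 * Swish(x2), then project back to d_model
        return self.down_proj(self.up_proj(x) * F.silu(self.gate_proj(x)))

# Example: 2 token vectors with d_model = 16 and hidden width 64
print(SwiGLUFeedForward(16, 64)(torch.randn(2, 16)).shape)  # torch.Size([2, 16])
```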

### Why Does It Work?

> Noam Shazeer, the author of the paper, writes: "We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence."

The improvement in performance has only been demonstrated *empirically*, by observing faster convergence during training.
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
{
    "id": "151",
    "title": "Implement SwiGLU activation function",
    "difficulty": "easy",
    "category": "Deep Learning",
    "video": "",
    "likes": "0",
    "dislikes": "0",
    "contributor": [
        {
            "profile_link": "https://github.com/PT-10",
            "name": "PT-10"
        }
    ],
    "pytorch_difficulty": "easy"
}
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
import torch

def SwiGLU(x: torch.Tensor) -> torch.Tensor:
    """
    Apply the SwiGLU activation function.
    Assumes:
    - Input x is a torch tensor of shape (batch_size, 2d)
    - x has already been passed through a linear projection layer

    Returns:
    - Tensor of shape (batch_size, d) after applying SwiGLU:
      x1 * SiLU(x2), where [x1, x2] = split(x)
    """
    x1, x2 = x.chunk(2, dim=-1)
    return x1 * torch.nn.functional.silu(x2)
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
import torch

def SwiGLU(x: torch.Tensor) -> torch.Tensor:
    """
    Apply the SwiGLU activation function.
    Assumes:
    - Input x is a torch tensor of shape (batch_size, 2d)
    - x has already been passed through a linear projection layer

    Returns:
    - Tensor of shape (batch_size, d) after applying SwiGLU:
      x1 * SiLU(x2), where [x1, x2] = split(x)
    """
    # Your code here
    pass
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
[
    {
        "test": "print(torch.round(SwiGLU(torch.tensor([[0., 0., 0., 0.]])), decimals=4))",
        "expected_output": "tensor([[0., 0.]])"
    },
    {
        "test": "print(torch.round(SwiGLU(torch.tensor([[1.0, -1.0, 2.0, -2.0]])), decimals=4))",
        "expected_output": "tensor([[1.7616, 0.2384]])"
    },
    {
        "test": "print(torch.round(SwiGLU(torch.tensor([[1., -1., 1000., -1000.]])), decimals=4))",
        "expected_output": "tensor([[1000., 0.]])"
    },
    {
        "test": "print(torch.round(SwiGLU(torch.tensor([[1., 2., 3., 4.], [5., 6., 7., 8.], [9., 10., 11., 12.]])), decimals=4))",
        "expected_output": "tensor([[2.8577, 7.8561], [34.9681, 47.9839], [98.9984, 119.9992]])"
    }
]
Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
import numpy as np

def SwiGLU(x: np.ndarray) -> np.ndarray:
    """
    Args:
        x: np.ndarray of shape (batch_size, 2d)

    Returns:
        np.ndarray of shape (batch_size, d)
    """
    def sigmoid(x):
        return 1 / (1 + np.exp(-x))

    d = x.shape[1] // 2
    x1 = x[:, :d]
    x2 = x[:, d:]
    return x1 * (x2 * sigmoid(x2))
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
import numpy as np

def SwiGLU(x: np.ndarray) -> np.ndarray:
    """
    Args:
        x: np.ndarray of shape (batch_size, 2d)

    Returns:
        np.ndarray of shape (batch_size, d)
    """
    # Your code here
    pass
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
[
    {
        "test": "print(np.round(SwiGLU(np.zeros((1, 4))), 4))",
        "expected_output": "[[0., 0.]]"
    },
    {
        "test": "print(np.round(SwiGLU(np.array([[1.0, -1.0, 2.0, -2.0]])), 4))",
        "expected_output": "[[1.7616, 0.2384]]"
    },
    {
        "test": "print(np.round(SwiGLU(np.array([[1, -1, 1000, -1000]])), 4))",
        "expected_output": "[[1000., 0.]]"
    },
    {
        "test": "print(np.round(SwiGLU(np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])), 4))",
        "expected_output": "[[2.8577, 7.8561], [34.9681, 47.9839], [98.9983, 119.9993]]"
    }
]
