Commit 74dc570

Merge pull request #500 from Open-Deep-ML/new-q-157
added new Q
2 parents 56fd697 + bcff9d5 commit 74dc570

8 files changed: 132 additions, 123 deletions
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
Write a function that performs one step of value iteration for a given Markov Decision Process (MDP) using the Bellman equation. The function should update the state-value function V(s) for each state based on possible actions, transition probabilities, rewards, and the discount factor gamma. Only use NumPy.
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
{
    "input": "import numpy as np\ntransitions = [\n {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, False)]},\n {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, True)]}\n]\nV = np.array([0.0, 0.0])\ngamma = 0.9\nnew_V = bellman_update(V, transitions, gamma)\nprint(np.round(new_V, 2))",
    "output": "[1. 1.]",
    "reasoning": "For state 0, the best action is to go to state 1 and get a reward of 1. For state 1, taking action 1 gives a reward of 1 and ends the episode, so its value is 1."
}
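As a quick check of the printed output: with $V$ initialised to zeros, $\gamma = 0.9$, and terminal (`done`) transitions contributing no bootstrapped next-state value (as in the reference solution), the update gives

$$
V(0) \leftarrow \max\big(1.0\,(0 + 0.9 \cdot 0),\ 1.0\,(1 + 0.9 \cdot 0)\big) = 1,
\qquad
V(1) \leftarrow \max\big(1.0\,(0 + 0.9 \cdot 0),\ 1.0\,(1 + 0)\big) = 1,
$$

which `np.round(new_V, 2)` prints as `[1. 1.]`.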
Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
# **The Bellman Equation**

The **Bellman equation** is a fundamental recursive equation in reinforcement learning that relates the value of a state to the values of possible next states. It provides the mathematical foundation for key RL algorithms such as value iteration and Q-learning.

---

## **Key Idea**
For each state $s$, the value $V(s)$ is the maximum expected return obtainable by choosing the best action $a$ and then following the optimal policy:

$$
V(s) = \max_{a} \sum_{s'} P(s'|s, a) \left[ R(s, a, s') + \gamma V(s') \right]
$$

Where:
- $V(s)$: value of state $s$
- $a$: possible actions
- $P(s'|s, a)$: probability of moving to state $s'$ from $s$ via $a$
- $R(s, a, s')$: reward for this transition
- $\gamma$: discount factor ($0 \leq \gamma \leq 1$)
- $V(s')$: value of the next state

---

## **How to Use**
1. **For each state:**
    - For each possible action, sum over possible next states, weighting by transition probability.
    - Add the immediate reward and the discounted value of the next state.
    - Choose the action with the highest expected value (for control).
2. **Repeat until values converge** (value iteration) or as part of other RL updates (see the sketch below).
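A minimal sketch of this single-state backup, assuming the `(prob, next_state, reward, done)` transition convention used in this question's starter code (the helper name `bellman_backup` is illustrative only):

```python
def bellman_backup(V, transitions, gamma, s):
    # Max over actions of the expected one-step return from state s;
    # terminal (done) transitions contribute only their immediate reward.
    return max(
        sum(prob * (reward + gamma * (0.0 if done else V[next_s]))
            for prob, next_s, reward, done in transitions[s][a])
        for a in transitions[s]
    )
```

Applying this backup to every state and collecting the results into a new array is exactly one sweep of value iteration.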
---

## **Applications**
- **Value Iteration** and **Policy Iteration** in Markov Decision Processes (MDP)
- **Q-learning** and other RL algorithms
- Calculating the optimal value function and policy in gridworlds, games, and general MDPs

---

## **Why It Matters**
- The Bellman equation formalizes the notion of **optimality** in sequential decision-making.
- It is a backbone for teaching agents to solve environments with rewards, uncertainty, and long-term planning.
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
{
    "id": "157",
    "title": "Implement the Bellman Equation for Value Iteration",
    "difficulty": "medium",
    "category": "Reinforcement Learning",
    "video": "",
    "likes": "0",
    "dislikes": "0",
    "contributor": [
        {
            "profile_link": "https://github.com/moe18",
            "name": "Moe Chabot"
        }
    ]
}
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
import numpy as np

def bellman_update(V, transitions, gamma):
    n_states = len(V)
    new_V = np.zeros_like(V)
    for s in range(n_states):
        # Expected one-step return for each action available in state s
        action_values = []
        for a in transitions[s]:
            total = 0
            for prob, next_s, reward, done in transitions[s][a]:
                # Terminal transitions contribute only the immediate reward
                total += prob * (reward + gamma * (0 if done else V[next_s]))
            action_values.append(total)
        # Greedy Bellman backup: keep the best action's value
        new_V[s] = max(action_values)
    return new_V
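Beyond a single step, the learn section's "repeat until values converge" loop could look roughly like the sketch below, reusing `bellman_update` from the solution above on the first test's MDP; the tolerance and iteration cap are arbitrary illustrative choices:

```python
import numpy as np

transitions = [
    {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, False)]},
    {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, True)]},
]
gamma = 0.9
V = np.zeros(2)
for _ in range(1000):                      # iteration cap (arbitrary)
    new_V = bellman_update(V, transitions, gamma)
    if np.max(np.abs(new_V - V)) < 1e-8:   # stop once a sweep no longer changes V
        break
    V = new_V
print(np.round(new_V, 2))  # converges to roughly [5.26 4.74] for this MDP
```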
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
import numpy as np

def bellman_update(V, transitions, gamma):
    """
    Perform one step of value iteration using the Bellman equation.
    Args:
        V: np.ndarray, state values, shape (n_states,)
        transitions: list of dicts. transitions[s][a] is a list of (prob, next_state, reward, done)
        gamma: float, discount factor
    Returns:
        np.ndarray, updated state values
    """
    # TODO: Implement Bellman update
    pass
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
[
    {
        "test": "import numpy as np\ntransitions = [\n # For state 0\n {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, False)]},\n # For state 1\n {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, True)]}\n]\nV = np.array([0.0, 0.0])\ngamma = 0.9\nnew_V = bellman_update(V, transitions, gamma)\nprint(np.round(new_V, 2))",
        "expected_output": "[1., 1.]"
    },
    {
        "test": "import numpy as np\ntransitions = [\n {0: [(0.8, 0, 5, False), (0.2, 1, 10, False)], 1: [(1.0, 1, 2, False)]},\n {0: [(1.0, 0, 0, False)], 1: [(1.0, 1, 0, True)]}\n]\nV = np.array([0.0, 0.0])\ngamma = 0.5\nnew_V = bellman_update(V, transitions, gamma)\nprint(np.round(new_V, 2))",
        "expected_output": "[6., 0.]"
    }
]
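The second test case ships without a written rationale, so here is the arithmetic behind its expected output as a small stand-alone check (plain Python; the variable names are illustrative only):

```python
# gamma = 0.5 and V starts at zeros, so every bootstrapped next-state term is zero.
v0 = max(
    0.8 * (5 + 0.5 * 0.0) + 0.2 * (10 + 0.5 * 0.0),  # state 0, action 0: 4.0 + 2.0 = 6.0
    1.0 * (2 + 0.5 * 0.0),                           # state 0, action 1: 2.0
)
v1 = max(
    1.0 * (0 + 0.5 * 0.0),                           # state 1, action 0: 0.0
    1.0 * (0 + 0.0),                                 # state 1, action 1 (terminal): 0.0
)
print(v0, v1)  # 6.0 0.0, matching the expected "[6., 0.]"
```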

utils/convert_single_question.py

Lines changed: 31 additions & 123 deletions
@@ -28,130 +28,38 @@

# ── 1️⃣ EDIT YOUR QUESTION HERE ────────────────────────────────────────────
QUESTION_DICT: Dict[str, Any] = {
- "id": "143",
- "title": "Instance Normalization (IN) Implementation",
- "description": "Implement the Instance Normalization operation for 4D tensors (B, C, H, W) using NumPy. For each instance in the batch and each channel, normalize the spatial dimensions (height and width) by subtracting the mean and dividing by the standard deviation, then apply a learned scale (gamma) and shift (beta).",
- "test_cases": [
- {
- "test": "import numpy as np\nB, C, H, W = 2, 2, 2, 2\nnp.random.seed(42)\nX = np.random.randn(B, C, H, W)\ngamma = np.ones(C)\nbeta = np.zeros(C)\nout = instance_normalization(X, gamma, beta)\nprint(np.round(out[1][1], 4))",
- "expected_output": "[[ 1.4005, -1.0503] [-0.8361, 0.486 ]]"
- },
- {
- "test": "import numpy as np\nB, C, H, W = 2, 2, 2, 2\nnp.random.seed(101)\nX = np.random.randn(B, C, H, W)\ngamma = np.ones(C)\nbeta = np.zeros(C)\nout = instance_normalization(X, gamma, beta)\nprint(np.round(out[1][0], 4))",
- "expected_output": "[[-1.537, 0.9811], [ 0.7882, -0.2323]]"
- },
- {
- "test": "import numpy as np\nB, C, H, W = 2, 2, 2, 2\nnp.random.seed(101)\nX = np.random.randn(B, C, H, W)\ngamma = np.ones(C) * 0.5\nbeta = np.ones(C)\nout = instance_normalization(X, gamma, beta)\nprint(np.round(out[0][0], 4))",
- "expected_output": "[[1.8542, 0.6861], [0.8434, 0.6163]]"
- }
- ],
- "solution": "import numpy as np\n\ndef instance_normalization(X: np.ndarray, gamma: np.ndarray, beta: np.ndarray, epsilon: float = 1e-5) -> np.ndarray:\n # Reshape gamma, beta for broadcasting: (1, C, 1, 1)\n gamma = gamma.reshape(1, -1, 1, 1)\n beta = beta.reshape(1, -1, 1, 1)\n mean = np.mean(X, axis=(2, 3), keepdims=True)\n var = np.var(X, axis=(2, 3), keepdims=True)\n X_norm = (X - mean) / np.sqrt(var + epsilon)\n return gamma * X_norm + beta",
- "example": {
- "input": "import numpy as np\nB, C, H, W = 2, 2, 2, 2\nnp.random.seed(42)\nX = np.random.randn(B, C, H, W)\ngamma = np.ones(C)\nbeta = np.zeros(C)\nout = instance_normalization(X, gamma, beta)\nprint(np.round(out, 8))",
- "output": "[[[[-0.08841405 -0.50250083]\n [ 0.01004046 0.58087442]]\n\n [[-0.43833369 -0.43832346]\n [ 0.69114093 0.18551622]]]\n\n [[[-0.17259136 0.51115219]\n [-0.16849938 -0.17006144]]\n\n [[ 0.73955155 -0.55463639]\n [-0.44152783 0.25661268]]]]",
- "reasoning": "The function normalizes each instance and channel across (H, W), then applies the gamma and beta scaling/shifting parameters. This matches standard InstanceNorm behavior."
+ "id": "157",
+ "title": "Implement the Bellman Equation for Value Iteration",
+ "description": "Write a function that performs one step of value iteration for a given Markov Decision Process (MDP) using the Bellman equation. The function should update the state-value function V(s) for each state based on possible actions, transition probabilities, rewards, and the discount factor gamma. Only use NumPy.",
+ "test_cases": [
+ {
+ "test": "import numpy as np\ntransitions = [\n # For state 0\n {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, False)]},\n # For state 1\n {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, True)]}\n]\nV = np.array([0.0, 0.0])\ngamma = 0.9\nnew_V = bellman_update(V, transitions, gamma)\nprint(np.round(new_V, 2))",
+ "expected_output": "[1., 1.]"
},
- "category": "Deep Learning",
- "starter_code": "import numpy as np\n\ndef instance_normalization(X: np.ndarray, gamma: np.ndarray, beta: np.ndarray, epsilon: float = 1e-5) -> np.ndarray:\n \"\"\"\n Perform Instance Normalization over a 4D tensor X of shape (B, C, H, W).\n gamma: scale parameter of shape (C,)\n beta: shift parameter of shape (C,)\n epsilon: small value for numerical stability\n Returns: normalized array of same shape as X\n \"\"\"\n # TODO: Implement Instance Normalization\n pass",
- "learn_section": r"""## Understanding Instance Normalization
-
- Instance Normalization (IN) is a normalization technique primarily used in image generation and style transfer tasks. Unlike Batch Normalization or Group Normalization, Instance Normalization normalizes each individual sample (or instance) separately, across its spatial dimensions. This is particularly effective in applications like style transfer, where normalization is needed per image to preserve the content while allowing different styles to be applied.
-
- ### Concepts
-
- Instance Normalization operates on the principle of normalizing each individual sample independently. This helps to remove the style information from the images, leaving only the content. By normalizing each instance, the method allows the model to focus on the content of the image rather than the variations between images in a batch.
-
- The process of Instance Normalization consists of the following steps:
-
- 1. **Compute the Mean and Variance for Each Instance:** For each instance (image), compute the mean and variance across its spatial dimensions.
- 2. **Normalize the Inputs:** Normalize each instance using the computed mean and variance.
- 3. **Apply Scale and Shift:** After normalization, apply a learned scale (gamma) and shift (beta) to restore the model's ability to represent the data's original distribution.
-
- ### Structure of Instance Normalization for BCHW Input
-
- For an input tensor with the shape **BCHW** , where:
- - **B**: batch size,
- - **C**: number of channels,
- - **H**: height,
- - **W**: width,
- Instance Normalization operates on the spatial dimensions (height and width) of each instance (image) separately.
-
- #### 1. Mean and Variance Calculation for Each Instance
-
- - For each individual instance in the batch (for each **b** in **B**), the **mean** $\mu_b$ and **variance** $\sigma_b^2$ are computed across the spatial dimensions (height and width), but **independently for each channel**.
-
- $$
- \mu_b = \frac{1}{H \cdot W} \sum_{h=1}^{H} \sum_{w=1}^{W} x_{b,c,h,w}
- $$
-
- $$
- \sigma_b^2 = \frac{1}{H \cdot W} \sum_{h=1}^{H} \sum_{w=1}^{W} (x_{b,c,h,w} - \mu_b)^2
- $$
-
- Where:
- - $x_{b,c,h,w}$ is the activation at batch index $b$, channel $c$, height $h$, and width $w$.
- - $H$ and $W$ are the spatial dimensions (height and width).
-
- #### 2. Normalization
-
- Once the mean $\mu_b$ and variance $\sigma_b^2$ have been computed for each instance, the next step is to **normalize** the input for each instance across the spatial dimensions (height and width), for each channel:
-
- $$
- \hat{x}_{b,c,h,w} = \frac{x_{b,c,h,w} - \mu_b}{\sqrt{\sigma_b^2 + \epsilon}}
- $$
-
- Where:
- - $\hat{x}_{b,c,h,w}$ is the normalized activation for the input at batch index $b$, channel index $c$, height $h$, and width $w$.
- - $\epsilon$ is a small constant added to the variance for numerical stability.
-
- #### 3. Scale and Shift
-
- After normalization, the next step is to apply a **scale** ($\gamma_c$) and **shift** ($\beta_c$) to the normalized activations for each channel. These learned parameters allow the model to adjust the output distribution for each channel:
-
- $$
- y_{b,c,h,w} = \gamma_c \hat{x}_{b,c,h,w} + \beta_c
- $$
-
- Where:
- - $\gamma_c$ is the scaling factor for channel $c$.
- - $\beta_c$ is the shifting factor for channel $c$.
-
- #### 4. Training and Inference
-
- - **During Training**: The mean and variance are computed for each instance in the mini-batch and used for normalization.
- - **During Inference**: The model uses the running averages of the statistics (mean and variance) computed during training to ensure consistent behavior in production.
-
- ### Key Points
-
- - **Instance-wise Normalization**: Instance Normalization normalizes each image independently, across its spatial dimensions (height and width) and across the channels.
-
- - **Style Transfer**: This normalization technique is widely used in **style transfer** tasks, where each image must be normalized independently to allow for style information to be adjusted without affecting the content.
-
- - **Batch Independence**: Instance Normalization does not depend on the batch size, as normalization is applied per instance, making it suitable for tasks where per-image normalization is critical.
-
- - **Numerical Stability**: A small constant $\epsilon$ is added to the variance to avoid numerical instability when dividing by the square root of the variance.
-
- - **Improved Training in Style-Related Tasks**: Instance Normalization helps to remove unwanted style-related variability across different images, allowing for better performance in tasks like style transfer, where the goal is to separate content and style information.
-
- ### Why Normalize Over Instances?
-
- - **Content Preservation**: By normalizing each image individually, Instance Normalization allows the model to preserve the content of the images while adjusting the style. This makes it ideal for style transfer and other image manipulation tasks.
-
- - **Batch Independence**: Unlike Batch Normalization, which requires large batch sizes to compute statistics, Instance Normalization normalizes each image independently, making it suitable for tasks where the batch size is small or varies.
-
- - **Reducing Style Variability**: Instance Normalization removes the variability in style information across a batch, allowing for a consistent representation of content across different images.
-
- In summary, Instance Normalization is effective for image-based tasks like style transfer, where the goal is to normalize each image independently to preserve its content while allowing style modifications.""",
- "contributor": [
- {
- "profile_link": "https://github.com/nzomi",
- "name": "nzomi"
- }
- ],
- "likes": "0",
- "dislikes": "0",
- "difficulty": "medium",
- "video": ""
+ {
+ "test": "import numpy as np\ntransitions = [\n {0: [(0.8, 0, 5, False), (0.2, 1, 10, False)], 1: [(1.0, 1, 2, False)]},\n {0: [(1.0, 0, 0, False)], 1: [(1.0, 1, 0, True)]}\n]\nV = np.array([0.0, 0.0])\ngamma = 0.5\nnew_V = bellman_update(V, transitions, gamma)\nprint(np.round(new_V, 2))",
+ "expected_output": "[6., 0.]"
+ }
+ ],
+ "solution": "import numpy as np\n\ndef bellman_update(V, transitions, gamma):\n n_states = len(V)\n new_V = np.zeros_like(V)\n for s in range(n_states):\n action_values = []\n for a in transitions[s]:\n total = 0\n for prob, next_s, reward, done in transitions[s][a]:\n total += prob * (reward + gamma * (0 if done else V[next_s]))\n action_values.append(total)\n new_V[s] = max(action_values)\n return new_V",
+ "example": {
+ "input": "import numpy as np\ntransitions = [\n {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, False)]},\n {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, True)]}\n]\nV = np.array([0.0, 0.0])\ngamma = 0.9\nnew_V = bellman_update(V, transitions, gamma)\nprint(np.round(new_V, 2))",
+ "output": "[1. 1.]",
+ "reasoning": "For state 0, the best action is to go to state 1 and get a reward of 1. For state 1, taking action 1 gives a reward of 1 and ends the episode, so its value is 1."
+ },
+ "category": "Reinforcement Learning",
+ "starter_code": "import numpy as np\n\ndef bellman_update(V, transitions, gamma):\n \"\"\"\n Perform one step of value iteration using the Bellman equation.\n Args:\n V: np.ndarray, state values, shape (n_states,)\n transitions: list of dicts. transitions[s][a] is a list of (prob, next_state, reward, done)\n gamma: float, discount factor\n Returns:\n np.ndarray, updated state values\n \"\"\"\n # TODO: Implement Bellman update\n pass",
+ "learn_section": "# **The Bellman Equation**\n\nThe **Bellman equation** is a fundamental recursive equation in reinforcement learning that relates the value of a state to the values of possible next states. It provides the mathematical foundation for key RL algorithms such as value iteration and Q-learning.\n\n---\n\n## **Key Idea**\nFor each state $s$, the value $V(s)$ is the maximum expected return obtainable by choosing the best action $a$ and then following the optimal policy:\n\n$$\nV(s) = \\max_{a} \\sum_{s'} P(s'|s, a) \\left[ R(s, a, s') + \\gamma V(s') \\right]\n$$\n\nWhere:\n- $V(s)$: value of state $s$\n- $a$: possible actions\n- $P(s'|s, a)$: probability of moving to state $s'$ from $s$ via $a$\n- $R(s, a, s')$: reward for this transition\n- $\\gamma$: discount factor ($0 \\leq \\gamma \\leq 1$)\n- $V(s')$: value of next state\n\n---\n\n## **How to Use**\n1. **For each state:**\n - For each possible action, sum over possible next states, weighting by transition probability.\n - Add the immediate reward and the discounted value of the next state.\n - Choose the action with the highest expected value (for control).\n2. **Repeat until values converge** (value iteration) or as part of other RL updates.\n\n---\n\n## **Applications**\n- **Value Iteration** and **Policy Iteration** in Markov Decision Processes (MDP)\n- **Q-learning** and other RL algorithms\n- Calculating the optimal value function and policy in gridworlds, games, and general MDPs\n\n---\n\n## **Why It Matters**\n- The Bellman equation formalizes the notion of **optimality** in sequential decision-making.\n- It is a backbone for teaching agents to solve environments with rewards, uncertainty, and long-term planning.",
+ "contributor": [
+ {
+ "profile_link": "https://github.com/moe18",
+ "name": "Moe Chabot"
+ }
+ ],
+ "likes": "0",
+ "dislikes": "0",
+ "difficulty": "medium",
+ "video": ""
}