
Commit 52e6105

Merge pull request #477 from Open-Deep-ML/add-q-142
added new question
2 parents 2b8a688 + a070a0b

File tree

8 files changed: +155 −17 lines

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
Implement policy evaluation for a 5x5 gridworld. Given a policy (mapping each state to action probabilities), compute the state-value function $V(s)$ for each cell using the Bellman expectation equation. The agent can move up, down, left, or right, receiving a constant reward of -1 for each move. Terminal states (the four corners) are fixed at 0. Iterate until the largest change in $V$ is less than a given threshold. Only use Python built-ins and no external RL libraries.
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
{
    "input": "policy = {(i, j): {'up': 0.25, 'down': 0.25, 'left': 0.25, 'right': 0.25} for i in range(5) for j in range(5)}\ngamma = 0.9\nthreshold = 0.001\nV = gridworld_policy_evaluation(policy, gamma, threshold)\nprint(round(V[2][2], 4))",
    "output": "-7.0902",
    "reasoning": "The policy is uniform (equal chance of each move). The agent receives -1 per step. After iterative updates, the center state value converges to about -7.09, and corners remain at 0."
}
Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
# Gridworld Policy Evaluation

In reinforcement learning, **policy evaluation** is the process of computing the state-value function for a given policy. For a gridworld environment, this involves iteratively updating the value of each state based on the expected return following the policy.

## Key Concepts

- **State-Value Function (V):**
  The expected return when starting from a state and following a given policy.

- **Policy:**
  A mapping from states to probabilities of selecting each available action.

- **Bellman Expectation Equation:**
  For each state $s$:
  $$
  V(s) = \sum_{a} \pi(a|s) \sum_{s'} P(s'|s,a) [R(s,a,s') + \gamma V(s')]
  $$
  where:
  - $\pi(a|s)$ is the probability of taking action $a$ in state $s$,
  - $P(s'|s,a)$ is the probability of transitioning to state $s'$,
  - $R(s,a,s')$ is the reward for that transition,
  - $\gamma$ is the discount factor.

## Algorithm Overview

1. **Initialization:**
   Start with an initial guess (commonly zeros) for the state-value function $V(s)$.

2. **Iterative Update:**
   For each non-terminal state, update the state value using the Bellman expectation equation. Continue updating until the maximum change in value (delta) is less than a given threshold.

3. **Terminal States:**
   For this example, the four corners of the grid are considered terminal, so their values remain unchanged.

This evaluation method is essential for understanding how "good" each state is under a specific policy, and it forms the basis for more advanced reinforcement learning algorithms.
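To make the Bellman expectation update concrete, here is a minimal, illustrative sketch of a single backup for one non-terminal state under the uniform policy from the example. It is not part of the committed files, and the helper name `bellman_backup` is hypothetical; the 5x5 grid, the -1 step reward, and the stay-in-place behaviour at the walls follow the problem statement above.

def bellman_backup(V, i, j, policy, gamma, grid_size=5, reward=-1):
    # Hypothetical helper, for illustration only: one Bellman expectation
    # backup for state (i, j) under the given policy.
    actions = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}
    v = 0.0
    for action, prob in policy[(i, j)].items():
        di, dj = actions[action]
        # A move that would leave the grid keeps the agent in place.
        ni = i + di if 0 <= i + di < grid_size else i
        nj = j + dj if 0 <= j + dj < grid_size else j
        v += prob * (reward + gamma * V[ni][nj])
    return v

# One sweep's update of the centre cell, starting from an all-zero value function:
V0 = [[0.0] * 5 for _ in range(5)]
uniform = {(i, j): {'up': 0.25, 'down': 0.25, 'left': 0.25, 'right': 0.25} for i in range(5) for j in range(5)}
print(bellman_backup(V0, 2, 2, uniform, gamma=0.9))  # -1.0 after the first backup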
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
{
    "id": "142",
    "title": "Gridworld Policy Evaluation",
    "difficulty": "medium",
    "category": "Reinforcement Learning",
    "video": "",
    "likes": "0",
    "dislikes": "0",
    "contributor": [
        {
            "profile_link": "https://github.com/arpitsinghgautam",
            "name": "Arpit Singh Gautam"
        }
    ]
}
Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
def gridworld_policy_evaluation(policy: dict, gamma: float, threshold: float) -> list[list[float]]:
    grid_size = 5
    V = [[0.0 for _ in range(grid_size)] for _ in range(grid_size)]
    actions = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}
    reward = -1
    while True:
        delta = 0.0
        new_V = [row[:] for row in V]
        for i in range(grid_size):
            for j in range(grid_size):
                if (i, j) in [(0, 0), (0, grid_size-1), (grid_size-1, 0), (grid_size-1, grid_size-1)]:
                    continue
                v = 0.0
                for action, prob in policy[(i, j)].items():
                    di, dj = actions[action]
                    ni = i + di if 0 <= i + di < grid_size else i
                    nj = j + dj if 0 <= j + dj < grid_size else j
                    v += prob * (reward + gamma * V[ni][nj])
                new_V[i][j] = v
                delta = max(delta, abs(V[i][j] - new_V[i][j]))
        V = new_V
        if delta < threshold:
            break
    return V
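A short usage sketch for the reference solution above; the uniform policy, gamma, and threshold mirror the example and the first test case, so the centre value should round to -7.0902 and the corner values should stay at 0.0.

policy = {(i, j): {'up': 0.25, 'down': 0.25, 'left': 0.25, 'right': 0.25} for i in range(5) for j in range(5)}
V = gridworld_policy_evaluation(policy, gamma=0.9, threshold=0.001)
print(round(V[2][2], 4))   # expected: -7.0902, matching the example output
print(V[0][0], V[4][4])    # terminal corners remain 0.0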
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
def gridworld_policy_evaluation(policy: dict, gamma: float, threshold: float) -> list[list[float]]:
    """
    Evaluate state-value function for a policy on a 5x5 gridworld.

    Args:
        policy: dict mapping (row, col) to action probability dicts
        gamma: discount factor
        threshold: convergence threshold
    Returns:
        5x5 list of floats
    """
    # Your code here
    pass
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
[
    {
        "test": "grid_size = 5\ngamma = 0.9\nthreshold = 0.001\npolicy = {(i, j): {'up': 0.25, 'down': 0.25, 'left': 0.25, 'right': 0.25} for i in range(grid_size) for j in range(grid_size)}\nV = gridworld_policy_evaluation(policy, gamma, threshold)\nprint([round(V[2][2], 4), V[0][0], V[0][4], V[4][0], V[4][4]])",
        "expected_output": "[-7.0902, 0.0, 0.0, 0.0, 0.0]"
    },
    {
        "test": "grid_size = 5\ngamma = 0.9\nthreshold = 0.001\npolicy = {(i, j): {'up': 0.1, 'down': 0.4, 'left': 0.1, 'right': 0.4} for i in range(grid_size) for j in range(grid_size)}\nV = gridworld_policy_evaluation(policy, gamma, threshold)\nprint(round(V[1][3], 4) < 0)",
        "expected_output": "True"
    }
]

utils/convert_single_question.py

Lines changed: 52 additions & 17 deletions
@@ -28,41 +28,76 @@

 # ── 1️⃣ EDIT YOUR QUESTION HERE ────────────────────────────────────────────
 QUESTION_DICT: Dict[str, Any] = {
-    "id": "141",
-    "description": "Write a Python function `convert_range` that shifts and scales the values of a NumPy array from their original range $[a, b]$ (where $a=\\min(x)$ and $b=\\max(x)$) to a new target range $[c, d]$. Your function should work for both 1D and 2D arrays, returning an array of the same shape, and only use NumPy. Return floating-point results, and ensure you use the correct formula to map the input interval to the output interval.",
+    "id": "142",
+    "title": "Gridworld Policy Evaluation",
+    "description": "Implement policy evaluation for a 5x5 gridworld. Given a policy (mapping each state to action probabilities), compute the state-value function $V(s)$ for each cell using the Bellman expectation equation. The agent can move up, down, left, or right, receiving a constant reward of -1 for each move. Terminal states (the four corners) are fixed at 0. Iterate until the largest change in $V$ is less than a given threshold. Only use Python built-ins and no external RL libraries.",
     "test_cases": [
         {
-            "test": "import numpy as np\nseq = np.array([388, 242, 124, 384, 313, 277, 339, 302, 268, 392])\nc, d = 0, 1\nout = convert_range(seq, c, d)\nprint(np.round(out, 6))",
-            "expected_output": "[0.985075, 0.440299, 0., 0.970149, 0.705224, 0.570896, 0.802239, 0.664179, 0.537313, 1. ]"
+            "test": "grid_size = 5\ngamma = 0.9\nthreshold = 0.001\npolicy = {(i, j): {'up': 0.25, 'down': 0.25, 'left': 0.25, 'right': 0.25} for i in range(grid_size) for j in range(grid_size)}\nV = gridworld_policy_evaluation(policy, gamma, threshold)\nprint([round(V[2][2], 4), V[0][0], V[0][4], V[4][0], V[4][4]])",
+            "expected_output": "[-7.0902, 0.0, 0.0, 0.0, 0.0]"
         },
         {
-            "test": "import numpy as np\nseq = np.array([[2028, 4522], [1412, 2502], [3414, 3694], [1747, 1233], [1862, 4868]])\nc, d = 4, 8\nout = convert_range(seq, c, d)\nprint(np.round(out, 6))",
-            "expected_output": "[[4.874828 7.619257]\n [4.196974 5.396424]\n [6.4 6.708116]\n [4.565612 4. ]\n [4.69216 8. ]]"
+            "test": "grid_size = 5\ngamma = 0.9\nthreshold = 0.001\npolicy = {(i, j): {'up': 0.1, 'down': 0.4, 'left': 0.1, 'right': 0.4} for i in range(grid_size) for j in range(grid_size)}\nV = gridworld_policy_evaluation(policy, gamma, threshold)\nprint(round(V[1][3], 4) < 0)",
+            "expected_output": "True"
         }
     ],
-    "solution": "import numpy as np\n\ndef convert_range(values: np.ndarray, c: float, d: float) -> np.ndarray:\n \"\"\"\n Shift and scale values from their original range [min, max] to a target [c, d] range.\n\n Parameters\n ----------\n values : np.ndarray\n Input array (1D or 2D) to be rescaled.\n c : float\n New range lower bound.\n d : float\n New range upper bound.\n\n Returns\n -------\n np.ndarray\n Scaled array with the same shape as the input.\n \"\"\"\n a, b = values.min(), values.max()\n return c + (d - c) / (b - a) * (values - a)",
+    "solution": "def gridworld_policy_evaluation(policy: dict, gamma: float, threshold: float) -> list[list[float]]:\n grid_size = 5\n V = [[0.0 for _ in range(grid_size)] for _ in range(grid_size)]\n actions = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}\n reward = -1\n while True:\n delta = 0.0\n new_V = [row[:] for row in V]\n for i in range(grid_size):\n for j in range(grid_size):\n if (i, j) in [(0, 0), (0, grid_size-1), (grid_size-1, 0), (grid_size-1, grid_size-1)]:\n continue\n v = 0.0\n for action, prob in policy[(i, j)].items():\n di, dj = actions[action]\n ni = i + di if 0 <= i + di < grid_size else i\n nj = j + dj if 0 <= j + dj < grid_size else j\n v += prob * (reward + gamma * V[ni][nj])\n new_V[i][j] = v\n delta = max(delta, abs(V[i][j] - new_V[i][j]))\n V = new_V\n if delta < threshold:\n break\n return V",
     "example": {
-        "input": "import numpy as np\nx = np.array([0, 5, 10])\nc, d = 2, 4\nprint(convert_range(x, c, d))",
-        "output": "[2. 3. 4.]",
-        "reasoning": "The minimum value (a) is 0 and the maximum value (b) is 10. The formula maps 0 to 2, 5 to 3, and 10 to 4 using: f(x) = c + (d-c)/(b-a)*(x-a)."
+        "input": "policy = {(i, j): {'up': 0.25, 'down': 0.25, 'left': 0.25, 'right': 0.25} for i in range(5) for j in range(5)}\ngamma = 0.9\nthreshold = 0.001\nV = gridworld_policy_evaluation(policy, gamma, threshold)\nprint(round(V[2][2], 4))",
+        "output": "-7.0902",
+        "reasoning": "The policy is uniform (equal chance of each move). The agent receives -1 per step. After iterative updates, the center state value converges to about -7.09, and corners remain at 0."
    },
-    "category": "Machine Learning",
-    "starter_code": "import numpy as np\n\ndef convert_range(values: np.ndarray, c: float, d: float) -> np.ndarray:\n \"\"\"\n Shift and scale values from their original range [min, max] to a target [c, d] range.\n \"\"\"\n # Your code here\n pass",
-    "title": "Shift and Scale Array to Target Range",
-    "learn_section": "# **Shifting and Scaling a Range (Rescaling Data)**\n\n## **1. Motivation**\n\nRescaling (or shifting and scaling) is a common preprocessing step in data analysis and machine learning. It's often necessary to map data from an original range (e.g., test scores, pixel values, GPA) to a new range suitable for downstream tasks or compatibility between datasets. For example, you might want to shift a GPA from $[0, 10]$ to $[0, 4]$ for comparison or model input.\n\n---\n\n## **2. The General Mapping Formula**\n\nSuppose you have input values in the range $[a, b]$ and you want to map them to the interval $[c, d]$.\n\n- First, shift the lower bound to $0$ by applying $x \\mapsto x - a$, so $[a, b] \\rightarrow [0, b-a]$.\n- Next, scale to unit interval: $t \\mapsto \\frac{1}{b-a} \\cdot t$, yielding $[0, 1]$.\n- Now, scale to $[0, d-c]$ with $t \\mapsto (d-c)t$, and shift to $[c, d]$ with $t \\mapsto c + t$.\n- Combining all steps, the complete formula is:\n\n$$\n f(x) = c + \\left(\\frac{d-c}{b-a}\\right)(x-a)\n$$\n\n- $x$ = the input value\n- $a = \\min(x)$ and $b = \\max(x)$\n- $c$, $d$ = target interval endpoints\n\n---\n\n## **3. Applications**\n- **Image Processing**: Rescale pixel intensities\n- **Feature Engineering**: Normalize features to a common range\n- **Score Conversion**: Convert test scores or grades between systems\n\n---\n\n## **4. Practical Considerations**\n- Be aware of the case when $a = b$ (constant input); this may require special handling (e.g., output all $c$).\n- For multidimensional arrays, use NumPy’s `.min()` and `.max()` to determine the full input range.\n\n---\n\nThis formula gives a **simple, mathematically justified way to shift and scale data to any target range**—a core tool for robust machine learning pipelines.\n",
+    "category": "Reinforcement Learning",
+    "starter_code": "def gridworld_policy_evaluation(policy: dict, gamma: float, threshold: float) -> list[list[float]]:\n \"\"\"\n Evaluate state-value function for a policy on a 5x5 gridworld.\n \n Args:\n policy: dict mapping (row, col) to action probability dicts\n gamma: discount factor\n threshold: convergence threshold\n Returns:\n 5x5 list of floats\n \"\"\"\n # Your code here\n pass",
+    "learn_section": r"""# Gridworld Policy Evaluation
+
+In reinforcement learning, **policy evaluation** is the process of computing the state-value function for a given policy. For a gridworld environment, this involves iteratively updating the value of each state based on the expected return following the policy.
+
+## Key Concepts
+
+- **State-Value Function (V):**
+  The expected return when starting from a state and following a given policy.
+
+- **Policy:**
+  A mapping from states to probabilities of selecting each available action.
+
+- **Bellman Expectation Equation:**
+  For each state $s$:
+  $$
+  V(s) = \sum_{a} \pi(a|s) \sum_{s'} P(s'|s,a) [R(s,a,s') + \gamma V(s')]
+  $$
+  where:
+  - $\pi(a|s)$ is the probability of taking action $a$ in state $s$,
+  - $P(s'|s,a)$ is the probability of transitioning to state $s'$,
+  - $R(s,a,s')$ is the reward for that transition,
+  - $\gamma$ is the discount factor.
+
+## Algorithm Overview
+
+1. **Initialization:**
+   Start with an initial guess (commonly zeros) for the state-value function $V(s)$.
+
+2. **Iterative Update:**
+   For each non-terminal state, update the state value using the Bellman expectation equation. Continue updating until the maximum change in value (delta) is less than a given threshold.
+
+3. **Terminal States:**
+   For this example, the four corners of the grid are considered terminal, so their values remain unchanged.
+
+This evaluation method is essential for understanding how "good" each state is under a specific policy, and it forms the basis for more advanced reinforcement learning algorithms.""",
     "contributor": [
         {
-            "profile_link": "https://github.com/turkunov",
-            "name": "turkunov"
+            "profile_link": "https://github.com/arpitsinghgautam",
+            "name": "Arpit Singh Gautam"
        }
    ],
    "likes": "0",
    "dislikes": "0",
-    "difficulty": "easy",
+    "difficulty": "medium",
    "video": ""
 }


+
 # ────────────────────────────────────────────────────────────────────────────

