Commit 2b8a688

Merge pull request #400 from arpitsinghgautam/gridworld_policy_evaluation_rl
Added Gridworld Policy Evaluation RL Problem
2 parents cae7fec + 3b7c937 commit 2b8a688

4 files changed: +194 additions, 0 deletions
Lines changed: 38 additions & 0 deletions
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Gridworld Policy Evaluation</title>
</head>
<body>
<h2>Gridworld Policy Evaluation</h2>
<p>
In reinforcement learning, <strong>policy evaluation</strong> is the process of computing the state-value function for a given policy. In a gridworld environment, this involves iteratively updating the value of each state based on the expected return from following the policy.
</p>

<h3>Key Concepts</h3>
<ul>
<li><strong>State-Value Function (V):</strong> The expected return when starting from a state and following a policy.</li>
<li><strong>Policy:</strong> A mapping from states to probabilities for each available action.</li>
<li><strong>Bellman Expectation Equation:</strong>
<p>
For each state \( s \):<br>
\[
V(s) = \sum_{a} \pi(a|s) \sum_{s'} P(s'|s,a) [R(s,a,s') + \gamma V(s')]
\]
</p>
</li>
</ul>

<h3>Algorithm Overview</h3>
<ol>
<li><strong>Initialization:</strong> Start with an initial guess (e.g., zeros) for the state-value function \( V(s) \).</li>
<li><strong>Iterative Update:</strong> Update the state value for each non-terminal state using the Bellman equation until the maximum change is less than a set threshold.</li>
<li><strong>Terminal States:</strong> For this task, terminal states (the four corners) remain unchanged.</li>
</ol>

<p>
This method provides a foundation for assessing the quality of states under a given policy, which is crucial for many reinforcement learning techniques.
</p>
</body>
</html>
Lines changed: 35 additions & 0 deletions
# Gridworld Policy Evaluation

In reinforcement learning, **policy evaluation** is the process of computing the state-value function for a given policy. For a gridworld environment, this involves iteratively updating the value of each state based on the expected return from following the policy.

## Key Concepts

- **State-Value Function (V):**
  The expected return when starting from a state and following a given policy.

- **Policy:**
  A mapping from states to probabilities of selecting each available action.

- **Bellman Expectation Equation:**
  For each state $s$:
  $$
  V(s) = \sum_{a} \pi(a|s) \sum_{s'} P(s'|s,a) [R(s,a,s') + \gamma V(s')]
  $$
  where:
  - $\pi(a|s)$ is the probability of taking action $a$ in state $s$,
  - $P(s'|s,a)$ is the probability of transitioning to state $s'$,
  - $R(s,a,s')$ is the reward for that transition,
  - $\gamma$ is the discount factor.

## Algorithm Overview

1. **Initialization:**
   Start with an initial guess (commonly zeros) for the state-value function $V(s)$.

2. **Iterative Update:**
   For each non-terminal state, update the state value using the Bellman expectation equation. Continue updating until the maximum change in value (delta) is less than a given threshold.

3. **Terminal States:**
   For this example, the four corners of the grid are considered terminal, so their values remain unchanged.

This evaluation method is essential for understanding how "good" each state is under a specific policy, and it forms the basis for more advanced reinforcement learning algorithms.
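
As a quick numeric illustration of the update rule above (a minimal sketch added for clarity, not part of the submitted files): with $V$ initialized to zeros and a uniform random policy, the first Bellman backup for any interior state simply averages the immediate rewards.

```python
# Illustrative sketch only: one Bellman backup for an interior state on the first
# sweep, assuming the uniform random policy, gamma = 0.9, a reward of -1 per move,
# and V(s') = 0 for every successor state (the initial guess).
gamma = 0.9
reward = -1
pi = {'up': 0.25, 'down': 0.25, 'left': 0.25, 'right': 0.25}
V_successor = {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0}

v = sum(prob * (reward + gamma * V_successor[a]) for a, prob in pi.items())
print(v)  # -1.0: each action contributes 0.25 * (-1 + 0.9 * 0)
```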
Lines changed: 32 additions & 0 deletions
# Gridworld Policy Evaluation

Implement a function that evaluates the state-value function for a 5x5 gridworld under a given policy. In this gridworld, the agent can move in four directions: up, down, left, and right. Each move incurs a constant reward of -1, and the values of the terminal states (the four corners) remain fixed at 0. The policy is provided as a dictionary mapping each state (tuple: (row, col)) to a dictionary of action probabilities.

## Example

**Input:**

```python
policy = {
    (i, j): {'up': 0.25, 'down': 0.25, 'left': 0.25, 'right': 0.25}
    for i in range(5) for j in range(5)
}
gamma = 0.9
threshold = 0.001
```

**Output:**

A 5x5 grid (list of lists) of state values after the iterative evaluation converges.

```
[0.0, -4.864480919478529, -6.078955203735765, -4.864480919478529, 0.0]
[-4.864480919478529, -6.23388594292537, -6.7676569349718365, -6.233885942925371, -4.864480919478529]
[-6.078955203735764, -6.7676569349718365, -7.090189335232064, -6.7676569349718365, -6.078955203735764]
[-4.864480919478529, -6.23388594292537, -6.7676569349718365, -6.233885942925371, -4.864480919478529]
[0.0, -4.864480919478529, -6.078955203735765, -4.864480919478529, 0.0]
```

## Reasoning

For each non-terminal state, compute the expected value over all possible actions using the policy. Update the state value iteratively using the Bellman expectation equation until the maximum change across states is below the threshold, ensuring that terminal states remain fixed.
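
A small usage sketch (not part of the submitted files) shows how the example above could be reproduced with the `gridworld_policy_evaluation` function from the solution file further down; the module name used in the import is hypothetical and depends on where the solution is saved.

```python
# Usage sketch (illustrative): assumes the solution below is saved as
# gridworld_policy_evaluation.py (hypothetical file name) on the Python path.
from gridworld_policy_evaluation import gridworld_policy_evaluation

policy = {
    (i, j): {'up': 0.25, 'down': 0.25, 'left': 0.25, 'right': 0.25}
    for i in range(5) for j in range(5)
}

V = gridworld_policy_evaluation(policy, gamma=0.9, threshold=0.001)
for row in V:
    print(row)  # each printed row should match a row of the expected output above
```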
Lines changed: 89 additions & 0 deletions
def gridworld_policy_evaluation(policy: dict, gamma: float, threshold: float) -> list[list[float]]:
    """
    Evaluate the state-value function for a given policy on a 5x5 gridworld.

    Parameters:
    - policy: A dictionary mapping each state (tuple: (row, col)) to a dictionary of action probabilities.
    - gamma: Discount factor.
    - threshold: Convergence threshold.

    Returns:
    - A 5x5 list representing the state-value function.
    """
    grid_size = 5
    # Initialize state-value function to zeros
    V = [[0.0 for _ in range(grid_size)] for _ in range(grid_size)]
    # Define actions with their effects: up, down, left, right.
    actions = {
        'up': (-1, 0),
        'down': (1, 0),
        'left': (0, -1),
        'right': (0, 1)
    }
    # Constant reward per move
    reward = -1

    while True:
        delta = 0.0
        new_V = [row[:] for row in V]
        for i in range(grid_size):
            for j in range(grid_size):
                # For simplicity, assume corners are terminal states
                if (i, j) in [(0, 0), (0, grid_size-1), (grid_size-1, 0), (grid_size-1, grid_size-1)]:
                    continue
                v = 0.0
                # Update state value based on action probabilities
                for action, prob in policy[(i, j)].items():
                    di, dj = actions[action]
                    # If the move goes off-grid, the agent stays in the same state
                    new_i = i + di if 0 <= i + di < grid_size else i
                    new_j = j + dj if 0 <= j + dj < grid_size else j
                    v += prob * (reward + gamma * V[new_i][new_j])
                new_V[i][j] = v
                delta = max(delta, abs(V[i][j] - new_V[i][j]))
        V = new_V
        if delta < threshold:
            break
    return V


def test_gridworld_policy_evaluation() -> None:
    grid_size = 5
    gamma = 0.9
    threshold = 0.001

    # Policy 1: Uniform policy for all non-terminal states.
    policy1 = {
        (i, j): {'up': 0.25, 'down': 0.25, 'left': 0.25, 'right': 0.25}
        for i in range(grid_size) for j in range(grid_size)
    }

    # Policy 2: Biased policy favoring 'down' and 'right'.
    policy2 = {
        (i, j): {'up': 0.1, 'down': 0.4, 'left': 0.1, 'right': 0.4}
        for i in range(grid_size) for j in range(grid_size)
    }

    # Policy 3: Another fixed action distribution (probabilities still sum to 1),
    # applied to every state for illustration.
    policy3 = {
        (i, j): {'up': 0.2, 'down': 0.3, 'left': 0.3, 'right': 0.2}
        for i in range(grid_size) for j in range(grid_size)
    }

    policies = [policy1, policy2, policy3]

    for idx, policy in enumerate(policies, start=1):
        print(f"\nTesting Policy {idx}")
        # Test case 1: Verify grid dimensions
        V = gridworld_policy_evaluation(policy, gamma, threshold)
        assert len(V) == grid_size and all(len(row) == grid_size for row in V), "Grid dimension error"

        # Test case 2: Check that terminal states (corners) remain unchanged (value = 0)
        assert V[0][0] == 0 and V[0][grid_size-1] == 0 and V[grid_size-1][0] == 0 and V[grid_size-1][grid_size-1] == 0, "Terminal state value should be unchanged"

        # Test case 3: Verify that non-terminal state values are negative due to -1 reward per move
        assert V[2][2] < 0, "State value should be negative due to constant negative reward"
        print(f"All tests passed for Policy {idx}.")


if __name__ == "__main__":
    test_gridworld_policy_evaluation()
