Commit 2b8a688

Merge pull request #400 from arpitsinghgautam/gridworld_policy_evaluation_rl
Added Gridworld Policy Evaluation RL Problem
2 parents cae7fec + 3b7c937 commit 2b8a688

4 files changed: +194 additions, 0 deletions
Lines changed: 38 additions & 0 deletions
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Gridworld Policy Evaluation</title>
</head>
<body>
<h2>Gridworld Policy Evaluation</h2>
<p>
In reinforcement learning, <strong>policy evaluation</strong> is the process of computing the state-value function for a given policy. In a gridworld environment, this involves iteratively updating the value of each state based on the expected return from following the policy.
</p>

<h3>Key Concepts</h3>
<ul>
<li><strong>State-Value Function (V):</strong> The expected return when starting from a state and following a policy.</li>
<li><strong>Policy:</strong> A mapping from states to probabilities for each available action.</li>
<li><strong>Bellman Expectation Equation:</strong>
<p>
For each state \( s \):<br>
\[
V(s) = \sum_{a} \pi(a|s) \sum_{s'} P(s'|s,a) [R(s,a,s') + \gamma V(s')]
\]
</p>
</li>
</ul>

<h3>Algorithm Overview</h3>
<ol>
<li><strong>Initialization:</strong> Start with an initial guess (e.g., zeros) for the state-value function \( V(s) \).</li>
<li><strong>Iterative Update:</strong> Update the state value for each non-terminal state using the Bellman equation until the maximum change is less than a set threshold.</li>
<li><strong>Terminal States:</strong> For this task, terminal states (the four corners) remain unchanged.</li>
</ol>

<p>
This method provides a foundation for assessing the quality of states under a given policy, which is crucial for many reinforcement learning techniques.
</p>
</body>
</html>
Lines changed: 35 additions & 0 deletions
# Gridworld Policy Evaluation

In reinforcement learning, **policy evaluation** is the process of computing the state-value function for a given policy. For a gridworld environment, this involves iteratively updating the value of each state based on the expected return from following the policy.

## Key Concepts

- **State-Value Function (V):**
  The expected return when starting from a state and following a given policy.

- **Policy:**
  A mapping from states to probabilities of selecting each available action.

- **Bellman Expectation Equation:**
  For each state $s$:
  $$
  V(s) = \sum_{a} \pi(a|s) \sum_{s'} P(s'|s,a) [R(s,a,s') + \gamma V(s')]
  $$
  where:
  - $\pi(a|s)$ is the probability of taking action $a$ in state $s$,
  - $P(s'|s,a)$ is the probability of transitioning to state $s'$,
  - $R(s,a,s')$ is the reward for that transition,
  - $\gamma$ is the discount factor.

## Algorithm Overview

1. **Initialization:**
   Start with an initial guess (commonly zeros) for the state-value function $V(s)$.

2. **Iterative Update:**
   For each non-terminal state, update the state value using the Bellman expectation equation. Continue updating until the maximum change in value (delta) is less than a given threshold.

3. **Terminal States:**
   For this example, the four corners of the grid are considered terminal, so their values remain unchanged.

This evaluation method is essential for understanding how "good" each state is under a specific policy, and it forms the basis for more advanced reinforcement learning algorithms.
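
As a quick numeric illustration of the update rule above (a minimal sketch added for clarity, not part of the submitted files): with $V$ initialized to zeros and a uniform random policy, the first Bellman backup for any interior state simply averages the immediate rewards.

```python
# Illustrative sketch only: one Bellman backup for an interior state on the first
# sweep, assuming the uniform random policy, gamma = 0.9, a reward of -1 per move,
# and V(s') = 0 for every successor state (the initial guess).
gamma = 0.9
reward = -1
pi = {'up': 0.25, 'down': 0.25, 'left': 0.25, 'right': 0.25}
V_successor = {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.0}

v = sum(prob * (reward + gamma * V_successor[a]) for a, prob in pi.items())
print(v)  # -1.0: each action contributes 0.25 * (-1 + 0.9 * 0)
```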
Lines changed: 32 additions & 0 deletions
# Gridworld Policy Evaluation

Implement a function that evaluates the state-value function for a 5x5 gridworld under a given policy. In this gridworld, the agent can move in four directions: up, down, left, and right. Each move incurs a constant reward of -1, and the values of the terminal states (the four corners) remain fixed at 0. The policy is provided as a dictionary mapping each state (tuple: (row, col)) to a dictionary of action probabilities.

## Example

**Input:**

```python
policy = {
    (i, j): {'up': 0.25, 'down': 0.25, 'left': 0.25, 'right': 0.25}
    for i in range(5) for j in range(5)
}
gamma = 0.9
threshold = 0.001
```

**Output:**

A 5x5 grid (list of lists) of state values after the iterative evaluation converges.

```
[0.0, -4.864480919478529, -6.078955203735765, -4.864480919478529, 0.0]
[-4.864480919478529, -6.23388594292537, -6.7676569349718365, -6.233885942925371, -4.864480919478529]
[-6.078955203735764, -6.7676569349718365, -7.090189335232064, -6.7676569349718365, -6.078955203735764]
[-4.864480919478529, -6.23388594292537, -6.7676569349718365, -6.233885942925371, -4.864480919478529]
[0.0, -4.864480919478529, -6.078955203735765, -4.864480919478529, 0.0]
```

## Reasoning

For each non-terminal state, compute the expected value over all possible actions using the policy. Update the state value iteratively using the Bellman expectation equation until the maximum change across states is below the threshold, ensuring that terminal states remain fixed.
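
A small usage sketch (not part of the submitted files) shows how the example above could be reproduced with the `gridworld_policy_evaluation` function from the solution file further down; the module name used in the import is hypothetical and depends on where the solution is saved.

```python
# Usage sketch (illustrative): assumes the solution below is saved as
# gridworld_policy_evaluation.py (hypothetical file name) on the Python path.
from gridworld_policy_evaluation import gridworld_policy_evaluation

policy = {
    (i, j): {'up': 0.25, 'down': 0.25, 'left': 0.25, 'right': 0.25}
    for i in range(5) for j in range(5)
}

V = gridworld_policy_evaluation(policy, gamma=0.9, threshold=0.001)
for row in V:
    print(row)  # each printed row should match a row of the expected output above
```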
Lines changed: 89 additions & 0 deletions
def gridworld_policy_evaluation(policy: dict, gamma: float, threshold: float) -> list[list[float]]:
    """
    Evaluate the state-value function for a given policy on a 5x5 gridworld.

    Parameters:
    - policy: A dictionary mapping each state (tuple: (row, col)) to a dictionary of action probabilities.
    - gamma: Discount factor.
    - threshold: Convergence threshold.

    Returns:
    - A 5x5 list representing the state-value function.
    """
    grid_size = 5
    # Initialize state-value function to zeros
    V = [[0.0 for _ in range(grid_size)] for _ in range(grid_size)]
    # Define actions with their effects: up, down, left, right.
    actions = {
        'up': (-1, 0),
        'down': (1, 0),
        'left': (0, -1),
        'right': (0, 1)
    }
    # Constant reward per move
    reward = -1

    while True:
        delta = 0.0
        new_V = [row[:] for row in V]
        for i in range(grid_size):
            for j in range(grid_size):
                # For simplicity, assume corners are terminal states
                if (i, j) in [(0, 0), (0, grid_size-1), (grid_size-1, 0), (grid_size-1, grid_size-1)]:
                    continue
                v = 0.0
                # Update state value based on action probabilities
                for action, prob in policy[(i, j)].items():
                    di, dj = actions[action]
                    # If the move goes off-grid, the agent stays in the same state
                    new_i = i + di if 0 <= i + di < grid_size else i
                    new_j = j + dj if 0 <= j + dj < grid_size else j
                    v += prob * (reward + gamma * V[new_i][new_j])
                new_V[i][j] = v
                delta = max(delta, abs(V[i][j] - new_V[i][j]))
        V = new_V
        if delta < threshold:
            break
    return V


def test_gridworld_policy_evaluation() -> None:
    grid_size = 5
    gamma = 0.9
    threshold = 0.001

    # Policy 1: Uniform policy for all non-terminal states.
    policy1 = {
        (i, j): {'up': 0.25, 'down': 0.25, 'left': 0.25, 'right': 0.25}
        for i in range(grid_size) for j in range(grid_size)
    }

    # Policy 2: Biased policy favoring 'down' and 'right'.
    policy2 = {
        (i, j): {'up': 0.1, 'down': 0.4, 'left': 0.1, 'right': 0.4}
        for i in range(grid_size) for j in range(grid_size)
    }

    # Policy 3: Another fixed action distribution (probabilities still sum to 1),
    # applied to every state for illustration.
    policy3 = {
        (i, j): {'up': 0.2, 'down': 0.3, 'left': 0.3, 'right': 0.2}
        for i in range(grid_size) for j in range(grid_size)
    }

    policies = [policy1, policy2, policy3]

    for idx, policy in enumerate(policies, start=1):
        print(f"\nTesting Policy {idx}")
        # Test case 1: Verify grid dimensions
        V = gridworld_policy_evaluation(policy, gamma, threshold)
        assert len(V) == grid_size and all(len(row) == grid_size for row in V), "Grid dimension error"

        # Test case 2: Check that terminal states (corners) remain unchanged (value = 0)
        assert V[0][0] == 0 and V[0][grid_size-1] == 0 and V[grid_size-1][0] == 0 and V[grid_size-1][grid_size-1] == 0, "Terminal state value should be unchanged"

        # Test case 3: Verify that non-terminal state values are negative due to -1 reward per move
        assert V[2][2] < 0, "State value should be negative due to constant negative reward"
        print(f"All tests passed for Policy {idx}.")


if __name__ == "__main__":
    test_gridworld_policy_evaluation()
