
Commit cef58f9

Merge pull request #502 from Open-Deep-ML/new-Q-158-e-greed
Added a new question for the n-armed bandit.
2 parents 74dc570 + b549fda commit cef58f9

8 files changed (+76, -24 lines)
Lines changed: 1 addition & 0 deletions
Implement the epsilon-greedy method for action selection in an n-armed bandit problem. Given a set of estimated action values (Q-values), select an action using the epsilon-greedy policy: with probability epsilon, choose a random action; with probability 1 - epsilon, choose the action with the highest estimated value.
Lines changed: 5 additions & 0 deletions
{
    "input": "Q = np.array([0.5, 2.3, 1.7])\nepsilon = 0.0\naction = epsilon_greedy(Q, epsilon)\nprint(action)",
    "output": "1",
    "reasoning": "With epsilon=0.0 (always greedy), the highest Q-value is 2.3 at index 1, so the function always returns 1."
}
Lines changed: 8 additions & 0 deletions
### Epsilon-Greedy Policy

The epsilon-greedy method is a fundamental action selection strategy used in reinforcement learning, especially for solving the n-armed bandit problem. The key idea is to balance **exploration** (trying new actions) and **exploitation** (choosing the best-known action):

- With probability $\varepsilon$ (epsilon), the agent explores by selecting an action at random.
- With probability $1-\varepsilon$, it exploits by choosing the action with the highest estimated value (greedy choice).

The epsilon-greedy policy is simple to implement and provides a way to avoid getting stuck with suboptimal actions due to insufficient exploration.
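
A point worth making explicit (editor's note, not part of the committed learn section): the greedy action can also be drawn during an exploration step, so with $n$ actions it is chosen with total probability $1-\varepsilon+\varepsilon/n$, and every other action with probability $\varepsilon/n$. The sketch below re-implements the commit's epsilon_greedy so it is self-contained, and estimates those selection frequencies empirically.

import numpy as np

def epsilon_greedy(Q, epsilon=0.1):
    # Same behaviour as the committed solution: explore with probability epsilon,
    # otherwise act greedily on the estimated values.
    if np.random.rand() < epsilon:
        return np.random.randint(len(Q))
    return int(np.argmax(Q))

Q = np.array([0.5, 2.3, 1.7])   # greedy action is index 1
epsilon = 0.3
counts = np.zeros(len(Q))
for _ in range(100_000):
    counts[epsilon_greedy(Q, epsilon)] += 1

print(np.round(counts / counts.sum(), 2))
# Roughly [0.1, 0.8, 0.1]: the greedy arm gets 1 - eps + eps/n = 0.8,
# each other arm gets eps/n = 0.1.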
Lines changed: 10 additions & 0 deletions
{
    "id": "158",
    "title": "Epsilon-Greedy Action Selection for n-Armed Bandit",
    "difficulty": "easy",
    "category": "Reinforcement Learning",
    "video": "",
    "likes": "0",
    "dislikes": "0",
    "contributor": []
}
Lines changed: 7 additions & 0 deletions
import numpy as np

def epsilon_greedy(Q, epsilon=0.1):
    if np.random.rand() < epsilon:
        return np.random.randint(len(Q))
    else:
        return int(np.argmax(Q))
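
The commit does not include a usage demo; the following is a minimal sketch (assuming the epsilon_greedy solution above is in scope) of how it might drive a 3-armed bandit loop with incremental sample-average value estimates. The reward means in true_means are made up for illustration.

import numpy as np

np.random.seed(0)
true_means = np.array([0.2, 0.8, 0.5])   # hypothetical arm reward means
Q = np.zeros(3)                          # estimated action values
N = np.zeros(3)                          # pull count per arm

for _ in range(1000):
    a = epsilon_greedy(Q, epsilon=0.1)             # choose an arm
    reward = np.random.normal(true_means[a], 1.0)  # noisy reward from that arm
    N[a] += 1
    Q[a] += (reward - Q[a]) / N[a]                 # incremental sample average

print(np.round(Q, 2))  # estimates should approach true_means for well-explored arms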
Lines changed: 11 additions & 0 deletions
import numpy as np

def epsilon_greedy(Q, epsilon=0.1):
    """
    Selects an action using epsilon-greedy policy.
    Q: np.ndarray of shape (n,) -- estimated action values
    epsilon: float in [0, 1]
    Returns: int, selected action index
    """
    # Your code here
    pass
Lines changed: 14 additions & 0 deletions
[
    {
        "test": "import numpy as np\nnp.random.seed(0)\nprint([epsilon_greedy(np.array([1, 2, 3]), epsilon=0.0) for _ in range(5)])",
        "expected_output": "[2, 2, 2, 2, 2]"
    },
    {
        "test": "import numpy as np\nnp.random.seed(1)\nprint([epsilon_greedy(np.array([5, 2, 1]), epsilon=1.0) for _ in range(5)])",
        "expected_output": "[0, 1, 1, 0, 0]"
    },
    {
        "test": "import numpy as np\nnp.random.seed(42)\nresults = [epsilon_greedy(np.array([1.5, 2.5, 0.5]), epsilon=0.5) for _ in range(10)]\nprint(results)",
        "expected_output": "[1, 0, 1, 1, 1, 0, 1, 0, 0, 0]"
    }
]
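
One caution about these fixtures (editor's observation, not part of the commit): the expected outputs for the epsilon = 1.0 and epsilon = 0.5 cases are reproducible only if the submitted implementation consumes NumPy's global RNG in the same order as the reference solution, i.e. one np.random.rand() draw per call, followed by an np.random.randint() draw when exploring. A quick check against the second fixture, assuming the solution above is in scope:

import numpy as np

np.random.seed(1)
# With epsilon=1.0 every call explores, so the printed actions depend only on
# the interleaved np.random.rand() and np.random.randint() draws under this seed.
print([epsilon_greedy(np.array([5, 2, 1]), epsilon=1.0) for _ in range(5)])
# The fixture's expected_output is [0, 1, 1, 0, 0]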

utils/convert_single_question.py

Lines changed: 20 additions & 24 deletions
@@ -28,42 +28,38 @@
 
 # ── 1️⃣ EDIT YOUR QUESTION HERE ────────────────────────────────────────────
 QUESTION_DICT: Dict[str, Any] = {
-    "id": "157",
-    "title": "Implement the Bellman Equation for Value Iteration",
-    "description": "Write a function that performs one step of value iteration for a given Markov Decision Process (MDP) using the Bellman equation. The function should update the state-value function V(s) for each state based on possible actions, transition probabilities, rewards, and the discount factor gamma. Only use NumPy.",
+    "id":'158',
+    "title": "Epsilon-Greedy Action Selection for n-Armed Bandit",
+    "description": "Implement the epsilon-greedy method for action selection in an n-armed bandit problem. Given a set of estimated action values (Q-values), select an action using the epsilon-greedy policy: with probability epsilon, choose a random action; with probability 1 - epsilon, choose the action with the highest estimated value.",
+    "category": "Reinforcement Learning",
+    "difficulty": "easy",
+    "starter_code": "import numpy as np\n\ndef epsilon_greedy(Q, epsilon=0.1):\n    \"\"\"\n    Selects an action using epsilon-greedy policy.\n    Q: np.ndarray of shape (n,) -- estimated action values\n    epsilon: float in [0, 1]\n    Returns: int, selected action index\n    \"\"\"\n    # Your code here\n    pass",
+    "solution": "import numpy as np\n\ndef epsilon_greedy(Q, epsilon=0.1):\n    if np.random.rand() < epsilon:\n        return np.random.randint(len(Q))\n    else:\n        return int(np.argmax(Q))",
     "test_cases": [
         {
-            "test": "import numpy as np\ntransitions = [\n    # For state 0\n    {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, False)]},\n    # For state 1\n    {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, True)]}\n]\nV = np.array([0.0, 0.0])\ngamma = 0.9\nnew_V = bellman_update(V, transitions, gamma)\nprint(np.round(new_V, 2))",
-            "expected_output": "[1., 1.]"
+            "test": "import numpy as np\nnp.random.seed(0)\nprint([epsilon_greedy(np.array([1, 2, 3]), epsilon=0.0) for _ in range(5)])",
+            "expected_output": "[2, 2, 2, 2, 2]"
+        },
+        {
+            "test": "import numpy as np\nnp.random.seed(1)\nprint([epsilon_greedy(np.array([5, 2, 1]), epsilon=1.0) for _ in range(5)])",
+            "expected_output": "[0, 1, 1, 0, 0]"
         },
         {
-            "test": "import numpy as np\ntransitions = [\n    {0: [(0.8, 0, 5, False), (0.2, 1, 10, False)], 1: [(1.0, 1, 2, False)]},\n    {0: [(1.0, 0, 0, False)], 1: [(1.0, 1, 0, True)]}\n]\nV = np.array([0.0, 0.0])\ngamma = 0.5\nnew_V = bellman_update(V, transitions, gamma)\nprint(np.round(new_V, 2))",
-            "expected_output": "[6., 0.]"
+            "test": "import numpy as np\nnp.random.seed(42)\nresults = [epsilon_greedy(np.array([1.5, 2.5, 0.5]), epsilon=0.5) for _ in range(10)]\nprint(results)",
+            "expected_output": "[1, 0, 1, 1, 1, 0, 1, 0, 0, 0]"
        }
    ],
-    "solution": "import numpy as np\n\ndef bellman_update(V, transitions, gamma):\n    n_states = len(V)\n    new_V = np.zeros_like(V)\n    for s in range(n_states):\n        action_values = []\n        for a in transitions[s]:\n            total = 0\n            for prob, next_s, reward, done in transitions[s][a]:\n                total += prob * (reward + gamma * (0 if done else V[next_s]))\n            action_values.append(total)\n        new_V[s] = max(action_values)\n    return new_V",
     "example": {
-        "input": "import numpy as np\ntransitions = [\n    {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, False)]},\n    {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, True)]}\n]\nV = np.array([0.0, 0.0])\ngamma = 0.9\nnew_V = bellman_update(V, transitions, gamma)\nprint(np.round(new_V, 2))",
-        "output": "[1. 1.]",
-        "reasoning": "For state 0, the best action is to go to state 1 and get a reward of 1. For state 1, taking action 1 gives a reward of 1 and ends the episode, so its value is 1."
+        "input": "Q = np.array([0.5, 2.3, 1.7])\nepsilon = 0.0\naction = epsilon_greedy(Q, epsilon)\nprint(action)",
+        "output": "1",
+        "reasoning": "With epsilon=0.0 (always greedy), the highest Q-value is 2.3 at index 1, so the function always returns 1."
     },
-    "category": "Reinforcement Learning",
-    "starter_code": "import numpy as np\n\ndef bellman_update(V, transitions, gamma):\n    \"\"\"\n    Perform one step of value iteration using the Bellman equation.\n    Args:\n        V: np.ndarray, state values, shape (n_states,)\n        transitions: list of dicts. transitions[s][a] is a list of (prob, next_state, reward, done)\n        gamma: float, discount factor\n    Returns:\n        np.ndarray, updated state values\n    \"\"\"\n    # TODO: Implement Bellman update\n    pass",
-    "learn_section": "# **The Bellman Equation**\n\nThe **Bellman equation** is a fundamental recursive equation in reinforcement learning that relates the value of a state to the values of possible next states. It provides the mathematical foundation for key RL algorithms such as value iteration and Q-learning.\n\n---\n\n## **Key Idea**\nFor each state $s$, the value $V(s)$ is the maximum expected return obtainable by choosing the best action $a$ and then following the optimal policy:\n\n$$\nV(s) = \\max_{a} \\sum_{s'} P(s'|s, a) \\left[ R(s, a, s') + \\gamma V(s') \\right]\n$$\n\nWhere:\n- $V(s)$: value of state $s$\n- $a$: possible actions\n- $P(s'|s, a)$: probability of moving to state $s'$ from $s$ via $a$\n- $R(s, a, s')$: reward for this transition\n- $\\gamma$: discount factor ($0 \\leq \\gamma \\leq 1$)\n- $V(s')$: value of next state\n\n---\n\n## **How to Use**\n1. **For each state:**\n   - For each possible action, sum over possible next states, weighting by transition probability.\n   - Add the immediate reward and the discounted value of the next state.\n   - Choose the action with the highest expected value (for control).\n2. **Repeat until values converge** (value iteration) or as part of other RL updates.\n\n---\n\n## **Applications**\n- **Value Iteration** and **Policy Iteration** in Markov Decision Processes (MDP)\n- **Q-learning** and other RL algorithms\n- Calculating the optimal value function and policy in gridworlds, games, and general MDPs\n\n---\n\n## **Why It Matters**\n- The Bellman equation formalizes the notion of **optimality** in sequential decision-making.\n- It is a backbone for teaching agents to solve environments with rewards, uncertainty, and long-term planning.",
-    "contributor": [
-        {
-            "profile_link": "https://github.com/moe18",
-            "name": "Moe Chabot"
-        }
-    ],
-    "likes": "0",
-    "dislikes": "0",
-    "difficulty": "medium",
-    "video": ""
+    "learn_section": "### Epsilon-Greedy Policy\n\nThe epsilon-greedy method is a fundamental action selection strategy used in reinforcement learning, especially for solving the n-armed bandit problem. The key idea is to balance **exploration** (trying new actions) and **exploitation** (choosing the best-known action):\n\n- With probability $\\varepsilon$ (epsilon), the agent explores by selecting an action at random.\n- With probability $1-\\varepsilon$, it exploits by choosing the action with the highest estimated value (greedy choice).\n\nThe epsilon-greedy policy is simple to implement and provides a way to avoid getting stuck with suboptimal actions due to insufficient exploration."
 }
 
 
 
+
 # ────────────────────────────────────────────────────────────────────────────
 
 