
Commit 62dc13a

Merge branch 'main' into new-Q-159
2 parents: c403e5d + cef58f9

File tree

8 files changed (+57, -27 lines)
Lines changed: 1 addition & 0 deletions

@@ -0,0 +1 @@
Implement the epsilon-greedy method for action selection in an n-armed bandit problem. Given a set of estimated action values (Q-values), select an action using the epsilon-greedy policy: with probability epsilon, choose a random action; with probability 1 - epsilon, choose the action with the highest estimated value.
Lines changed: 5 additions & 0 deletions

@@ -0,0 +1,5 @@
{
    "input": "Q = np.array([0.5, 2.3, 1.7])\nepsilon = 0.0\naction = epsilon_greedy(Q, epsilon)\nprint(action)",
    "output": "1",
    "reasoning": "With epsilon=0.0 (always greedy), the highest Q-value is 2.3 at index 1, so the function always returns 1."
}
Lines changed: 8 additions & 0 deletions

@@ -0,0 +1,8 @@
### Epsilon-Greedy Policy

The epsilon-greedy method is a fundamental action selection strategy used in reinforcement learning, especially for solving the n-armed bandit problem. The key idea is to balance **exploration** (trying new actions) and **exploitation** (choosing the best-known action):

- With probability $\varepsilon$ (epsilon), the agent explores by selecting an action at random.
- With probability $1-\varepsilon$, it exploits by choosing the action with the highest estimated value (greedy choice).

The epsilon-greedy policy is simple to implement and provides a way to avoid getting stuck with suboptimal actions due to insufficient exploration.
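For intuition, if the exploratory draw is uniform over all $n$ actions (including the greedy one, as in the reference solution below), then with $n = 4$ and $\varepsilon = 0.2$ the greedy action is selected with probability $1 - \varepsilon + \varepsilon/n = 0.85$, while each other action is selected with probability $\varepsilon/n = 0.05$.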
Lines changed: 10 additions & 0 deletions

@@ -0,0 +1,10 @@
{
    "id": "158",
    "title": "Epsilon-Greedy Action Selection for n-Armed Bandit",
    "difficulty": "easy",
    "category": "Reinforcement Learning",
    "video": "",
    "likes": "0",
    "dislikes": "0",
    "contributor": []
}
Lines changed: 7 additions & 0 deletions

@@ -0,0 +1,7 @@
import numpy as np

def epsilon_greedy(Q, epsilon=0.1):
    if np.random.rand() < epsilon:
        return np.random.randint(len(Q))
    else:
        return int(np.argmax(Q))
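As a quick sanity check on the function above, here is a minimal usage sketch that estimates empirical selection frequencies; the action values, epsilon = 0.2, seed, and draw count are illustrative assumptions, not part of the committed files.

import numpy as np

def epsilon_greedy(Q, epsilon=0.1):
    # Mirrors the solution above: explore with probability epsilon,
    # otherwise pick the action with the highest estimated value.
    if np.random.rand() < epsilon:
        return np.random.randint(len(Q))
    return int(np.argmax(Q))

# Illustrative action values, epsilon, seed, and draw count (assumptions).
np.random.seed(0)
Q = np.array([0.5, 2.3, 1.7, 1.0])
counts = np.zeros(len(Q), dtype=int)
for _ in range(10_000):
    counts[epsilon_greedy(Q, epsilon=0.2)] += 1

# The greedy action (index 1) should land near 1 - 0.2 + 0.2/4 = 0.85,
# every other action near 0.2/4 = 0.05.
print(counts / counts.sum())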
Lines changed: 11 additions & 0 deletions

@@ -0,0 +1,11 @@
import numpy as np

def epsilon_greedy(Q, epsilon=0.1):
    """
    Selects an action using the epsilon-greedy policy.
    Q: np.ndarray of shape (n,) -- estimated action values
    epsilon: float in [0, 1]
    Returns: int, selected action index
    """
    # Your code here
    pass
Lines changed: 14 additions & 0 deletions

@@ -0,0 +1,14 @@
[
    {
        "test": "import numpy as np\nnp.random.seed(0)\nprint([epsilon_greedy(np.array([1, 2, 3]), epsilon=0.0) for _ in range(5)])",
        "expected_output": "[2, 2, 2, 2, 2]"
    },
    {
        "test": "import numpy as np\nnp.random.seed(1)\nprint([epsilon_greedy(np.array([5, 2, 1]), epsilon=1.0) for _ in range(5)])",
        "expected_output": "[0, 1, 1, 0, 0]"
    },
    {
        "test": "import numpy as np\nnp.random.seed(42)\nresults = [epsilon_greedy(np.array([1.5, 2.5, 0.5]), epsilon=0.5) for _ in range(10)]\nprint(results)",
        "expected_output": "[1, 0, 1, 1, 1, 0, 1, 0, 0, 0]"
    }
]

utils/convert_single_question.py

Lines changed: 1 addition & 27 deletions
@@ -29,33 +29,7 @@
 # ── 1️⃣ EDIT YOUR QUESTION HERE ────────────────────────────────────────────
 QUESTION_DICT: Dict[str, Any] = {
     "id":'158',
-    "title": "Incremental Mean for Online Reward Estimation",
-    "description": "Implement an efficient method to update the mean reward for a k-armed bandit action after receiving each new reward, **without storing the full history of rewards**. Given the previous mean estimate (Q_prev), the number of times the action has been selected (k), and a new reward (R), compute the updated mean using the incremental formula.\n\n**Note:** Using a regular mean that stores all past rewards will eventually run out of memory. Your solution should use only the previous mean, the count, and the new reward.",
-    "category": "Reinforcement Learning",
-    "difficulty": "easy",
-    "starter_code": "def incremental_mean(Q_prev, k, R):\n \"\"\"\n Q_prev: previous mean estimate (float)\n k: number of times the action has been selected (int)\n R: new observed reward (float)\n Returns: new mean estimate (float)\n \"\"\"\n # Your code here\n pass\n",
-    "solution": "def incremental_mean(Q_prev, k, R):\n return Q_prev + (1 / k) * (R - Q_prev)",
-    "test_cases": [
-        {
-            "test": "Q = 0.0\nk = 1\nR = 5.0\nprint(round(incremental_mean(Q, k, R), 4))",
-            "expected_output": "5.0"
-        },
-        {
-            "test": "Q = 5.0\nk = 2\nR = 7.0\nprint(round(incremental_mean(Q, k, R), 4))",
-            "expected_output": "6.0"
-        },
-        {
-            "test": "Q = 6.0\nk = 3\nR = 4.0\nprint(round(incremental_mean(Q, k, R), 4))",
-            "expected_output": "5.3333"
-        }
-    ],
-    "example": {
-        "input": "Q_prev = 2.0\nk = 2\nR = 6.0\nnew_Q = incremental_mean(Q_prev, k, R)\nprint(round(new_Q, 2))",
-        "output": "4.0",
-        "reasoning": "The updated mean is Q_prev + (1/k) * (R - Q_prev) = 2.0 + (1/2)*(6.0 - 2.0) = 2.0 + 2.0 = 4.0"
-    },
-    "learn_section": "### Incremental Mean Update Rule\n\nThe incremental mean formula lets you update your estimate of the mean after each new observation, **without keeping all previous rewards in memory**. For the k-th reward $R_k$ and previous estimate $Q_{k}$:\n\n$$\nQ_{k+1} = Q_k + \\frac{1}{k} (R_k - Q_k)\n$$\n\nThis saves memory compared to the regular mean, which requires storing all past rewards and recalculating each time. The incremental rule is crucial for online learning and large-scale problems where storing all data is impractical."
-}
+