
Commit 7291a94

Merge pull request #503 from Open-Deep-ML/new-Q-159
added new Q
2 parents cef58f9 + 62dc13a commit 7291a94

8 files changed: +53, -27 lines

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+Implement an efficient method to update the mean reward for a k-armed bandit action after receiving each new reward, **without storing the full history of rewards**. Given the previous mean estimate (Q_prev), the number of times the action has been selected (k), and a new reward (R), compute the updated mean using the incremental formula.
+
+**Note:** A regular mean that stores all past rewards will eventually run out of memory. Your solution should use only the previous mean, the count, and the new reward.
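As an illustration (not part of the committed files), the short sketch below contrasts a regular mean that keeps the whole reward history with the incremental update this task asks for; the function name `incremental_mean` and the sample rewards are assumptions chosen to mirror the starter code and tests in this commit.

```python
# Illustrative sketch only; not part of this commit.
# A regular mean must store every reward, while the incremental update keeps
# only the running estimate and the count, yet produces the same value.

def incremental_mean(Q_prev, k, R):
    # Q_prev: estimate before this reward, k: selection count including it, R: new reward
    return Q_prev + (1 / k) * (R - Q_prev)

history = []   # what the regular mean has to keep (grows without bound)
Q = 0.0        # what the incremental estimate keeps (plus the count k)
for k, R in enumerate([5.0, 7.0, 4.0, 9.0], start=1):
    history.append(R)
    regular = sum(history) / len(history)   # O(k) memory
    Q = incremental_mean(Q, k, R)           # O(1) memory
    assert abs(regular - Q) < 1e-12         # the two estimates agree at every step

print(round(Q, 2))  # 6.25
```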
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+{
+  "input": "Q_prev = 2.0\nk = 2\nR = 6.0\nnew_Q = incremental_mean(Q_prev, k, R)\nprint(round(new_Q, 2))",
+  "output": "4.0",
+  "reasoning": "The updated mean is Q_prev + (1/k) * (R - Q_prev) = 2.0 + (1/2)*(6.0 - 2.0) = 2.0 + 2.0 = 4.0"
+}
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+### Incremental Mean Update Rule
+
+The incremental mean formula lets you update your estimate of the mean after each new observation, **without keeping all previous rewards in memory**. For the k-th reward $R_k$ and previous estimate $Q_k$:
+
+$$
+Q_{k+1} = Q_k + \frac{1}{k} (R_k - Q_k)
+$$
+
+This saves memory compared to the regular mean, which requires storing all past rewards and recalculating each time. The incremental rule is crucial for online learning and large-scale problems where storing all data is impractical.
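For completeness (this derivation is not part of the committed learn file), the update rule follows from rewriting the sample mean of the first $k$ rewards in terms of the previous estimate $Q_k = \frac{1}{k-1}\sum_{i=1}^{k-1} R_i$:

$$
Q_{k+1} = \frac{1}{k}\sum_{i=1}^{k} R_i = \frac{1}{k}\left(R_k + (k-1)\,Q_k\right) = Q_k + \frac{1}{k}\left(R_k - Q_k\right)
$$

For $k = 1$ the $(k-1)\,Q_k$ term vanishes, so $Q_2 = R_1$ regardless of the initial estimate, which matches the first test case ($Q = 0.0$, $k = 1$, $R = 5.0$ gives $5.0$).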
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+{
+  "id": "158",
+  "title": "Incremental Mean for Online Reward Estimation",
+  "difficulty": "easy",
+  "category": "Reinforcement Learning",
+  "video": "",
+  "likes": "0",
+  "dislikes": "0",
+  "contributor": []
+}
Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+def incremental_mean(Q_prev, k, R):
+    return Q_prev + (1 / k) * (R - Q_prev)
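As a quick, illustrative check (not part of the diff), calling the committed solution with the values from the example file reproduces the expected output:

```python
# Illustrative usage with the example values; not part of the commit.
def incremental_mean(Q_prev, k, R):
    return Q_prev + (1 / k) * (R - Q_prev)

Q_prev, k, R = 2.0, 2, 6.0
new_Q = incremental_mean(Q_prev, k, R)  # 2.0 + (1/2) * (6.0 - 2.0) = 4.0
print(round(new_Q, 2))                  # prints 4.0
```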
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+def incremental_mean(Q_prev, k, R):
+    """
+    Q_prev: previous mean estimate (float)
+    k: number of times the action has been selected (int)
+    R: new observed reward (float)
+    Returns: new mean estimate (float)
+    """
+    # Your code here
+    pass
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+[
+  {
+    "test": "Q = 0.0\nk = 1\nR = 5.0\nprint(round(incremental_mean(Q, k, R), 4))",
+    "expected_output": "5.0"
+  },
+  {
+    "test": "Q = 5.0\nk = 2\nR = 7.0\nprint(round(incremental_mean(Q, k, R), 4))",
+    "expected_output": "6.0"
+  },
+  {
+    "test": "Q = 6.0\nk = 3\nR = 4.0\nprint(round(incremental_mean(Q, k, R), 4))",
+    "expected_output": "5.3333"
+  }
+]

utils/convert_single_question.py

Lines changed: 1 addition & 27 deletions
@@ -29,33 +29,7 @@
 # ── 1️⃣ EDIT YOUR QUESTION HERE ────────────────────────────────────────────
 QUESTION_DICT: Dict[str, Any] = {
     "id":'158',
-    "title": "Epsilon-Greedy Action Selection for n-Armed Bandit",
-    "description": "Implement the epsilon-greedy method for action selection in an n-armed bandit problem. Given a set of estimated action values (Q-values), select an action using the epsilon-greedy policy: with probability epsilon, choose a random action; with probability 1 - epsilon, choose the action with the highest estimated value.",
-    "category": "Reinforcement Learning",
-    "difficulty": "easy",
-    "starter_code": "import numpy as np\n\ndef epsilon_greedy(Q, epsilon=0.1):\n    \"\"\"\n    Selects an action using epsilon-greedy policy.\n    Q: np.ndarray of shape (n,) -- estimated action values\n    epsilon: float in [0, 1]\n    Returns: int, selected action index\n    \"\"\"\n    # Your code here\n    pass",
-    "solution": "import numpy as np\n\ndef epsilon_greedy(Q, epsilon=0.1):\n    if np.random.rand() < epsilon:\n        return np.random.randint(len(Q))\n    else:\n        return int(np.argmax(Q))",
-    "test_cases": [
-        {
-            "test": "import numpy as np\nnp.random.seed(0)\nprint([epsilon_greedy(np.array([1, 2, 3]), epsilon=0.0) for _ in range(5)])",
-            "expected_output": "[2, 2, 2, 2, 2]"
-        },
-        {
-            "test": "import numpy as np\nnp.random.seed(1)\nprint([epsilon_greedy(np.array([5, 2, 1]), epsilon=1.0) for _ in range(5)])",
-            "expected_output": "[0, 1, 1, 0, 0]"
-        },
-        {
-            "test": "import numpy as np\nnp.random.seed(42)\nresults = [epsilon_greedy(np.array([1.5, 2.5, 0.5]), epsilon=0.5) for _ in range(10)]\nprint(results)",
-            "expected_output": "[1, 0, 1, 1, 1, 0, 1, 0, 0, 0]"
-        }
-    ],
-    "example": {
-        "input": "Q = np.array([0.5, 2.3, 1.7])\nepsilon = 0.0\naction = epsilon_greedy(Q, epsilon)\nprint(action)",
-        "output": "1",
-        "reasoning": "With epsilon=0.0 (always greedy), the highest Q-value is 2.3 at index 1, so the function always returns 1."
-    },
-    "learn_section": "### Epsilon-Greedy Policy\n\nThe epsilon-greedy method is a fundamental action selection strategy used in reinforcement learning, especially for solving the n-armed bandit problem. The key idea is to balance **exploration** (trying new actions) and **exploitation** (choosing the best-known action):\n\n- With probability $\\varepsilon$ (epsilon), the agent explores by selecting an action at random.\n- With probability $1-\\varepsilon$, it exploits by choosing the action with the highest estimated value (greedy choice).\n\nThe epsilon-greedy policy is simple to implement and provides a way to avoid getting stuck with suboptimal actions due to insufficient exploration."
-}
+}