From 0e1a6e72a9e7346a5c32926ca44943af3422c61b Mon Sep 17 00:00:00 2001
From: Amir Mohammad Fakhimi
Date: Fri, 28 Feb 2025 21:14:53 +0330
Subject: [PATCH 1/4] Fixing some writing mistakes and a problem in the "An Example
 of Gridworld" part of intro-to-rl.md
---
docs/course_notes/intro-to-rl.md | 17 +++++++++--------
1 file changed, 9 insertions(+), 8 deletions(-)
diff --git a/docs/course_notes/intro-to-rl.md b/docs/course_notes/intro-to-rl.md
index 48de3094..3b4836cd 100644
--- a/docs/course_notes/intro-to-rl.md
+++ b/docs/course_notes/intro-to-rl.md
@@ -204,7 +204,7 @@ Two fundamental problems in sequential decision making:
2. **Planning**:
- A model of the environment is **known**
- - The agent performs computations with its model (w**ithout any external
+ - The agent performs computations with its model (**without any external
interaction**)
- The agent **improves** its policy, a.k.a. deliberation, reasoning, introspection, pondering, thought, search
@@ -289,18 +289,19 @@ In the provided Gridworld example, the agent starts from the yellow square and h
Fig4. Grid World Example
The agent's choice depends on:
-- The **discount factor ($\gamma$)**, which determines whether it prioritizes short-term or long-term rewards.
-- The **noise level**, which introduces randomness into actions.
+ - The **discount factor ($\gamma$)**, which determines whether it prioritizes short-term or long-term rewards.
+ - The **noise level**, which introduces randomness into actions.
Depending on the values of $\gamma$ and noise, the agent's behavior varies:
+
1. **$\gamma$ = 0.1, noise = 0.5:**
- - The agent **prefers the close exit (+1) but takes the risk of stepping into the cliff (-10).**
+ - The agent **prefers the close exit (+1) but doesn't take the risk of stepping into the cliff (-10).**
2. **$\gamma$ = 0.99, noise = 0:**
- - The agent **prefers the distant exit (+10) while avoiding the cliff (-10).**
+ - The agent **prefers the distant exit (+10) and takes the risk of the cliff (-10).**
3. **$\gamma$ = 0.99, noise = 0.5:**
- - The agent **still prefers the distant exit (+10), but due to noise, it risks the cliff (-10).**
+ - The agent **still prefers the distant exit (+10), but due to noise, it doesn't risk the cliff (-10).**
4. **$\gamma$ = 0.1, noise = 0:**
- - The agent **chooses the close exit (+1) while avoiding the cliff.**
+ - The agent **chooses the close exit (+1) and takes the risk of the cliff.**
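
To make this trade-off concrete, the sketch below runs value iteration for the four $(\gamma, \text{noise})$ settings above on a small Gridworld. The layout (a close +1 exit, a distant +10 exit, and a row of -10 cliff cells) and the transition model (the intended move succeeds with probability $1 - \text{noise}$, otherwise the agent slips to a perpendicular direction) are assumptions in the spirit of Fig4, not an exact copy of it.

```python
# Value iteration on a small Gridworld (assumed layout, in the spirit of Fig4).
# '#' = wall, '.' = empty cell, numbers = terminal exit/cliff rewards.
import numpy as np

GRID = [
    ['.',   '.',   '.',   '.',   '.'],
    ['.',   '#',   '.',   '.',   '.'],
    ['.',   '#',   1.0,   '#',   10.0],
    ['.',   '.',   '.',   '.',   '.'],
    [-10.0, -10.0, -10.0, -10.0, -10.0],   # the cliff
]
MOVES = {'N': (-1, 0), 'S': (1, 0), 'E': (0, 1), 'W': (0, -1)}
PERP = {'N': 'EW', 'S': 'EW', 'E': 'NS', 'W': 'NS'}   # slip directions under noise

def step(r, c, a):
    """Deterministic effect of moving in direction a from (r, c); walls and edges block movement."""
    dr, dc = MOVES[a]
    nr, nc = r + dr, c + dc
    if not (0 <= nr < len(GRID) and 0 <= nc < len(GRID[0])) or GRID[nr][nc] == '#':
        return r, c
    return nr, nc

def value_iteration(gamma, noise, sweeps=200):
    V = np.zeros((len(GRID), len(GRID[0])))
    for _ in range(sweeps):
        new_V = np.zeros_like(V)
        for r in range(len(GRID)):
            for c in range(len(GRID[0])):
                cell = GRID[r][c]
                if cell == '#' or isinstance(cell, float):
                    continue                      # walls and terminal cells keep value 0
                q_values = []
                for a in MOVES:
                    # intended move with prob 1 - noise, each perpendicular slip with prob noise / 2
                    outcomes = [(a, 1 - noise)] + [(p, noise / 2) for p in PERP[a]]
                    q = 0.0
                    for actual, prob in outcomes:
                        nr, nc = step(r, c, actual)
                        nxt = GRID[nr][nc]
                        reward = nxt if isinstance(nxt, float) else 0.0   # reward on entering an exit/cliff cell
                        q += prob * (reward + gamma * V[nr, nc])
                    q_values.append(q)
                new_V[r, c] = max(q_values)
        V = new_V
    return V

for gamma, noise in [(0.1, 0.5), (0.99, 0.0), (0.99, 0.5), (0.1, 0.0)]:
    print(f"gamma={gamma}, noise={noise}")
    print(np.round(value_iteration(gamma, noise), 2))
```

Inspecting the greedy policy implied by each value table shows how a small $\gamma$ makes the nearby +1 dominate, while a nonzero noise penalizes the cells next to the cliff.
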
### Stochastic Policy
@@ -445,4 +446,4 @@ Consider the Grid World example where the agent navigates to a goal while avoidi
[:fontawesome-brands-linkedin-in:](https://www.linkedin.com/in/masoud-tahmasbi-fard/){:target="_blank"}
-
\ No newline at end of file
+
From a6a86f58e44e20a444310d50ea1d3b3c989762f4 Mon Sep 17 00:00:00 2001
From: Amir Mohammad Fakhimi
Date: Fri, 28 Feb 2025 21:32:21 +0330
Subject: [PATCH 2/4] Added Bellman Optimality Equation for state value
function in value-based.md
---
docs/course_notes/value-based.md | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/docs/course_notes/value-based.md b/docs/course_notes/value-based.md
index 3c593d6b..59a151e5 100644
--- a/docs/course_notes/value-based.md
+++ b/docs/course_notes/value-based.md
@@ -44,6 +44,15 @@ Where:
This equation allows for the iterative computation of state values in a model-based setting.
+#### Bellman Optimality Equation for $V^*(s)$:
+The **Bellman Optimality Equation** for $V^*(s)$ expresses the optimal state value function. It is given by:
+
+$$
+V^*(s) = \max_a \mathbb{E} \left[ R_{t+1} + \gamma V^\*(s_{t+1}) \mid s_t = s, a_t = a \right]
+$$
+
+This shows that the optimal value of each state is obtained by choosing the action that maximizes the expected immediate reward plus the discounted optimal value of the next state.
+
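
As a sanity check, this fixed point can be computed directly on a tiny tabular MDP. The sketch below uses a made-up two-state, two-action MDP (the transition tensor `P[a, s, s']` and reward matrix `R[s, a]` are arbitrary illustrative numbers, not something from the notes) and repeats the backup $V(s) \leftarrow \max_a \big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a) V(s') \big]$ until it converges.

```python
# Fixed-point iteration of the Bellman optimality backup for V*(s)
# on a made-up two-state, two-action MDP (illustrative numbers only).
import numpy as np

P = np.array([[[0.9, 0.1],      # P[a, s, s']: transition probabilities for action 0
               [0.2, 0.8]],
              [[0.5, 0.5],      # ... and for action 1
               [0.0, 1.0]]])
R = np.array([[1.0, 0.0],       # R[s, a]: expected immediate reward
              [0.0, 2.0]])
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):
    Q_sa = R + gamma * (P @ V).T    # R[s, a] + gamma * sum_s' P[a, s, s'] * V[s']
    V = Q_sa.max(axis=1)            # V(s) <- max_a (...)
print("V* =", V.round(3))
```

At convergence, $V$ no longer changes under this backup, i.e. it satisfies the Bellman optimality equation above.
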
---
### 1.2. Action Value Function $Q(s, a)$
@@ -80,7 +89,7 @@ Where:
The **Bellman Optimality Equation** for $Q^*(s, a)$ expresses the optimal action value function. It is given by:
$$
-Q^*(s, a) = \mathbb{E} \left[ R_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \mid s_t = s, a_t = a \right]
+Q^*(s, a) = \mathbb{E} \left[ R_{t+1} + \gamma \max_{a'} Q^\*(s_{t+1}, a') \mid s_t = s, a_t = a \right]
$$
This shows that the optimal action value at each state-action pair is the immediate reward plus the discounted maximum expected value from the next state, where the next action is chosen optimally.
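
The same fixed-point view works for the optimal action value function. The sketch below reuses the illustrative two-state MDP from the previous sketch (again, assumed numbers) and iterates the backup $Q(s,a) \leftarrow R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) \max_{a'} Q(s',a')$; taking the maximum over actions of the result should recover the state values computed earlier.

```python
# Fixed-point iteration of the Bellman optimality backup for Q*(s, a),
# reusing the same illustrative two-state MDP as the V* sketch above.
import numpy as np

P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])   # P[a, s, s']
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])                 # R[s, a]
gamma = 0.9

Q = np.zeros((2, 2))                       # Q[s, a]
for _ in range(1000):
    # Q[s, a] <- R[s, a] + gamma * sum_s' P[a, s, s'] * max_a' Q[s', a']
    Q = R + gamma * (P @ Q.max(axis=1)).T
print("Q* =", Q.round(3))
print("max_a Q*(s, .) =", Q.max(axis=1).round(3))   # matches V* from the previous sketch
```
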
@@ -639,4 +648,4 @@ The choice of method depends on the environment, the availability of a model, an
[:fontawesome-brands-linkedin-in:](https://www.linkedin.com/in/ghazal-hosseini-mighan-8b911823a){:target="_blank"}
- -->
\ No newline at end of file
+ -->
From 34312f840191f3f691b1e8d3a80a83bc10d83e48 Mon Sep 17 00:00:00 2001
From: Amir Mohammad Fakhimi
Date: Sat, 1 Mar 2025 04:58:22 +0330
Subject: [PATCH 3/4] Fixing writing mistakes in value-based.md
---
docs/course_notes/value-based.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/docs/course_notes/value-based.md b/docs/course_notes/value-based.md
index 59a151e5..c6a4c26f 100644
--- a/docs/course_notes/value-based.md
+++ b/docs/course_notes/value-based.md
@@ -283,7 +283,7 @@ $$
\hat{I}_N = \frac{1}{N} \sum_{i=1}^{N} f(x_i),
$$
-where $ x_i $ are **independent** samples drawn from $ p(x) $. The **Law of Large Numbers (LLN)** ensures that as $N \to \infty$:
+where $x_i$ are **independent** samples drawn from $p(x)$. The **Law of Large Numbers (LLN)** ensures that as $N \to \infty$:
$$
\hat{I}_N \to I.
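
A quick numerical sketch of this estimator, with $p$ and $f$ chosen arbitrarily for illustration (a standard normal $p$ and $f(x) = x^2$, so the true value is $I = 1$):

```python
# Monte Carlo estimate of I = E_{x ~ p}[f(x)] with p = N(0, 1) and f(x) = x**2,
# chosen only so that the true value I = 1 is known in closed form.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2

for N in [10, 1_000, 100_000]:
    x = rng.standard_normal(N)   # independent samples from p(x)
    I_hat = f(x).mean()          # (1/N) * sum_i f(x_i)
    print(f"N = {N:>6}: I_hat = {I_hat:.4f}")
```

As $N$ grows, $\hat{I}_N$ concentrates around $I = 1$, exactly as the LLN guarantees.
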
From f421eb60e7c9e0b99d7b1b9fa617732ab2c016e3 Mon Sep 17 00:00:00 2001
From: Amir Mohammad Fakhimi
Date: Sat, 1 Mar 2025 04:59:20 +0330
Subject: [PATCH 4/4] Fixed a typo in week2.md
---
docs/workshops/week2.md | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/docs/workshops/week2.md b/docs/workshops/week2.md
index c4de72ed..f2abc513 100644
--- a/docs/workshops/week2.md
+++ b/docs/workshops/week2.md
@@ -46,5 +46,5 @@ comments: True
### Notebook(s)
\ No newline at end of file
+ Workshop 2 Notebook(s)
+