17 changes: 9 additions & 8 deletions docs/course_notes/intro-to-rl.md
@@ -204,7 +204,7 @@ Two fundamental problems in sequential decision making:

2. **Planning**:
- A model of the environment is **known**
- The agent performs computations with its model (w**ithout any external
- The agent performs computations with its model (**without any external
interaction**)
- The agent **improves** its policy, a.k.a. deliberation, reasoning, introspection, pondering, thought, search

@@ -289,18 +289,19 @@ In the provided Gridworld example, the agent starts from the yellow square and h
Fig4. Grid World Example </center>

The agent's choice depends on:
- The **discount factor ($\gamma$)**, which determines whether it prioritizes short-term or long-term rewards.
- The **noise level**, which introduces randomness into actions.
- The **discount factor ($\gamma$)**, which determines whether it prioritizes short-term or long-term rewards.
- The **noise level**, which introduces randomness into actions.

Depending on the values of $\gamma$ and noise, the agent's behavior varies (a small sketch after this list illustrates all four cases):

1. **$\gamma$ = 0.1, noise = 0.5:**
- The agent **prefers the close exit (+1) but takes the risk of stepping into the cliff (-10).**
- The agent **prefers the close exit (+1) but doesn't take the risk of stepping into the cliff (-10).**
2. **$\gamma$ = 0.99, noise = 0:**
- The agent **prefers the distant exit (+10) while avoiding the cliff (-10).**
- The agent **prefers the distant exit (+10) and takes the risk of the cliff (-10).**
3. **$\gamma$ = 0.99, noise = 0.5:**
- The agent **still prefers the distant exit (+10), but due to noise, it risks the cliff (-10).**
- The agent **still prefers the distant exit (+10), but due to noise, it doesn't risk the cliff (-10).**
4. **$\gamma$ = 0.1, noise = 0:**
- The agent **chooses the close exit (+1) while avoiding the cliff.**
- The agent **chooses the close exit (+1) and takes the risk of the cliff.**
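
A minimal value-iteration sketch can reproduce all four behaviors. The layout below is an assumption, not the course's own code: it uses the classic 5x5 "discount grid" (a close +1 exit, a distant +10 exit, and a row of -10 cliff squares along the bottom), together with the usual noise model in which the intended move succeeds with probability $1 - \text{noise}$ and slips to each perpendicular direction with probability $\text{noise}/2$, so it may differ from the exact grid in Fig. 4.

```python
# Assumed layout (rows top to bottom; not necessarily identical to Fig. 4):
#   .  .  .  .  .
#   .  #  .  .  .
#   .  # +1  # +10
#   S  .  .  .  .
#   C  C  C  C  C     <- cliff row, each square -10
import itertools

ROWS, COLS = 5, 5
WALLS = {(1, 1), (2, 1), (2, 3)}
EXITS = {(2, 2): 1.0, (2, 4): 10.0}                    # close (+1) and distant (+10) exits
CLIFF = {(4, c): -10.0 for c in range(COLS)}           # the bottom row is the cliff
TERMINAL = {**EXITS, **CLIFF}
START = (3, 0)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
PERP = {"up": ("left", "right"), "down": ("left", "right"),
        "left": ("up", "down"), "right": ("up", "down")}

def step(s, a):
    """Deterministic move; bumping into a wall or the border leaves the agent in place."""
    r, c = s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1]
    return s if not (0 <= r < ROWS and 0 <= c < COLS) or (r, c) in WALLS else (r, c)

def transitions(s, a, noise):
    """Intended move with prob 1 - noise; slip to each perpendicular direction with prob noise / 2."""
    out = {}
    for a2, p in [(a, 1 - noise), (PERP[a][0], noise / 2), (PERP[a][1], noise / 2)]:
        s2 = step(s, a2)
        out[s2] = out.get(s2, 0.0) + p
    return out

def backup(V, s, a, gamma, noise):
    # Expected discounted value of the successor state (the exit reward is folded into V).
    return sum(p * gamma * V[s2] for s2, p in transitions(s, a, noise).items())

def value_iteration(gamma, noise, sweeps=1000):
    V = {s: 0.0 for s in itertools.product(range(ROWS), range(COLS)) if s not in WALLS}
    V.update(TERMINAL)                                  # terminal squares keep their exit reward
    for _ in range(sweeps):
        for s in V:
            if s not in TERMINAL:
                V[s] = max(backup(V, s, a, gamma, noise) for a in ACTIONS)
    return V

def intended_route(V, gamma, noise, max_steps=30):
    """Follow the greedy policy's intended (noise-free) actions from the start square."""
    s, route = START, [START]
    while s not in TERMINAL and len(route) <= max_steps:
        a = max(ACTIONS, key=lambda act: backup(V, s, act, gamma, noise))
        s = step(s, a)
        route.append(s)
    return route

for gamma, noise in [(0.1, 0.5), (0.99, 0.0), (0.99, 0.5), (0.1, 0.0)]:
    V = value_iteration(gamma, noise)
    route = intended_route(V, gamma, noise)
    along_cliff = any(r == 3 for r, _ in route[1:])     # row 3 is the row next to the cliff
    print(f"gamma={gamma}, noise={noise}: reaches exit {TERMINAL.get(route[-1])}, "
          f"route runs along the cliff: {along_cliff}")
```

With this layout, the printed routes should line up with the four cases above; tracing the greedy policy's *intended* actions (rather than sampling noisy rollouts) makes the preferred exit and route easy to read off.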

### Stochastic Policy

@@ -445,4 +446,4 @@ Consider the Grid World example where the agent navigates to a goal while avoidi
[:fontawesome-brands-linkedin-in:](https://www.linkedin.com/in/masoud-tahmasbi-fard/){:target="_blank"}
</p>
</span>
</div>
</div>
15 changes: 12 additions & 3 deletions docs/course_notes/value-based.md
@@ -44,6 +44,15 @@ Where:

This equation allows for the iterative computation of state values in a model-based setting.

#### Bellman Optimality Equation for $V^*(s)$:
The **Bellman Optimality Equation** for $V^*(s)$ expresses the optimal state value function. It is given by:

$$
V^*(s) = \max_a \mathbb{E} \left[ R_{t+1} + \gamma V^\*(s_{t+1}) \mid s_t = s, a_t = a \right]
$$

This shows that the optimal value of each state is the expected immediate reward plus the discounted optimal value of the next state, where the action taken in the current state is chosen optimally.
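
As a quick numeric illustration (the numbers are hypothetical, not taken from the notes), suppose a state $s$ offers two actions: $a_1$ yields no immediate reward but leads deterministically to a successor state whose optimal value is $10$, while $a_2$ yields an immediate reward of $2$ but leads to a successor whose optimal value is $0$. With $\gamma = 0.9$,

$$
V^*(s) = \max\big(0 + 0.9 \times 10,\; 2 + 0.9 \times 0\big) = \max(9, 2) = 9,
$$

so the optimal choice forgoes the immediate reward in favor of the more valuable successor state.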

---

### 1.2. Action Value Function $Q(s, a)$
@@ -80,7 +89,7 @@ Where:
The **Bellman Optimality Equation** for $Q^*(s, a)$ expresses the optimal action value function. It is given by:

$$
Q^*(s, a) = \mathbb{E} \left[ R_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \mid s_t = s, a_t = a \right]
Q^*(s, a) = \mathbb{E} \left[ R_{t+1} + \gamma \max_{a'} Q^\*(s_{t+1}, a') \mid s_t = s, a_t = a \right]
$$

This shows that the optimal action value at each state-action pair is the immediate reward plus the discounted maximum expected value from the next state, where the next action is chosen optimally.
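
Comparing this with the optimality equation for the state value function gives the standard identity $V^*(s) = \max_a Q^\*(s, a)$: once the optimal action values are known, both the optimal state values and the greedy optimal policy can be read off directly, with no model of the environment required. A minimal sketch with a hypothetical Q-table (the numbers are made up for illustration):

```python
# Hypothetical Q-table over two states and two actions (illustrative values only).
states = ["s0", "s1"]
actions = ["left", "right"]
Q = {
    ("s0", "left"): 1.0, ("s0", "right"): 2.5,
    ("s1", "left"): 0.3, ("s1", "right"): -1.0,
}

# V*(s) = max_a Q*(s, a), and the greedy policy picks the maximizing action.
V = {s: max(Q[s, a] for a in actions) for s in states}
policy = {s: max(actions, key=lambda a: Q[s, a]) for s in states}

print(V)       # {'s0': 2.5, 's1': 0.3}
print(policy)  # {'s0': 'right', 's1': 'left'}
```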
@@ -274,7 +283,7 @@
\hat{I}_N = \frac{1}{N} \sum_{i=1}^{N} f(x_i),
$$

where $ x_i $ are **independent** samples drawn from $ p(x) $. The **Law of Large Numbers (LLN)** ensures that as $N \to \infty$:
where $x_i$ are **independent** samples drawn from $p(x)$. The **Law of Large Numbers (LLN)** ensures that as $N \to \infty$:

$$
\hat{I}_N \to I.
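
As a quick sketch of this estimator in action (an assumed toy example, not from the notes), estimate $\mathbb{E}[X^2]$ for $X \sim \mathrm{Uniform}(0, 1)$, whose exact value is $1/3$, and watch the sample mean settle toward it as $N$ grows:

```python
# Monte Carlo estimate of E[f(X)] with f(x) = x^2 and X ~ Uniform(0, 1).
# The exact value is 1/3; by the Law of Large Numbers the sample mean
# approaches it as the number of samples N grows.
import random

random.seed(0)                      # fixed seed so the run is reproducible

def f(x):
    return x ** 2

for n in (10, 1_000, 100_000):
    samples = [f(random.random()) for _ in range(n)]
    estimate = sum(samples) / n     # I_hat_N = (1/N) * sum_i f(x_i)
    print(f"N = {n:>7}: estimate = {estimate:.4f}  (exact = 0.3333)")
```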
@@ -639,4 +648,4 @@ The choice of method depends on the environment, the availability of a model, an
[:fontawesome-brands-linkedin-in:](https://www.linkedin.com/in/ghazal-hosseini-mighan-8b911823a){:target="_blank"}
</p>
</span>
</div> -->
</div> -->