diff --git a/docs/course_notes/intro-to-rl.md b/docs/course_notes/intro-to-rl.md
index 48de3094..3b4836cd 100644
--- a/docs/course_notes/intro-to-rl.md
+++ b/docs/course_notes/intro-to-rl.md
@@ -204,7 +204,7 @@ Two fundamental problems in sequential decision making:
 2. **Planning**:
     - A model of the environment is **known**
-    - The agent performs computations with its model (w**ithout any external
+    - The agent performs computations with its model (**without any external
       interaction**)
     - The agent **improves** its policy, a.k.a. deliberation, reasoning,
       introspection, pondering, thought, search
 
@@ -289,18 +289,19 @@ In the provided Gridworld example, the agent starts from the yellow square and h
 Fig4. Grid World Example
 
 The agent's choice depends on:
-- The **discount factor ($\gamma$)**, which determines whether it prioritizes short-term or long-term rewards.
-- The **noise level**, which introduces randomness into actions.
+    - The **discount factor ($\gamma$)**, which determines whether it prioritizes short-term or long-term rewards.
+    - The **noise level**, which introduces randomness into actions.
 
 Depending on the values of $\gamma$ and noise, the agent's behavior varies:
+
 1. **$\gamma$ = 0.1, noise = 0.5:**
-    - The agent **prefers the close exit (+1) but takes the risk of stepping into the cliff (-10).**
+    - The agent **prefers the close exit (+1) but doesn't take the risk of stepping into the cliff (-10).**
 2. **$\gamma$ = 0.99, noise = 0:**
-    - The agent **prefers the distant exit (+10) while avoiding the cliff (-10).**
+    - The agent **prefers the distant exit (+10) and takes the risk of the cliff (-10).**
 3. **$\gamma$ = 0.99, noise = 0.5:**
-    - The agent **still prefers the distant exit (+10), but due to noise, it risks the cliff (-10).**
+    - The agent **still prefers the distant exit (+10), but due to noise, it doesn't risk the cliff (-10).**
 4. **$\gamma$ = 0.1, noise = 0:**
-    - The agent **chooses the close exit (+1) while avoiding the cliff.**
+    - The agent **chooses the close exit (+1) and takes the risk of the cliff.**
 
 ### Stochastic Policy
 
@@ -445,4 +446,4 @@ Consider the Grid World example where the agent navigates to a goal while avoidi
 [:fontawesome-brands-linkedin-in:](https://www.linkedin.com/in/masoud-tahmasbi-fard/){:target="_blank"}
 </div>
 
-
\ No newline at end of file
+
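The four (gamma, noise) cases corrected in the hunk above can be sanity-checked mechanically. Below is a minimal value-iteration sketch in the spirit of the classic CS188 DiscountGrid; the exact layout, the zero living reward, and the noise model (intended move with probability 1 - noise, perpendicular slips otherwise) are illustrative assumptions, not code from the course notes.

```python
# A minimal sketch, assuming a DiscountGrid-style layout: close exit +1,
# distant exit +10, cliff row -10. Layout and dynamics are assumptions.
GRID = [
    [" ", " ", " ", " ", " "],
    [" ", "#", " ", " ", " "],
    [" ", "#", 1.0, "#", 10.0],
    ["S", " ", " ", " ", " "],
    [-10.0, -10.0, -10.0, -10.0, -10.0],
]
START = (3, 0)  # the agent's start cell (the yellow square), by assumption
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def states():
    return [(r, c) for r in range(5) for c in range(5) if GRID[r][c] != "#"]

def step(s, move):
    # Deterministic move; bumping a wall or the border leaves s unchanged.
    r, c = s[0] + move[0], s[1] + move[1]
    return (r, c) if 0 <= r < 5 and 0 <= c < 5 and GRID[r][c] != "#" else s

def transitions(s, move, noise):
    # Intended move with prob 1 - noise; each perpendicular slip with prob noise / 2.
    slips = [(move[1], move[0]), (-move[1], -move[0])]
    out = {}
    for m, p in [(move, 1 - noise), (slips[0], noise / 2), (slips[1], noise / 2)]:
        s2 = step(s, m)
        out[s2] = out.get(s2, 0.0) + p
    return out.items()

def value_iteration(gamma, noise, iters=500):
    V = {s: 0.0 for s in states()}
    for _ in range(iters):
        V = {s: GRID[s[0]][s[1]] if isinstance(GRID[s[0]][s[1]], float)  # exit cells pay once
             else max(sum(p * gamma * V[s2] for s2, p in transitions(s, m, noise))
                      for m in ACTIONS.values())
             for s in states()}
    return V

for gamma, noise in [(0.1, 0.5), (0.99, 0.0), (0.99, 0.5), (0.1, 0.0)]:
    V = value_iteration(gamma, noise)
    # gamma scales every action's backup equally, so it drops out of the argmax.
    best = max(ACTIONS, key=lambda a: sum(p * V[s2]
                                          for s2, p in transitions(START, ACTIONS[a], noise)))
    print(f"gamma={gamma}, noise={noise}: V(start)={V[START]:+.4f}, greedy first move={best}")
```

In this layout "right" hugs the cliff row (the risky short path) and "up" takes the long safe detour, so the printed greedy first moves line up with the close-exit/distant-exit and risky/safe split described in the corrected list items.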
diff --git a/docs/course_notes/value-based.md b/docs/course_notes/value-based.md
index 3c593d6b..c6a4c26f 100644
--- a/docs/course_notes/value-based.md
+++ b/docs/course_notes/value-based.md
@@ -44,6 +44,15 @@ Where:
 
 This equation allows for the iterative computation of state values in a model-based setting.
 
+#### Bellman Optimality Equation for $V^*(s)$:
+The **Bellman Optimality Equation** for $V^*(s)$ expresses the optimal state value function. It is given by:
+
+$$
+V^*(s) = \max_a \mathbb{E} \left[ R_{t+1} + \gamma V^\*(S_{t+1}) \mid s_t = s, a_t = a \right]
+$$
+
+This shows that the optimal value of a state is obtained by choosing the action that maximizes the expected immediate reward plus the discounted optimal value of the next state.
+
 ---
 
 ### 1.2. Action Value Function $Q(s, a)$
@@ -80,7 +89,7 @@ Where:
 The **Bellman Optimality Equation** for $Q^*(s, a)$ expresses the optimal action value function. It is given by:
 
 $$
-Q^*(s, a) = \mathbb{E} \left[ R_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \mid s_t = s, a_t = a \right]
+Q^*(s, a) = \mathbb{E} \left[ R_{t+1} + \gamma \max_{a'} Q^\*(s_{t+1}, a') \mid s_t = s, a_t = a \right]
 $$
 
 This shows that the optimal action value at each state-action pair is the immediate reward plus the discounted maximum expected value from the next state, where the next action is chosen optimally.
@@ -274,7 +283,7 @@ $$
 \hat{I}_N = \frac{1}{N} \sum_{i=1}^{N} f(x_i),
 $$
 
-where $ x_i $ are **independent** samples drawn from $ p(x) $. The **Law of Large Numbers (LLN)** ensures that as $N \to \infty$:
+where $x_i$ are **independent** samples drawn from $p(x)$. The **Law of Large Numbers (LLN)** ensures that as $N \to \infty$:
 
 $$
 \hat{I}_N \to I.
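To make the estimator in the hunk above concrete, here is a tiny runnable sketch; the integrand f(x) = x**2 and the sampling distribution p = N(0, 1) are arbitrary illustrative choices, picked so the exact answer is I = E[x^2] = 1.

```python
import random

# Monte Carlo estimate I_hat_N = (1/N) * sum_i f(x_i), with x_i ~ p i.i.d.
# By assumption (illustration only): f(x) = x**2 and p = N(0, 1), so I = 1.
def mc_estimate(n):
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(n)) / n

random.seed(0)
for n in (10, 1_000, 100_000):
    print(f"N={n:>7}: I_hat_N = {mc_estimate(n):.4f}")  # tends to I = 1 as N grows (LLN)
```

The shrinking error as N grows is exactly the LLN behavior the text describes.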
@@ -639,4 +648,4 @@ The choice of method depends on the environment, the availability of a model, an
 [:fontawesome-brands-linkedin-in:](https://www.linkedin.com/in/ghazal-hosseini-mighan-8b911823a){:target="_blank"}
 </div>
 
--->
\ No newline at end of file
+-->
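Since this revision adds the Bellman optimality equation for V*(s) alongside the existing one for Q*(s, a), a small worked example makes their relationship, V*(s) = max_a Q*(s, a), easy to verify. The two-state MDP below is entirely made up for illustration; it is not from the course notes.

```python
# Hypothetical two-state MDP: P[s][a] lists (prob, next_state, reward) outcomes.
P = {
    "s0": {"stay": [(1.0, "s0", 0.0)],
           "go":   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},
    "s1": {"stay": [(1.0, "s1", 2.0)],
           "go":   [(1.0, "s0", 0.0)]},
}
GAMMA = 0.9

# Iterate the V* backup: V(s) <- max_a E[R + gamma * V(S')].
V = {s: 0.0 for s in P}
for _ in range(1000):
    V = {s: max(sum(p * (r + GAMMA * V[s2]) for p, s2, r in outs)
                for outs in P[s].values())
         for s in P}

# Recover Q* from V*: Q*(s, a) = E[R + gamma * max_a' Q*(S', a')]
#                              = E[R + gamma * V*(S')].
Q = {(s, a): sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])
     for s in P for a in P[s]}

for s in P:  # consistency check: V*(s) = max_a Q*(s, a)
    assert abs(V[s] - max(Q[s, a] for a in P[s])) < 1e-6
print({s: round(v, 2) for s, v in V.items()})           # {'s0': 18.54, 's1': 20.0}
print({f"{s}/{a}": round(q, 2) for (s, a), q in Q.items()})
```

The assert passes because taking the max over actions of Q* recovers V*, which is exactly how the two optimality equations added and corrected in this patch interlock.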