17 changes: 9 additions & 8 deletions docs/course_notes/intro-to-rl.md
@@ -204,7 +204,7 @@ Two fundamental problems in sequential decision making:

2. **Planning**:
- A model of the environment is **known**
- The agent performs computations with its model (w**ithout any external
- The agent performs computations with its model (**without any external
interaction**)
- The agent **improves** its policy, a.k.a. deliberation, reasoning, introspection, pondering, thought, search

@@ -289,18 +289,19 @@ In the provided Gridworld example, the agent starts from the yellow square and h
Fig4. Grid World Example </center>

The agent's choice depends on:
- The **discount factor ($\gamma$)**, which determines whether it prioritizes short-term or long-term rewards.
- The **noise level**, which introduces randomness into actions.
- The **discount factor ($\gamma$)**, which determines whether it prioritizes short-term or long-term rewards.
- The **noise level**, which introduces randomness into actions.

Depending on the values of $\gamma$ and noise, the agent's behavior varies (a small sketch after this list illustrates all four cases):

1. **$\gamma$ = 0.1, noise = 0.5:**
- The agent **prefers the close exit (+1) but takes the risk of stepping into the cliff (-10).**
- The agent **prefers the close exit (+1) but doesn't take the risk of stepping into the cliff (-10).**
2. **$\gamma$ = 0.99, noise = 0:**
- The agent **prefers the distant exit (+10) while avoiding the cliff (-10).**
- The agent **prefers the distant exit (+10) and takes the risk of the cliff (-10).**
3. **$\gamma$ = 0.99, noise = 0.5:**
- The agent **still prefers the distant exit (+10), but due to noise, it risks the cliff (-10).**
- The agent **still prefers the distant exit (+10), but due to noise, it doesn't risk the cliff (-10).**
4. **$\gamma$ = 0.1, noise = 0:**
- The agent **chooses the close exit (+1) while avoiding the cliff.**
- The agent **chooses the close exit (+1) and takes the risk of the cliff.**
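
A minimal value-iteration sketch can reproduce all four behaviors. The layout below is an assumption, not the course's own code: it uses the classic 5x5 "discount grid" (a close +1 exit, a distant +10 exit, and a row of -10 cliff squares along the bottom), together with the usual noise model in which the intended move succeeds with probability $1 - \text{noise}$ and slips to each perpendicular direction with probability $\text{noise}/2$, so it may differ from the exact grid in Fig. 4.

```python
# Assumed layout (rows top to bottom; not necessarily identical to Fig. 4):
#   .  .  .  .  .
#   .  #  .  .  .
#   .  # +1  # +10
#   S  .  .  .  .
#   C  C  C  C  C     <- cliff row, each square -10
import itertools

ROWS, COLS = 5, 5
WALLS = {(1, 1), (2, 1), (2, 3)}
EXITS = {(2, 2): 1.0, (2, 4): 10.0}                    # close (+1) and distant (+10) exits
CLIFF = {(4, c): -10.0 for c in range(COLS)}           # the bottom row is the cliff
TERMINAL = {**EXITS, **CLIFF}
START = (3, 0)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
PERP = {"up": ("left", "right"), "down": ("left", "right"),
        "left": ("up", "down"), "right": ("up", "down")}

def step(s, a):
    """Deterministic move; bumping into a wall or the border leaves the agent in place."""
    r, c = s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1]
    return s if not (0 <= r < ROWS and 0 <= c < COLS) or (r, c) in WALLS else (r, c)

def transitions(s, a, noise):
    """Intended move with prob 1 - noise; slip to each perpendicular direction with prob noise / 2."""
    out = {}
    for a2, p in [(a, 1 - noise), (PERP[a][0], noise / 2), (PERP[a][1], noise / 2)]:
        s2 = step(s, a2)
        out[s2] = out.get(s2, 0.0) + p
    return out

def backup(V, s, a, gamma, noise):
    # Expected discounted value of the successor state (the exit reward is folded into V).
    return sum(p * gamma * V[s2] for s2, p in transitions(s, a, noise).items())

def value_iteration(gamma, noise, sweeps=1000):
    V = {s: 0.0 for s in itertools.product(range(ROWS), range(COLS)) if s not in WALLS}
    V.update(TERMINAL)                                  # terminal squares keep their exit reward
    for _ in range(sweeps):
        for s in V:
            if s not in TERMINAL:
                V[s] = max(backup(V, s, a, gamma, noise) for a in ACTIONS)
    return V

def intended_route(V, gamma, noise, max_steps=30):
    """Follow the greedy policy's intended (noise-free) actions from the start square."""
    s, route = START, [START]
    while s not in TERMINAL and len(route) <= max_steps:
        a = max(ACTIONS, key=lambda act: backup(V, s, act, gamma, noise))
        s = step(s, a)
        route.append(s)
    return route

for gamma, noise in [(0.1, 0.5), (0.99, 0.0), (0.99, 0.5), (0.1, 0.0)]:
    V = value_iteration(gamma, noise)
    route = intended_route(V, gamma, noise)
    along_cliff = any(r == 3 for r, _ in route[1:])     # row 3 is the row next to the cliff
    print(f"gamma={gamma}, noise={noise}: reaches exit {TERMINAL.get(route[-1])}, "
          f"route runs along the cliff: {along_cliff}")
```

With this layout, the printed routes should line up with the four cases above; tracing the greedy policy's *intended* actions (rather than sampling noisy rollouts) makes the preferred exit and route easy to read off.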

### Stochastic Policy

@@ -445,4 +446,4 @@ Consider the Grid World example where the agent navigates to a goal while avoidi
[:fontawesome-brands-linkedin-in:](https://www.linkedin.com/in/masoud-tahmasbi-fard/){:target="_blank"}
</p>
</span>
</div>
</div>
15 changes: 12 additions & 3 deletions docs/course_notes/value-based.md
@@ -44,6 +44,15 @@ Where:

This equation allows for the iterative computation of state values in a model-based setting.

#### Bellman Optimality Equation for $V^*(s)$:
The **Bellman Optimality Equation** for $V^*(s)$ expresses the optimal state value function. It is given by:

$$
V^*(s) = \max_a \mathbb{E} \left[ R_{t+1} + \gamma V^\*(s_{t+1}) \mid s_t = s, a_t = a \right]
$$

This shows that the optimal value of each state is the expected immediate reward plus the discounted optimal value of the next state, where the action taken in the current state is chosen optimally.
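
As a quick numeric illustration (the numbers are hypothetical, not taken from the notes), suppose a state $s$ offers two actions: $a_1$ yields no immediate reward but leads deterministically to a successor state whose optimal value is $10$, while $a_2$ yields an immediate reward of $2$ but leads to a successor whose optimal value is $0$. With $\gamma = 0.9$,

$$
V^*(s) = \max\big(0 + 0.9 \times 10,\; 2 + 0.9 \times 0\big) = \max(9, 2) = 9,
$$

so the optimal choice forgoes the immediate reward in favor of the more valuable successor state.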

---

### 1.2. Action Value Function $Q(s, a)$
@@ -80,7 +89,7 @@ Where:
The **Bellman Optimality Equation** for $Q^*(s, a)$ expresses the optimal action value function. It is given by:

$$
Q^*(s, a) = \mathbb{E} \left[ R_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \mid s_t = s, a_t = a \right]
Q^*(s, a) = \mathbb{E} \left[ R_{t+1} + \gamma \max_{a'} Q^\*(s_{t+1}, a') \mid s_t = s, a_t = a \right]
$$

This shows that the optimal action value at each state-action pair is the immediate reward plus the discounted maximum expected value from the next state, where the next action is chosen optimally.
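
Comparing this with the optimality equation for the state value function gives the standard identity $V^*(s) = \max_a Q^\*(s, a)$: once the optimal action values are known, both the optimal state values and the greedy optimal policy can be read off directly, with no model of the environment required. A minimal sketch with a hypothetical Q-table (the numbers are made up for illustration):

```python
# Hypothetical Q-table over two states and two actions (illustrative values only).
states = ["s0", "s1"]
actions = ["left", "right"]
Q = {
    ("s0", "left"): 1.0, ("s0", "right"): 2.5,
    ("s1", "left"): 0.3, ("s1", "right"): -1.0,
}

# V*(s) = max_a Q*(s, a), and the greedy policy picks the maximizing action.
V = {s: max(Q[s, a] for a in actions) for s in states}
policy = {s: max(actions, key=lambda a: Q[s, a]) for s in states}

print(V)       # {'s0': 2.5, 's1': 0.3}
print(policy)  # {'s0': 'right', 's1': 'left'}
```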
@@ -274,7 +283,7 @@
\hat{I}_N = \frac{1}{N} \sum_{i=1}^{N} f(x_i),
$$

where $ x_i $ are **independent** samples drawn from $ p(x) $. The **Law of Large Numbers (LLN)** ensures that as $N \to \infty$:
where $x_i$ are **independent** samples drawn from $p(x)$. The **Law of Large Numbers (LLN)** ensures that as $N \to \infty$:

$$
\hat{I}_N \to I.
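
As a quick sketch of this estimator in action (an assumed toy example, not from the notes), estimate $\mathbb{E}[X^2]$ for $X \sim \mathrm{Uniform}(0, 1)$, whose exact value is $1/3$, and watch the sample mean settle toward it as $N$ grows:

```python
# Monte Carlo estimate of E[f(X)] with f(x) = x^2 and X ~ Uniform(0, 1).
# The exact value is 1/3; by the Law of Large Numbers the sample mean
# approaches it as the number of samples N grows.
import random

random.seed(0)                      # fixed seed so the run is reproducible

def f(x):
    return x ** 2

for n in (10, 1_000, 100_000):
    samples = [f(random.random()) for _ in range(n)]
    estimate = sum(samples) / n     # I_hat_N = (1/N) * sum_i f(x_i)
    print(f"N = {n:>7}: estimate = {estimate:.4f}  (exact = 0.3333)")
```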
@@ -639,4 +648,4 @@ The choice of method depends on the environment, the availability of a model, an
[:fontawesome-brands-linkedin-in:](https://www.linkedin.com/in/ghazal-hosseini-mighan-8b911823a){:target="_blank"}
</p>
</span>
</div> -->
</div> -->