From eb7e253f162494fef32e5f12da04b93c7a3033ee Mon Sep 17 00:00:00 2001
From: Sayeh Jarollahi <62176180+sayehjarollahi@users.noreply.github.com>
Date: Fri, 7 Jan 2022 23:42:26 +0330
Subject: [PATCH 01/13] initialize the project
---
notebooks/21_reinforcement_learning/index.md | 518 ++++++++++++++++++
.../21_reinforcement_learning/metadata.yml | 29 +
.../21_reinforcement_learning/resources/1.png | Bin 0 -> 10274 bytes
.../21_reinforcement_learning/resources/2.png | Bin 0 -> 12119 bytes
.../21_reinforcement_learning/resources/3.png | Bin 0 -> 12179 bytes
.../21_reinforcement_learning/resources/4.png | Bin 0 -> 12520 bytes
.../resources/agent.jpg | Bin 0 -> 28238 bytes
.../resources/cheetah.mp4 | Bin 0 -> 5784208 bytes
.../resources/modelbased.png | Bin 0 -> 12206 bytes
.../resources/one.PNG | Bin 0 -> 36511 bytes
.../resources/online.jpg | Bin 0 -> 668361 bytes
.../resources/two.PNG | Bin 0 -> 42989 bytes
12 files changed, 547 insertions(+)
create mode 100644 notebooks/21_reinforcement_learning/index.md
create mode 100644 notebooks/21_reinforcement_learning/metadata.yml
create mode 100644 notebooks/21_reinforcement_learning/resources/1.png
create mode 100644 notebooks/21_reinforcement_learning/resources/2.png
create mode 100644 notebooks/21_reinforcement_learning/resources/3.png
create mode 100644 notebooks/21_reinforcement_learning/resources/4.png
create mode 100644 notebooks/21_reinforcement_learning/resources/agent.jpg
create mode 100644 notebooks/21_reinforcement_learning/resources/cheetah.mp4
create mode 100644 notebooks/21_reinforcement_learning/resources/modelbased.png
create mode 100644 notebooks/21_reinforcement_learning/resources/one.PNG
create mode 100644 notebooks/21_reinforcement_learning/resources/online.jpg
create mode 100644 notebooks/21_reinforcement_learning/resources/two.PNG
diff --git a/notebooks/21_reinforcement_learning/index.md b/notebooks/21_reinforcement_learning/index.md
new file mode 100644
index 00000000..129becef
--- /dev/null
+++ b/notebooks/21_reinforcement_learning/index.md
@@ -0,0 +1,518 @@
+# Table of contents
+- [Introduction](#introduction)
+- [Online vs Offline Learning](#OnlinevsOfflineLearning)
+- [Main Elements of RL](#MainElementsofRL)
+- [Different Methods of Learning According to Model](#DifferentMethodsofLearningAccordingtoModel)
+ - [Model-based RL](#Model-basedRL)
+ - [Model-free RL](#Model-freeRL)
+ - [an example](#anexample1)
+- [Types of Reinforcement Learning According to Learning Policy](#TypesofReinforcementLearningAccordingtoLearningPolicy)
+ - [Passive Reinforcement Learning](#PassiveReinforcementLearning)
+ - [Active Reinforcement Learning](#ActiveReinforcementLearning)
+- [Policy Evaluation](#PolicyEvaluation)
+- [Direct Utility Estimation](#DirectUtilityEstimation)
+ - [Definition](#Definition)
+ - [an example](#anexample2)
+- [Advantages of Direct Utility Estimation](#AdvantagesofDirectUtilityEstimation)
+- [Disadvantages of Direct Utility Estimation](#DisadvantagesofDirectUtilityEstimation)
+- [Temporal Difference (TD) Learning](#TemporalDifference(TD)Learning)
+ - [Definition](#DefinitionTD)
+ - [pseudocode](#pseudocode)
+ - [Some aspects of TD](#SomeaspectsofTD)
+ - [an example](#anexample3)
+- [Problem with TD](#ProblemwithTD)
+- [Conclusion](#Conclusion)
+- [References](#resources)
+
+
+
+
+# Introduction
+
+_**"... What we want is a machine that can learn from experience." -Alan Turing, 1947**_
+
+
+Reinforcement learning is a subcategory of Machine Learning which solves problems that involve learning what to do—how to
+map situations to actions— to maximize a numerical reward signal. In this learning system, the system's actions influence its later inputs. Moreover, the learner is not told which actions to take but instead must discover
+which actions yield the most reward by trying them out.
+
+
+

+
+
+*Figure: the relation between the agent and the environment.*
+
+
+
+
+
+
+## Online vs Offline Learning
+Overall, there are two learning settings: online and offline. In the online approach, the agent has no prior information about the environment. It gathers information through the rewards it receives and uses them to decide which states are good and which are bad and should be avoided. In the offline approach, the agent has prior information about the environment, so it can plan its actions in advance and incur fewer negative rewards.
+Reinforcement learning (RL) can be viewed as an MDP in which the transition function $T$ and the reward function $R$ are unknown, meaning the agent starts with no information about the environment. Thus, the learning method in RL is online.
+
+

+
+
+
+
+
+
+## Main Elements of RL
+1. Policy $\pi : S \rightarrow A$
+ * It is a map from state space to action space.
+ * It may be stochastic.
+2. Reward Function $R(s)$
+ * It maps each state to a real number, called the **reward**.
+ * The agent's utility is defined by the reward function. The agent must learn to maximize expected rewards.
+3. Value Function $V(s)$
+ * The value of a state is the total expected reward starting from that state.
+
+
+
+
+
+
+## Different Methods of Learning According to Model
+RL systems can be divided into two categories: **model-based** and **model-free**.
+
+
+* **Model-based RL:** This method overcomes the lack of prior knowledge by constructing a functional representation of the environment from experience. It solves the problem in two steps:
+
+ 1. **Learn an empirical MDP model:** The agent explores the environment and performs actions in order to estimate the missing parameters of the model, $\hat{T}(s, a, s')$ and $\hat{R}(s, a, s')$. First, it counts the outcomes $s'$ for each pair $(s, a)$. Second, it normalizes the counts to obtain an estimate of $\hat{T}(s,a,s')$. Finally, it records $\hat{R}(s,a,s')$ whenever it experiences $(s, a, s')$.
+
+ 2. **Solve the learned MDP:** After estimating the missing parameters, the agent can solve the MDP with any of the algorithms covered earlier (e.g. _Value Iteration_).
+
+* **Model-free RL:** Unlike the previous category, this type of reinforcement learning does not estimate the parameters of the MDP. Instead, the agent learns the value function or the policy directly from experience. It can be thought of as a _trial and error_ approach.
+
+
+
+
+
+**Example:** Assume that an agent wants to compute the expected age of the students in a class, where $A$ is a random variable representing a student's age. If the agent knows the distribution of $A$, it can compute the expectation directly as $E[A] = \sum_{a} P(a) \cdot a$. If it does not know the distribution, it has to estimate the expectation from samples, which can be done in a model-based or a model-free way.
+
+* **Model-based approach**: First we estimate $\hat{P}(a)$. The estimation is simple in this example, but it can be complicated in real-world problems. We collect $N$ samples and set $\hat{P}(a) = \frac{num(a)}{N}$, where $num(a)$ is the number of occurrences of age $a$. With the estimated distribution, the expectation is $E[A] \approx \sum_{a} \hat{P}(a) \cdot a$.
+
+* **Model-free approach**: There is no need to estimate $\hat{P}(a)$. After gathering samples $a_1, \dots, a_N$, we compute $E[A] \approx \frac{1}{N}\sum_{i} a_i$ directly. This works because each age appears among the samples with approximately the right frequency.
+
+**Example:** In this example, we are going to estimate $\hat{T}(s,a,s')$ and $\hat{R}(s,a,s')$ using model-based RL. In these kinds of questions, we need to fix the input policy $(\pi)$ and observe episodes to train the model and estimate the parameters. Each episode contains a series of movements with the same length (in this example the length of each episode is 3). After observing, we can estimate the probability of each transition and the reward for each movement.
+
+
+
+| Episode 1: $s$ | $a$ | $s'$ | $r$ | Episode 2: $s$ | $a$ | $s'$ | $r$ |
+|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
+| B | east | C | -1 | B | east | C | -1 |
+| C | east | D | -1 | C | east | D | -1 |
+| D | exit | X | +10 | D | exit | X | +10 |
+
+| Episode 3: $s$ | $a$ | $s'$ | $r$ | Episode 4: $s$ | $a$ | $s'$ | $r$ |
+|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
+| E | north | C | -1 | E | north | C | -1 |
+| C | east | D | -1 | C | east | A | -1 |
+| D | exit | X | +10 | A | exit | X | -10 |
+
+
+
+
+After sampling, we estimate $\hat{T}(s,a,s')$ by counting the transitions that start in $s$, take action $a$, and end in $s'$, and then dividing this count by the number of transitions that start in $s$ and take action $a$. For example:
+* $\hat{T}(B, east, C) = \frac{2}{2} = 1$
+* $\hat{T}(C, east, D) = \frac{3}{4} = 0.75$
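+
+As a further check (a worked step under the same counting rule, not one of the original examples), the only other outcome of $(C, east)$ in the episodes above is $A$, observed once out of four times:
+
+$$\hat{T}(C, east, A) = \frac{1}{4} = 0.25, \qquad \hat{T}(C, east, D) + \hat{T}(C, east, A) = 0.75 + 0.25 = 1$$
+
+The estimates for a fixed pair $(s, a)$ always sum to 1, because every observed transition from $(s, a)$ is counted for exactly one successor state.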
+
+In addition, we estimate $\hat{R}(s,a,s')$ by averaging the rewards observed for each tuple $(s,a,s')$ over all episodes. For example:
+* $\hat{R}(B, east, C) = \frac{-1-1}{2} = -1$
+* $\hat{R}(D, exit, X) = \frac{10+10+10}{3} = 10$
+
+**Question:** What is the problem with the model-based algorithm above?
+
+If the number of states is large, the number of samples required to train the model grows as $O(|S|^2 \cdot |A|)$, because $\hat{T}$ has one entry per triple $(s, a, s')$. In other words, we need many episodes to train the model properly.
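+
+To get a feel for the size of such a model (a back-of-the-envelope illustration with hypothetical sizes, not figures from the examples above), take $|S| = 100$ and $|A| = 4$. The table $\hat{T}$ then has
+
+$$|S|^2 \cdot |A| = 100^2 \times 4 = 40000$$
+
+entries, and each of them needs several observed transitions before its estimate becomes reliable.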
+
+
+
+## Types of Reinforcement Learning According to Learning Policy
+There are two types of RL according to the learning policy: **Passive Reinforcement Learning** and **Active Reinforcement Learning**. Both are explained below.
+
+
+
+## Passive Reinforcement Learning
+Passive reinforcement learning is when we want an agent to learn the utilities of the states under a fixed policy. The agent does not choose its own actions; it follows the given policy and only estimates how good each state is under it. _**Direct Utility Estimation**_ and _**Temporal Difference Learning**_ are two passive RL algorithms.
+
+
+
+## Active Reinforcement Learning
+Active reinforcement learning is when the policy of the agent is not fixed and can change during training. The agent has to balance exploration and exploitation. Exploration means trying actions that have rarely been taken to check whether they yield a larger reward. Exploitation means repeating the action currently believed to be best in each state. _Q-learning_ is one of the active RL algorithms.
+
+
+
+## Policy Evaluation
+Policy evaluation computes the value function of a fixed policy $\pi$ using the Bellman equations.
+$$V^{\pi}_0(s) = 0$$
+$$V^{\pi}_{k+1}(s) \leftarrow \sum_{s'}T(s, \pi(s), s')[R(s, \pi(s), s') + \gamma V^{\pi}_k(s')]$$
+In fact, these simplified Bellman updates calculate $V$ for a fixed policy:
+* Each round, replace $V$ with a one-step look-ahead layer over $V$
+* This approach fully exploits the connections between the states
+
+
+
+# Direct Utility Estimation
+
+
+
+**Definition:** In this method, the agent executes a sequence of trials or runs (sequences of state-action transitions that continue until the agent reaches a terminal state). Each trial gives a sample value for every state it visits, and the agent estimates the utility of a state from these sample values, for example as a running average.
+
+
+* **Direct utility estimation (model-free)**
+ * Estimate **$V^{\pi}(s)$** as the average total reward of the episodes containing $s$ (counting from $s$ to the end of the episode)
+
+* **Reward-to-go of a state** **$s$**
+ * The sum of the (discounted) rewards from that state until a terminal state is reached
+
+* **Key**: use the observed reward-to-go of the state as direct evidence of the actual expected utility of that state
+
+Suppose we have a 4x3 grid as the environment, in which the agent can move Left, Right, Up, or Down (the set of available actions). An example of a run:
+
+$(1,1) _{-0.04} \rightarrow (1,2) _{-0.04} \rightarrow (1,3) _{-0.04} \rightarrow (1,2) _{-0.04} \rightarrow (1,3) _{-0.04} \rightarrow (2,3) _{-0.04} \rightarrow (3,3) _{-0.04} \rightarrow (4,3) _{+1}$
+
+The total reward starting at (1,1) is 0.72. We call this a sample of the observed-reward-to-go for (1,1).
+
+For (1,2) there are two samples for the observed-reward-to-go (assuming $\gamma$=1):
+
+1. $(1,2) _{-0.04} \rightarrow (1,3) _{-0.04} \rightarrow (1,2) _{-0.04} \rightarrow (1,3) _{-0.04} \rightarrow (2,3) _{-0.04} \rightarrow (3,3) _{-0.04} \rightarrow (4,3) _{+1} [Total:0.76]$
+
+2. $(1,2) _{-0.04} \rightarrow (1,3) _{-0.04} \rightarrow (2,3) _{-0.04} \rightarrow (3,3) _{-0.04} \rightarrow (4,3) _{+1} [Total:0.84]$
+
+* Direct Utility Estimation keeps a running average of the observed reward-to-go for each state
+
+* E.g., for state (1,2), it stores $\frac{(0.76+0.84)}{2}=0.8$
+
+Thus, at the end of each sequence, the algorithm calculates the observed reward-to-go for each state and updates the estimated utility for that state accordingly, just by keeping a running average for each state in a table. In the limit of infinitely many trials, the sample average will converge to the true expectation in the following Equation.
+
+ **$V^\pi (s) = E [\sum_{t = 0}^{\infty } \gamma^t R (S_{t})]$**
+
+Direct utility estimation succeeds in reducing the reinforcement learning problem to an inductive learning problem, about which much is known. Unfortunately, it misses a very important source of information, namely, the fact that the utilities of states are not independent! The utility of each state equals its own reward plus the expected utility of its successor states. That is, the utility values obey the Bellman equations for a fixed policy:
+
+ **$V^\pi (s) = R(s)+\gamma \sum_{\acute{s}} P(\acute{s}|s,\pi(s)) V^\pi(\acute{s})$**
+
+
+
+
+An example makes this concrete. The goal is to estimate the return of each state.
+
+
+
+**Example:** Calculate the return of each state according to the figure below and the given episodes ($\gamma = 1$).
+
+
+
+
+
+| Episode 1: $s$ | $a$ | $s'$ | $r$ | Episode 2: $s$ | $a$ | $s'$ | $r$ |
+|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
+| B | east | C | -1 | B | east | C | -1 |
+| C | east | D | -1 | C | east | D | -1 |
+| D | exit | X | +10 | D | exit | X | +10 |
+
+| Episode 3: $s$ | $a$ | $s'$ | $r$ | Episode 4: $s$ | $a$ | $s'$ | $r$ |
+|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
+| E | north | C | -1 | E | north | C | -1 |
+| C | east | D | -1 | C | east | A | -1 |
+| D | exit | X | +10 | A | exit | X | -10 |
+
+
+
+
+
+
+ Now we solve this example:
+
+
+
+
+
+
+
+According to the question, you have to estimate the return observed from each state.
+
+Since $\gamma = 1$, look for your state in the first column of each episode, sum the rewards in the last column from that row until the end of the episode, and then average these sums over all occurrences of the state.
+
+Pay attention to the solution:
+
+**state A:** A appears only in the fourth episode, where the exit reward is $-10$ $\rightarrow$ $\frac{(-10)}{1}=-10$
+
+**state B:** B appears in the first and second episodes $\rightarrow$
+$\frac{( -1 -1 + 10) + ( -1 -1 + 10 ) }{ 2} = +8$
+
+
+**state C:** C appears in all episodes $\rightarrow$
+$\frac{( -1 + 10 )+( -1+10 ) + ( -1 + 10 )+( -1 -10 )} {4} = +4$
+
+**state D:** D appears in the first three episodes $\rightarrow$
+$\frac{(+10+10+10)}{3} =+10$
+
+**state E:** E appears in the third and fourth episodes $\rightarrow$
+$\frac{( -1 -1 + 10) + ( -1 -1 - 10 ) }{ 2} = -2$
+
+Notice the problem with this method: both B and E always move to C, yet we find very different values for them ($+8$ and $-2$). The reason is the small number of samples, together with the fact that the value of each state is estimated independently of the other states.
+
+
+
+
+## Advantages of Direct Utility Estimation
+
+* Easy to understand
+
+* Does not require any knowledge of T, R
+
+* Computes the correct average values, using just sample transitions
+
+
+
+
+
+
+## Disadvantages of Direct Utility Estimation
+* Wastes information about state connections
+
+* Each state must be learned separately
+
+* Takes a long time to learn
+
+* Converges very slowly
+
+ Why?
+
+ **-** Does not exploit the fact that utilities of states are not independent
+
+ **-** Utilities follow the Bellman equation
+
+ $V^{\pi} (s) =R(s)+\gamma \sum_{\acute{s}} T(s,\pi(s),{\acute{s}})V^{\pi}({\acute{s}})$
+
+
+
+
+
+**Question**: How can we approximate the expectation on the right-hand side of the equation below with model-free learning?
+
+$V^\pi_{k+1}(s) \leftarrow \sum_{{\acute{s}}} T(s,\pi(s),\acute{s} )[R(s,\pi(s),\acute{s} )+\gamma V^\pi_{k}(\acute{s}) ]$
+
+We cannot use this equation directly since we do not have $T$ and $R$, but we can average the sampled values of $[R(s,\pi(s),\acute{s} )+\gamma V^\pi_{k}(\acute{s}) ]$ over the observed successors $\acute{s}$.
+So we can change our formula to the following form.
+
+$V^\pi (s) \approx \frac{1}{N} \sum [R(s,\pi(s),\acute{s})+\gamma V^\pi(\acute{s})]$
+
+However, there is a problem with this approach. In one state, an action cannot be repeated $N$ times, because the agent moves on to another state as soon as the action is taken, so it is physically impossible to take another sample from the same state right away. Now, what is the solution?
+
+You can build the average over more than one visit: collect samples over multiple episodes and update the average every time you arrive at the desired state.
+For example, if the average of the first $n$ samples is $\varphi$ and the new sample is $x$, the updated average is:
+
+$\varphi_{new} = \frac {n \varphi + x} {n+1}$
+
+Here $n$ is the number of samples collected before $x$.
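+
+To see why a running average needs only the previous average and the sample count, here is a short derivation added for illustration ($x_1, \dots, x_{n+1}$ are assumed names for the collected samples, and $x = x_{n+1}$ is the newest one):
+
+$$\varphi_{new} = \frac{1}{n+1}\sum_{i=1}^{n+1} x_i = \frac{n\varphi + x}{n+1} = \varphi + \frac{1}{n+1}(x - \varphi)$$
+
+The last form reads "old estimate plus step size times error". Replacing the step size $\frac{1}{n+1}$ with a learning rate $\alpha$ gives the same shape as the TD update in the next section.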
+
+
+
+
+
+
+
+
+# Temporal Difference (TD) Learning
+
+
+
+**Definition:** In TD learning, an agent learns from an environment through episodes with no prior knowledge of the environment.
+
+* It learns from trial and error.
+
+* TD takes a model-free or unsupervised learning approach.
+
+* Instead of waiting for the total future reward, TD predicts a combination of the immediate reward and its own value estimate at the next moment.
+
+The key quantity for a TD agent is the discounted return $R_t$ (the $r$ terms are the individual rewards):
+
+$$R_{t} = r_{t+1} + \gamma r_{t+2} + \gamma ^{2} r_{t+3} + ... = \sum_{k = 0}^{\infty }\gamma ^{k} r_{t + k + 1}$$
+
+In fact, TD learning tries to answer the question of how to compute this weighted average without the weights, cleverly doing so with an exponential
+moving average. In this equation, the return at time **t** is a combination of discounted future rewards, which shows that rewards further in the future are valued less.
+
+TD agent error:
+
+$$ E_{t} = V^{\pi ^ *}_{t} - V^\pi_{t} $$\
+$$= r_{t+1} + \sum_{k = 1}^{\infty }\gamma ^{k} r_{t + k + 1} - V^\pi_{t} $$\
+$$= r_{t+1} + \gamma \times \sum_{k = 1}^{\infty }\gamma ^{k-1} r_{t + k + 1} - V^\pi_{t} $$\
+$$= r_{t+1} + \gamma \times \sum_{k = 0}^{\infty }\gamma ^{k} r_{(t + 1) + (k + 1)} - V^\pi_{t}$$\
+$$\approx r_{t+1} + \gamma \times V^\pi_{t+1} - V^\pi_{t}$$
+
+Updating the value: ($\alpha$ is learning rate)
+
+$$V^\pi_{t} \leftarrow V^\pi_{t} + \alpha \times E_{t} = V^\pi_{t} + \alpha \times( r_{t+1} + \gamma \times V^\pi_{t+1} - V^\pi_{t})$$
+
+
+
+## Pseudocode
+The pseudocode of the process is shown below.
+```
+function PASSIVE-TD-AGENT(percept) returns an action
+    inputs: percept, a percept indicating the current state s' and reward signal r'
+    persistent: pi, a fixed policy
+                V, a table of values (utilities), initially empty
+                N_{s}, a table of frequencies for states, initially zero
+                s, a, r, the previous state, action, and reward, initially null
+    if s' is new then V[s'] = r'
+    if s is not null then:
+        increment N_{s}[s]
+        V[s] = V[s] + alpha(N_{s}[s]) * (r + gamma * V[s'] - V[s])
+    if s'.TERMINAL? then s, a, r = null
+    else:
+        s = s'
+        a = pi[s']
+        r = r'
+    return a
+```
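+Note that because rare transitions occur only rarely, their effect averages out, and the **value function** estimate converges to the correct value in the long run, provided the learning rate decreases appropriately with the number of visits.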
+
+
+
+## Some aspects of TD
+1. TD estimates show fairly high variability.
+2. TD is quite a simple method in RL.
+3. TD does very little computation per observation.
+4. TD adjusts a state to be consistent with its **observed** successor states.
+5. TD makes a single adjustment per observation.
+6. TD is a crude but efficient approximation to other methods in RL.
+7. TD does not need a transition model to do the update.
+8. TD uses an **exponential moving average**, which interpolates between the old estimate and the new sample and makes recent samples more important. The method gradually forgets the past. The recursive formula is written below.
+$$\bar{x}_{n} = (1-\alpha) \cdot \bar{x}_{n-1} + \alpha \cdot x_{n}$$
+ * In this formula, decreasing $\alpha$ over time gives converging averages.
+
+
+
+**Example**
+
+Consider the following grid with $\gamma = 1$ and $\alpha = 1/2$. We have the initial $V^{\pi}(x_i)$ in State 1. We want to calculate $V^{\pi}(B)$ and $V^{\pi}(C)$ after sampling and two moves.
+
+
+State 1 |Move 1 (B, east, C, -2)|State 2|Move 2 (C, east, D, -2)|State 3 |
+:------------------:|:----------:|:------------------:|:----------:|:------------------:|
+$V^{\pi}(B) = 0$, $V^{\pi}(C) = 0$, $V^{\pi}(D) = 8$ | | $V^{\pi}(B) = -1$ | | $V^{\pi}(C) = 3$ |
+
+The formula is:
+$$V^\pi_{t} \leftarrow V^\pi_{t} + \alpha \times E_{t} = V^\pi_{t} + \alpha \times( r_{t+1} + \gamma \times V^\pi_{t+1} - V^\pi_{t})$$
+Calculations:
+
+$V^{\pi}(B) = ?$
+
+$V^{\pi}(B) \leftarrow V^{\pi}(B) + \alpha \times (-2 + \gamma \times V^{\pi}(C) - V^{\pi}(B))$
+
+$\rightarrow V^{\pi}(B) = 0 + 1/2 \times (-2 + 1 \times 0 - 0) = -1$
+
+$V^{\pi}(C) = ?$
+
+$V^{\pi}(C) \leftarrow V^{\pi}(C) + \alpha \times (-2 + \gamma \times V^{\pi}(D) - V^{\pi}(C))$
+
+$\rightarrow V^{\pi}(C) = 0 + 1/2 \times (-2 + 1 \times 8 - 0) = 3$
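+
+The same two updates can also be read as exponential moving averages. This is an equivalent rearrangement of the update formula, shown here only as a check, since $V + \alpha \times (\text{sample} - V) = (1 - \alpha) \times V + \alpha \times \text{sample}$:
+
+$$V^{\pi}(B) \leftarrow (1 - \tfrac{1}{2}) \times 0 + \tfrac{1}{2} \times (-2 + 1 \times 0) = -1$$
+$$V^{\pi}(C) \leftarrow (1 - \tfrac{1}{2}) \times 0 + \tfrac{1}{2} \times (-2 + 1 \times 8) = 3$$
+
+With $\alpha = 1/2$, each new value is simply the midpoint between the old estimate and the new sample.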
+
+To see what an environment for such an agent looks like, see the code below. It is the definition of the "HalfCheetah-v2" environment from OpenAI Gym [15], in which a learning agent receives an observation and a reward at every step.
+```
+import numpy as np
+from gym import utils
+from gym.envs.mujoco import mujoco_env
+
+
+class HalfCheetahEnv(mujoco_env.MujocoEnv, utils.EzPickle):
+    def __init__(self):
+        mujoco_env.MujocoEnv.__init__(self, "half_cheetah.xml", 5)
+        utils.EzPickle.__init__(self)
+
+    def step(self, action):
+        xposbefore = self.sim.data.qpos[0]
+        self.do_simulation(action, self.frame_skip)
+        xposafter = self.sim.data.qpos[0]
+        ob = self._get_obs()
+        reward_ctrl = -0.1 * np.square(action).sum()
+        reward_run = (xposafter - xposbefore) / self.dt
+        reward = reward_ctrl + reward_run
+        done = False
+        return ob, reward, done, dict(reward_run=reward_run, reward_ctrl=reward_ctrl)
+
+    def _get_obs(self):
+        return np.concatenate(
+            [
+                self.sim.data.qpos.flat[1:],
+                self.sim.data.qvel.flat,
+            ]
+        )
+
+    def reset_model(self):
+        qpos = self.init_qpos + self.np_random.uniform(
+            low=-0.1, high=0.1, size=self.model.nq
+        )
+        qvel = self.init_qvel + self.np_random.standard_normal(self.model.nv) * 0.1
+        self.set_state(qpos, qvel)
+        return self._get_obs()
+
+    def viewer_setup(self):
+        self.viewer.cam.distance = self.model.stat.extent * 0.5
+```
+The video of this agent in the environment is shown below.
+
+https://user-images.githubusercontent.com/62206142/148566163-089a2a3c-4361-4997-b956-d4e790cf1cbd.mp4
+
+
+
+## Problem with TD
+What we ultimately want is the best policy. Although a TD agent finds the value of each state (a value that converges to the true value over time), it cannot turn these values into a better policy, because doing a one-step expectimax requires the $T$ and $R$ functions, and in RL they are not available. Therefore, a new method is needed, which is called **Q-Learning**.
+
+
+
+# Conclusion
+
+
+
+### Summary of Discussed RL Methods
+#### Direct Utility Estimation
+* Simple to implement
+* Each update is fast
+* Does not exploit Bellman constraints and converges slowly
+
+#### Temporal Difference Learning
+* Update speed and implementation similar to direct estimation
+* Partially exploits Bellman constraints: adjusts a state to 'agree' with the observed successor (not **all** possible successors)
+
+### Why use RL
+Here are the prime reasons for using Reinforcement Learning:
+
+* It helps you find which situations need an action.
+* It helps you discover which action yields the highest reward over the longer period.
+* It provides the learning agent with a reward function.
+* It allows the agent to figure out the best method for obtaining large rewards.
+
+### Usage of RL
+Here are applications of Reinforcement Learning:
+
+* Robotics for industrial automation
+* Business strategy planning
+* Machine learning and data processing
+* Training systems that provide custom instruction and materials according to the requirements of students
+* Aircraft control and robot motion control
+
+
+
+# References
+[1] Passive Reinforcement Learning. https://kcir.pwr.edu.pl/~witold/ai/mle_rl_h.pdf (visited: 2/1/2022)
+
+[2] Active vs Passive Learning. Shweta Bhatt. https://towardsdatascience.com/explaining-reinforcement-learning-active-vs-passive-a389f41e7195 (visited: 2/1/2022)
+
+[3] Active and Passive Learning. Philipp Koehn. https://www.cs.jhu.edu/~phi/ai/slides/lecture-reinforcement-learning.pdf (visited: 25/12/2021)
+
+[4] Model-Based vs. Model-Free RL. Russell and Norvig. https://web.engr.oregonstate.edu/~xfern/classes/cs434/slides/RL-1.pdf (visited: 6/1/2022)
+
+[5] Online vs Offline Learning. Banghua Zhu, Cong Ma, Jiantao Jiao, Paria Rashidinejad. https://congma1028.github.io/Publication/OfflineRL/OfflineRL.pdf (visited: 7/1/2022)
+
+[6] Direct Utility Estimation. Oregon State University. https://web.engr.oregonstate.edu/~xfern/classes/cs434/slides/RL-1.pdf (visited: 4/1/2022)
+
+[7] Direct Utility Estimation. Carnegie Mellon University. https://www.cs.cmu.edu/~arielpro/15381f16/c_slides/781f16-12.pdf (visited: 5/1/2022)
+
+[8] Example of Direct Utility Estimation. Shweta Bhatt. https://towardsdatascience.com/explaining-reinforcement-learning-active-vs-passive-a389f41e7195 (visited: 5/1/2022)
+
+[9] Temporal Difference Learning. Andrew Barto. http://www.scholarpedia.org/article/Temporal_difference_learning (visited: 1/1/2022)
+
+[10] Temporal Difference Learning. Andre Violante. https://medium.com/@violante.andre/simple-reinforcement-learning-temporal-difference-learning-e883ea0d65b0 (visited: 1/1/2022)
+
+[11] Temporal Difference Learning. Viet Hoang Tran Duong. https://towardsdatascience.com/intro-to-reinforcement-learning-temporal-difference-learning-sarsa-vs-q-learning-8b4184bb4978 (visited: 3/1/2022)
+
+[12] Temporal Difference Learning. Henry AI Labs. https://www.youtube.com/watch?v=L64E_NTZJ_0 (visited: 27/12/2021)
+
+[13] Sutton, Richard. Reinforcement Learning: An Introduction. London : The MIT Press (visited: 6/1/2022)
+
+[14] HalfCheetah-v2. James L. https://www.youtube.com/watch?v=TpWXyauJ3M8 (visited: 5/1/2022)
+
+[15] HalfCheetah-Code. https://github.com/openai/gym/blob/master/gym/envs/mujoco/half_cheetah.py (visited: 5/1/2022)
+
+
diff --git a/notebooks/21_reinforcement_learning/metadata.yml b/notebooks/21_reinforcement_learning/metadata.yml
new file mode 100644
index 00000000..d775fce1
--- /dev/null
+++ b/notebooks/21_reinforcement_learning/metadata.yml
@@ -0,0 +1,29 @@
+title: LN | Reinforcement Learning
+
+header:
+ title: Reinforcement Learning
+ description: Explanation and Usages of Passive Reinforcement Learning Algorithms
+
+authors:
+ label:
+ position: top
+ content:
+ # list of notebook authors
+ - name: Bahar Asadi
+ role: Author # change this if you want
+ contact:
+ # list of contact information
+ - link: https://github.com/baharasadi
+ icon: fab fa-github
+ - name: Neda Taghizadeh
+ role: Author # change this if you want
+ contact:
+ # list of contact information
+ - link: https://github.com/nedataghizadeh79
+ icon: fab fa-github
+ - name: Sayeh Jarollahi
+ role: Author # change this if you want
+ contact:
+ # list of contact information
+ - link: https://github.com/sayehjarollahi
+ icon: fab fa-github
diff --git a/notebooks/21_reinforcement_learning/resources/1.png b/notebooks/21_reinforcement_learning/resources/1.png
new file mode 100644
index 0000000000000000000000000000000000000000..8e4d1f9cdde5c47c8f30f65838042b4b03329063
GIT binary patch
literal 10274
zcmeHtdpOkX*8iv{bV84(g9v5EI5oq!G#_G=(ruQz2ANVVp9foH_`Z
zG&xRakio==DC96S&So&iF!SED-~H@&U;Enc>ABwbcfG&;hs*W7=ezFvUZ3?@>t1Vp
z@95K(W?MJy-UNX_ww^p;`WpnYjsgB!M1{df6h$r<{1fu|&FmPYuvLZ$9)3a{MI41d
zUd3%*cHRJ@9%|IYZq?4vc&jcVRd#$frINLWfaMK9-%VYnx
zWABq3B+GMhK=Sr)kzs2NhL^^SSKOMp{kT1;SgrrS}b)y8WdLA!(
z%X$+bkXsspqGeqOM8Y&_J>-anwHO3aV+se`mjvobSo~K8^+-WYHyC=Emhg3uTT30O
zmC`PGG7!j}Xm1EaRIy12BA*Hsf!w+{AZ?g$T=rY+t>=hpJ!I@T?U13(TQ<>-JhhT0
z_`Ci7-jly4{QvV{<`+Z8TklNEe3_T&{NYpZb_S=DjO#
zoPX12fkN%1szTCFC9T(LJEohNsoD^-H07<^lHv+UKM4juHZuh|FIEm*Q1&Y0pVd2f
z9trt);)zVwjdJF4u6OE(6x(u!=q<&c84!f#mq~2cXw`w6bPM=TSr}9Jk!-cRL;}_(
zmfc-4&zqZ-k~R`rTj^x|Ry?O?hF)(aKoEaw5=wVjaM25`dScJot`odY>mkr2DovHE8_-9+Yjw*F>>wsy;K>qs
z1?F8{e0?*a>@Nd1z`)z9e|~pl8w1j&Vvw+@c@oh}oXq
z_IjGjF!o9Xzs!PPCYOvemecwt5R#05eRV(Y0EoK)Rnk{^(A7O`PI{BOv*B*j3;Y&y
zJFtzY{Y^qwlbypxV^Q?J0q@EJ!G5(3KxQz?`<9-%FTK0WyYeV)LJFb42%x-w?3AAg
zpnnZW^we5_O6({4UZ%`k3A|wj@00(tRF5w`+R6uZ!Tw;Bxw*
zxf=vjqHzGT>7IePU3+e!CROMv0P5<;X<=Gz9j>sTu_XNRIEOi;3AWb(XnUqmK>X0<
zSYEP^zJ|dV(DS)Y(7`?AzJ4`skGbu+Vf$K_I;A6305i%`?>BopT6^+#IIj2?fXdod
z&*)}O`-F{_rxX_K!k<991x!{0jOPJ$&Nf}w;Vyyk^j(6p@Sej!^D%`cAr)e(WoVzE
zJ$ZeRML`g=iB|tBo&N*or;$MSrXTJl{o2>Vwnz6I(%@m+Ygoh2xH2@I65+k&$YrpV
zV%?GG7)T(OnP?1J2XT^}Oqus+1)vM5zhEst5
zoB}Lq_V%2azyg#Vf6NVBtnuo>3zc65!ZaCxnT5>KOd+%-5y@kT$*vXSnI}0>tl))?`E+Ho7kPhudVE^krtxhD{lf=W=Hfg*
zyOMd2p-s;hyG&@2^TppjP0%r#ZcbW19Pb{6k_Icp6M!D8Sa`7*+W|JiAebz5B9x3;
zuIpej9LwV^fvd*HKt;aW=557$tpf=Kw$!z&uWeG7){pfHuRx
z-b)(Tk=nfT<=(o004xaxMEXtY0c77LX@lv@V2T}2WG-gwv~>@QuTCfnp2FNL1t^sV{RC!D_S)fb
zS=myPFQ7aA4x7QlVBl*+)N3Ij{qzMxCo!Hj$E*B%GuSEETHIO?HmoaPy}ZC|2{U1w
zEr_0VO3vp0B~(lAZ4#0_`erL0M7C(}UDrIdCdu>ttS43gHqR)ih#d*cFSkoh-3$?r
z-gqCVEDe&0K|>E?R|z{s5h{H=>fg~@;xIcGQw8G7sw^NLJ`%S
z!Qt?&4WO=l6ND*|)rj)I-0ok4ekUMC=bP1HDXwSE@q3+lIZ)HfwJ2k0Zi$B@jhl;!i%h!f47n;@Ttg`u`Q%Cld{ih^Bd~&MrAz{A}Ctyp%^XF
zM})DXc@;L)fzSRluA3qVG1e!cD>MVfqUYvp;_^f4vfUdJP24Xnwi2Ql{II3hv!RlV
z;^?;`86MR;`*)%$m=g?i;J}I7iOgU=EN95cIAEb-6D1_*
zB5gA>>PS&C4;DFMB!`_fsH73E=Wx_=jGhE=l=sD7+bG@g?GCeE7j^&>=b4Cd#V6j0v&LNVDYDC%t
zx4I?uzZ9L*#ga%n*SU#}4a+|0DA+|#Ib4JXYu5xC9hk>Qpfd`dheMsh#B=6%FGHQI
zgXIx|sd$N}(LoyZ%M(oBxuTW_vlZKk0V1KU_U325V&Pjx*t4x$mNvVOUEm-O0Mg1t
zpLnX?MYfV#9)K|$BSD79-zr>xOc#ORAFK+U;MI#@oT;4|%&$8ImXS(U`DqcXxFd{v
zLDOShRYB7;X6|DTMhD#XH@Sy1#1;0wgDAT-#8qpBLe
zVD(Cd;<|gREnN%3Z;9{g@jBQNylCy=)*yEYE*86@xd*P${P-G$JrgjiTEauyY-4Dh
z+O|$0G4lG{D@TUiQr3uq*bCqJ`^ePuNa4Nrcxc-210Fh2f}s_wk)M2Tekk%wnd4=3
zDwP`OEY^Xh&KFIoO{*12IX-#xtAiMuoUtcs|2=tFXB;V@tjXccTy1ciMx+Flsvp69
zC&|$2WKG->&6*W9ZnXWy`C5$X`?r!KiZ6uIp!se0L{rn%VOclsX(sBL#TPA3(P`@H
zVefV1pnE0Em?LDW*%JH~j%3}`HK)p-+csjh;-Vd2*+xvzp-CHdc?k5U^RrzU
z8>B~Mn^F8`Gv++(x$))8?z}g?7imQI{+TO1vzMmD>o}M_oLU#yLv&7A_cc8%H^V_E
zV=9;%<2(56G=>uRR^UAo>^M-icpVte6r;fXSk~3PFxvi%xv@xYDlteSF(S=(xJq!w
zPmJpwMP{ASJLE@zIHe$SXh;h)V}as@m*0JI0@N43V1rbvbR7A&waeC~d(-Wyoupv4a~M-8?F@f2s%qyTpZZve*5G8c9zBP@O(E~3b5(WQ#k)esLZk{u
zFKn2q5s_qg%-nvBY>8tDX}coDt_*pUwjRNkoUgwt8-Ilt=V-_rFCJG2!YrJVoO
zn8yE2!3if91KePTpIppZ8N+$cYtNB26Ag#75
zkkcD&Gkpmw+t=bO8fKDYla}zxg!z6Ndlq?F!(X8#*(6HCM{in>#po{IjY>BY>*JO@
z`jopl_C$E(?sB%l1cgiR8(nd6gj%}72fly1E#LI)+;kYUf8wi^duts53i8JxbH?E?
z{iC5!L`UUZc2w9S%@K+xx7)sfSku$wRG^Zot$W^Rmnl418oqd}hcK}IX~3eOxH3(6
zEX4=6-0Co~6c&8^pesMW7l%6KrG#)GRTOeci5aDpyS5rK!7U>;ZP2q1nM>dq!xEjSDI(gQ7&YM@bs#
zvCD2j@Ts`{YQ;iT8Z_?xWLU$bQq!}B#5ff9*%Z^+Y~ODr(8}<}OWNX6Udmi;O{*=#
z;kK(e-!@BO?^Jb){bH456Nj`gz99Da?2dA#XQ3+=kA3Wn3LMQ)*EJWBKA0Zs=2$;S
zy-@)&tjJWtiVjtnme=yIXqQxwp9$GT-K!txH`SLM%43-`t+1gAnR<-<$BOs52GP+Z
zl*6jzi)e3$?5Om7f*LGq2S}@O_R#XG5Dz+kWU0Y&sPc6M`i+k69bdp);7I
z_Go|c+%B=$yy$_)o1B7w3cCiK>~+P8Jb6^Erj{qyG*EAt;FlVmZ7G#fXh6(w()C*y
zvXm&;-fwn)QFfd{3X&)xlEdym^LN=6)LVCIHhAR&o%|y`MOCfsG$B(`IQ`0l#-Te@X
z>IaVYpeITvj-uuNQf6UX*`x5b_9Z>Nx8E{lC&5dfUG%t|Ek#Ns--w>x6UM_1R{V}3
zhUhn;&d
zp7GT?#BG?XO~B@fS68RG-#wVSo@{(c9lm#@kP+zqaY&hd$QmMS_g4kf)#W&J0)>$)
z2X{;V;yFr1=SYgYq_li6d>g7M$3IAAy~yo_6`I(_r
ziyQCHA8O#IEy(bph-e?k#YAx0HQ`?B(#!U>f@4;y2sF8MHd*_BF>jxLT^;ZI8`w+6(E(`nQ=FsnJCbj#1k~=G$Sa4w{qva+x-f?WUMa>T+$b
zbFf0oirsS1!A!EZLav8#u1^>GfWYc`i6E0hZ+peMV>T%3NKO+Mq#7-jlY@#+e+oIl
z183Eh1@*}rY$b2fJBH0bDx}HG(H%SXNDiIzLJC)?*4u-_hn7@kX`905K_5zq
zzcB8ZKwNEhEqh{w&hC;uWtCDFs?9H84JxtA#+hSXPPwZVMjmFRFi2x~13H3@>pS8#
zvNKf0Y$b5t&!x~pqoIO>;E);K-b5O8)|MV
zOX($kH+yvNCu!X;E#5|$nh&b0VXEJ)=ZG0KCNG4RQfIW?41>S&6{hb2O826{*O5CL
z!Xch#3K4iqq&v*o^40_C?^!OApb2ZX;_-E4xchzxS!x!gk4FS$E%Jt=1x{`
z+*-@!1i*JJW%jeu(}vT5lgST=49X!K$FiYv`Rao2(zo$o4j~Qx-K^D~<;9#d!4F{J=sHIW&8kSjctbU9_wWYdaPdJ(%5WF?
ztvzxc?auRIMqR%25+nd8VOxw0aMQ>4Z!rF}?c(6aB6_RiC@$rmhB1d?Ovk&x&Fk&i
z-0n0jZH!t-{Oka^Rt@fHI;&R6-VawI4u>60vZ;!g%G79F)VxqbJtcb3AR#Z5SXJ5h
z!mgE7BgZe18C62*8u=KJgJ~_rS_=?sNve;k@rHwQ}CE*_TmHCdWDQHoaK&e&U<);x2UGmUF9
zOtsNP>}d_^^OshWt5_hB+C6ZwhKD##Vq1fa`sQUGjT?&GA7^(}g9^ADA0pRKh4UjP
z9UzbIde3toT2XA0R;ER^%6}2G@|aK4%FRF(v6{_dVErQuozGmBG&8((SQ=#8HS8P5
z)g{sDoglulwmURfjcV1;nCi`vmZgm|o~I^D4u9OTaC3{SwdJ^-v03&n&U~KJ*5Fha
zOp2`u*E%-4p}U9@W|l-(F^am9^U7rXo)Pme{<-c!hqJZ?AC(2QRHq%%cr+|3Po*{(
zl2Y<@qqlLQiR9uPgU_b$m(D9PF{H}+WkEZCGHlz3za`we|EQ#zl|`AHNIEos`#niH
zweyd+-!-GRG5y;KerQ#}M@xNw^6+QtD^Uz7Z>5)846H4O;CEscq)tgrS10r5TMIkY
zQOd1*`vsQM_@FLnpT`@QT}m-g-n$X;bTSd&^*`P
z4W-?E$gc`gUk0(`4?n#VR4fD%)?wym=3@u2MqFd$WR=EQd@4`IpcYJex
z3?~DM&S{Zi7(I9n+8GZiT0SE#ON%UB=G7QZ;H+wgCe3*KJ{kh79mWlwTc%q5DDu4u
zq5a)FjX<`4P|)pfbx*HO)#RzHtIB9~;_GBq;IYCXe<=@BCE|wpROjpXIR9#Sa(tdx
zOx`xmWxpR?2ak7Hyeeoq$jBjr8t!^te3lSYynn|I`f{9Z``2Tzj-uSTx>5p85In!P
zrrPIn)sl@>PCk9ey?^m^&+PC}Rb&;c0x!|WEL
z;9^45fA!n;|Gg&TAFEMu>m=UWvCJ~xCKdw8GSXI@zMHDN`mz6SzSr9={z_w76(%WU
zT}>>CP02os=wDnHR;wui>@%8~_BS^L{zmx6^?`p5{sAa;mi5}is&+Abo7$fL8+&Mn
z{(uY1`@254H`|w&%Nz*9lvI1C_6q`DF>zD;PB|
z?19X5{0Rky_W00W?p^%1?$rG5rGMSCrV|{M&5nFhblL+mdvSGHbeG1T%bRgm#(X^}
z6O!|+_MDPrXope%w>|4WY+h3CBrVs5F6o|AIfBIq9}DSOtkksm1Fj{lLHLi1c+rNv
zB^Lp65PZ_jO;__5EH~gqf+4
zh_I9}1VJLFP98r8LA(_3Gz$rUH^`RbyTKnGpL0g~P(jN!GWfB^<(SDa2r7l_5_pFVZ`m_>lY%wWxZi|*>l@10c90ROi9c+=!m{E9T_kKVw;jwPY{q&9eQFURn|ZT@_aO1ZIPG>>3CPr4eZ>
zU<8Fin*Q<+n!1iwoW`NO
zE?@o?4nb|Ym@`9{^8U296fuFG+vHDML+_I?Vvdu;j=iZA5QtvTn@qWgB$lN8E0h0R
z=MJJJZNO*8jvYwU*T0#;izEDWWob;Srt3OA#fa7+Lk(%u%^0F4Ivf7k)<&Jr_2{FJ
zhuj9GXXGC{EN@+q6=IJ#gVRrS`FF+iPe}BCr)g^R@LuF>Wps437GbV)OcTkAnz_FN
zJx|xy!18}uVc)i$gc!Xe|P#f;gj@I&Fc
zCsW>3>?^CAH}*Bmu62}-Gpk_mr>@;6+MdvBOs+MTAbCcgqy{|pxU
z{h1%M_%?={3GRN5VI#;VaDq()wW-aL!Pmg*H;q}76m8M%6N*t;i;KO!^TR6^x+%|0
z59QA+>RL9;ELnB(+yWM1r_!KGMWO>gYoNE6-aJr4|32#KJ{I3?p$ngdXHDoBYqZL~
zxTQ_J&`+S}6DSZ~$S=6hiWKXyK)Yb>0c*An(NV`?gOb17TNcT|QzY&!_ZAc?9G)
z;y0LeH?rwY(eDOh#>bh&(p#Ke>Ew}01Gil(7RKo9n28I1!Kz@XCNr;fC5~!f#z$+^
zVkeq=V@*>AVkS^t1J}*>=xx@QNsa3){b3%GbkdGQj`^tN?{wqEr!A5ivAm(+6e+t)
zXRdr>4<-%JXfHpjF@hpu9C4Y<)3Q9SKnb!(x?T9o37&8)+FPx4&_m`&>RShfBDwyx
zDVFE1UN&DxQ9#yPHl*&{cHeI!b~W0*{X
zQDNrrh01sER9U)WSuGV)-e_b2W9YS;LhE1&Tt=Q3=tjQ&5@{H@FKXw5sElOf^iT|a
zvRjL{O%i;{V5-39E$YiV_P*3*R~FMvLojNh*y{#+^hiJq@qJOo&r)=(Jh80A&mDcy
z^-=?7!fjfhU+h7R$C)x2281?|49jeZo*+>@5wN&@t?lU-_xF~Y3A!eNr|td2Vy##b
z3T=m@T%)a99aF;!ml2N4OCSCG;L6cO{!};6P1^9G^K^7Yo$#}a4ctah;pp8H!OlRp
z_CGmlK$xq22ubeMmvQXbjD+QsbnqnZ0{b-^y=Vl@N&W5|B75U(N01L1NT+zDfU>W;
zks`NMVL@#>es5*qqC_Ay?tu7=|Fdt9gtfNs5KQG;4>q{!fPzh=P4Z=sYl3Be_fuaw
zZ$8*rU(BZl?uXjAn|tBtCdfenET*lmSQ`$tNplyOr)4ZUZ&T0RJ5SphAgr*i{(jmT
zhyZj$YW`PkmU=JOrNFIJ1LzZQgvi<=jE`vv$p5!8<+cU)R8ZU`J2
zq^1uVIRz@sd~#BoLEy%79$w8+#~DFVV7RJ-aYK9N)KbUlTWJ=iOW$L(I@@SPJAyNT
z=n5WWc0JVUIva-4jKT-(-J>_cJ<)qw^lW=rhqd|Aby<&VcVK$++_#O-GB|2w)-;><
zCG>2)Ma{%bsYq}xc7_jKl>*~X(OHjS-}!&jB}L*Bt+Zmy!UV5Nw&rhd%k&kA%f7@O
z9fpEp^>lx7CQV4s2MpYOg4D6J(`z^TsPdNHERMYLK#P-z*RT9{!Nxnx#_zOq^`5_+
zs5A)Ip!E=D$llb-=*`@T%<4yH)ne*lKW&P?i@JT~O_f&l(jgObX4b00yZ6RsE{qk&
zp~?MI>Q-ekWPgeKPp5}njIa!jVE+R
z2aoWcsux4n==OoN8qvJR!Hd)6ninYDGuv=46E|?frroZT_jrB9xAgl2@~U}GyO}{!
zY+x;wyKi458M+A6L)KO1hf=fQP+>LIj8zvW
z5!?pOE9y-05!pjsb|kl=PktbEu9gJR{iPoDLwut
zpYyE;X0=T)S$j&O)if*N#9kyi@o1nH#^+M}NkK`jm=JXfStK9KkT9t)X1paR&h+*5
z&5mDy(XO`b7xtO=jJnNU(9^|vwaAFn?j{fWunuM_oCj#pP|hAkP`4U>
zgWX-T@W@RR)t);$JFa4Q{vSlA|2whNe=m|&quK3j)z&z8@Nt$zbyc7uT(1yw{jxXC
zVRF1Mc#()`5%ZKWT0#B_AN&sDP;&~qz|HDU<8-fGVak_nyb3~&jcRXX)i*zfAiY+h
ze=?5$`%sz}3g(cVXj`2AG*s^R(fs`Ot718USdWc~Bdpt;t#MP$_x^P#5xH#j(Wf=?
z!Glq9m=1?oG*&5qKRW1uoyQWNDfaX|7c|5a%7e_44}5!)YQ_O|+A*>bIJd9|$mDQ^xnexCmBGgkJaSXo|P
zCbO`${hP`HDZL&}s;y6!NO?ZSq!L-y`7>^5^F)77V@>}D8RQE8tS%7ou{rmjSS+-%
zi{1>c_3P2UaQk-Jz3_}}qPmvf&HfVWXTRH?u}uVRXmD9y`wp``zcgTC!uKmPw=iec
zf`>>Qq?3(alRAF521UX`SyJ4a?V(x}3rxQM0SubqVB4bN=MC)D@-fyZ!5+9(MA~(I
z!Z8r;$5K};kF<2h6cYJ58S#jV^W5$&u)Il;9o>$4mU_v@OY9@gb^RiLgY3dhlsArs-jkIAd}JTJGt)_?DM6)(B8`*T3)X0I
zIDM@07pjZjRd`Cu@@|ZloAD=SA%?oX%sv!%sdvVC(^!x_f2~_hn-G3TC70Z|NR_rV
zL=3;LHXtX7y{5t6M3^B)0_yVfO49r}=U8(cWjI4r{f#AuL~Zr>hsng)E!MRAMLY8|
zKTr}i`+D{c$M}v6e4y9{KI+o!;1#eFCznXnf*!OP4{~jIbj%hZK=
zm6Eakb(Tel(=KsV^ZSZo^Z4I;1_w{BsXbfswhj5H#aBU_CO6Hz7FiDLIqTziuE
zrcr_{N^o?{_AMxGsiFl2?Kw|7)*-iK-n{V;#bdT|Y(fx@(UIgX+|*W)+5
zf8JRi(^QluXyS?>T40W}=pO%7QqtF0_kDF;S8*U?$@c@s?oEiTEMb7}d+%C+q?DcVkdWw(w2$JdrZ)SA;K%@-yBd@yOz{0c_
z&Q~FsW~@L&jwAMQ(3>ejl(iVepb0D@eYs5I^LYS=wVGGl!U|cH61nIK>T7|mkZ8L;
z9$g5FVglgV5fF57N1cC06g)6)Tr|Ft6Njy$N?&m}l`Xmli2l)IME!(iwKps`FeW`G
z7U5vdUJmahBUtz
zaNR9?1}IfY{VVMZ<|^!>z&`1Aerq^#%VUM61|Lj1TH5R3soh!T?+VAgUYL`hqL*XM
zs#kr;>wri?1(FJ@AP+GfS(TL!B`9Fp|Z~<8>|rjWQ6%oWuoS
z{hJF%jJN^otTZyQ&&$i{q{*~olljqn8AictZiu}`MTmjF1V%RhHeOz;&7L6;al>!U
z=kv*J{^8uz+fuYOzoKF+N0abl?(G*bO6_}aaG4FY*0xlyS)~cTpIGj;xK7WjUo>95
zYYJtl7N-e{oHv=NN9N%<;iD3iAdwZ#VfRfI(YnJm(}Ax!Xq8%{Y|%~Jv(q1Tjtxxr
zA}n|%Hux+A<{F;yt4N@KBV-JnQtKyadXH`5=w*wFfTmADQ*lhXtemmte2=#1#qHT2
z2N~66%A6tEY6Nawo?#Vjc<@f7ncNm2+yhsbnQ=ksvc}ZSKRRZ%E4>F+cY#o9Micr{
zB`T#9bm7-p{cO>7+<{r5h@8^8dzhrM%ukgqp%>z;lGn|L!hVJo_08)NiT
zggU&=W<+Pxx4qXlUsLSTeMX4gqA@=aICiY+pr*BEyQZk*ay?J|!IwICoz38=oqod4
z9kW{-XT`R^v?tR)Y9kBqgh$Ts<_xD2-RJv^MPGr)4c
zr~hK+J>RAF-rabeHDDN@d)p#~HY5bM8a
zOLFt#l=!lO^Y{D00iZU>VJ}rq0H8D|DCU>$b9m`9V7C~8I+N|{cvD~w4;QwXhlISe
zl3nIFrls?@YExT)H=>mJmWoXqZ!mNOSQJuK>+8mztT(myIMRl&;$H#?skT?HJx%aO
z@mhV%m-Cw+6(5aLc2c(|=cOSC5(~WwGiW#FlTN?G0J}ka1tpzhm#S1ZqJQnS
z>W;FPn0k|7mQ|lO$97}h#vmf$Yd|@-u%$+S27Ok^;9~lcE4cCzb@Z&eAmf|vu
z_@q>eHqA^8$4Amz#Exr*)KjWb%QuFK+4tU?D5lU3#__Z>SRvEO=cPs;Okh~XP+4ix^{K|Zh#
zwZ4QBi8hiA(XlnLu6EdcqpIos*AdH9D%?B?71!XJ2s_Ka)Js;2M&tlYT3w1rwgxEd
zt>bxHrI%wFV&vg%0Us!fCnBlv;b+=UoM};hN(lQi_mOudLRoK1(gf)}4~s|Vc73xk
zu`U>F@)23u(0QLe0o(-1(5j1?;hq0t{uhAZUS;LXI3MhJ8oNZU@5?0v+!_I_mJi5TfC`aC5K-}
z)X62xTygfUA&{{L;)Zev?6CxJOG8;3!9@{4dE|XM+DI^ycb9+6<-0oaXUNGFv>vgY
zrk6^@VZ~K&{mSC^60KNsETgzRyoiVM>7$C4
zVn=ugn8ysC#b=F|r3oBtRh56>LXLi|)O#tEWyUp`dv6?f{~jYN>?xLg^*sDwZj{&Pu7@>8SViwOJ{D-=Ky2_1xR~q)k-;!HYBy~?-$J;mCkjiy
z814Od`|n@dFbX@pnUzpS&+k)OZy(*qm}n-$yhV{%R*7o{E8
z0rPZ7xQf%5)oL&o<+^Pq%3H+l-fu0_-6-r|`dfoc?;BQkq8@GWnkW|NZ@HVK2zHe)RVe-T7K7S2n(G4zo^Ub(pV*(_;5Ju>MP3!a)g;>@~x5n3mQ?HScZ
z=k+nJV3Q|IW=#(v^PE;AEg%2i_*{djLA5KEQLgvwU4|tPmfnaJ>k!d~p@koZ@H)qV
zeT%u7LB=5S#vE8oRHVl>w*&UOzn49AI=eRKQ$>C9LKV1Xd;Qyc8}sQ`B07Dk=r_l+
zMa6(8%g!O7X%_H>?k~4}$1j9<6dvTdF8Rm$!?OOsMSbq&Ev`Oi@1M0rc-o{eV0dJw
zJ|>&HNO4xtK&60-hRnuXx9`hk3J5~hPughXnJI$*x@R}y#%!plrisRU-)Rf{RKdaT
z94`>YM3~7P1fKR(hK`bTkmy~x!&^3j2)p;bTlrhec9}~dZGa+y8{50cbe0`PQ_jIk
zN~pKNUrsCVfGpyRN6^aK6tGffjG`P)*SGjv_2f=Xjfs9t?tbAh+qei2fh|UL
z|4T)-=riWKeGmQ?{MyxhTdcga&m*Tyv6;P4qd`rW%A3U4@87a(gB*PLZrHaAx7lau
zQqvWf@b+f0>SjuqS0Rzh7m(Ji(gk+idI%S6u{I`%Swb&nlUJ;MqPg>D<@A9t&W~9y
zu`*bW``fo}G*{dp&Jw6ZiiPX`LVokuOzuR+0@X4u*O554)P2i`T#H>XX9s%Wg?$k#
zH7WEjQq$oPwdMud0u2R#v`bieDb<;1x>9uH13-C?fKzLhdef;heEC4|F44?No@KM;
zcTD;smg+x$cr7u4sLM=AB_z)+^Sv!1-zXoq~$f6XYQm
z7keMytD_Y#oLF~_NlNx%B5dKQD-||
zCQ8Hss!)vLaYv_PX)C!^WHwH6AF9?g+W^qtp3lq>hf49)33|gz#1(%a+m@AOIPS1>
zo3~C`nnqwK4CE1N>((;V4YpFGBj}NJi>Sj=Gq+`@)=S_wG;g950YzzUOHSuxo}{ip
z4W6X-OcJvq@--l6eKlT~(QSA=M-SV3fK(g#=^3RD@j>3^cSMyw>c|f5Jb2*wGLDesBOl3ErEY(oxkW0T5W=$a$Y-x)Tn_V
z1Wj!P8dc5yz}yBwhn?gNFp^ej`I&ss9L%7Mv6J$mtkV~K@NhK15X8x@XO~~B=va|>
zpnw?sn*_Dfd3L7+p;S9Q8A_pDf*0&@_Y#}ixo8s~F-`VBmm!dipU8)<_>;v$^wT
zwShM$zWqSDdhhDfSi-Kp>@vH9GE#i3A=QXoKiADJ*{T5K4ZCx=BW228)q9)|@QZma
zf31H|p-)w;=&pQEK~hAnx5&HFq(buxeq-a-*!6SiI8u^G)6!sDVYcocAfb4=(>z+WwA<>#mFF1C{P@;!1hrE%8-Gu5DYl96d@L
zXwbh{nD7VUkO$IRp&vlrFaFa44EJT{_QL(cAH~`{5Zii_vLpn)k4s4&AL7Gf?~Jiq
z{s~0_q58j7XYzlZjQ#JC{{L*H%^yzc@5(adR)igEt<2Hn^R~-lK`1(jL@z>JsbpWC
ziFY{oxzB;!`p0BTW>=rerU$X0BWvo&r((+&Y8meS;t69`XqwEPYfz_
z5=*-*LW0Hu7266kq^CSR_FM#@y!qxW_Of!#231Fp=1M7x4?0=@LXUXuwSNfbR9La!gJzBTIYT8r*j;h)mj8Nm<*w!
z=3CV?KKRO+2-8yoIW@t*tHa>dKz#eVIt*^2-ppo+MQo4M5{*N}=aVkA3l|e*rj%H6Xcs+O`VerA`p0SC*8JlPYhMM?QY;
zXjCc83BnDen6_f#yl~nU{OF*{ax~cv-N~a5KpYppE9Mqa;g
z)WzwjY@TY;9rWj2uLpq*JmI2Dg|OF(&CN4dH-5Is2~xdJc>5VnP8?I8)lx};fLDdW
zN&&na9iS%w(!m0!;bINL+<@bzDzDd@WI!gMfgu-83J)CbOEgV6GhQhjV}7Ol?W>&4
zsn@|Aq5Ay}AYDuNsH~zQZDLN1KeF~`&OZ1A5W_Ekvvlll;v$k0^8U(~#f0sE?oO6kqbJ)BD^
zFuio39QI+%EGnSx%Ht<&sf46mv;plUP_h6n9A$5*E*k((n|D8K`V#7AwNFI_HSd5h
zKyTdLxfH0Mr*Xc*5-CL*P;os*^K%j!Pk7Y~K{~o4=)$*Udhw+tQuoiS{+B!Fg2%xo
zY#RqoCf!RQdP!On@l<(we5V{rG4<2&>YflY~AX)#k-F4_Uzxn(9OX=R8r
zr6~EWjw5wXYg!0$015roO1;$U{pUQiqBbF6d-QhvZfmPnWhDfK{a)0!J|C2tfZB%j
z+x~N{pQEd`an!cp4XQ3%rTR+ILJ?sF*3y2m3uJS;#usm=Z@!#D_(-N2)ssl}sr
z9#mo20}eKK=Vukh6>ze3tCDxp(;Sv<7K7N_qY3a(G8sMa43UyCrp2oo0JaQ?
zV$V05|3<9_SA)6?zt#Gg2bt_9l2&K2a#`KHkRAdIXIs~<$&V8#UH)+%*XX7j^f*h(
zR`%waw#{HkO|I@U4Nfwa*%kAW_p%=Is+|DGX)%g7nrX*ZuWHckLqIz~cF=vVPaKQX
z;qcLE6R4u$9izIQSUuvZd)q$S{7_;jz<2t6Tc;PZs$!-=dyvPSmS$V|Tg?1_dwVs^
zG-#0=RE(h*ng;e(Xqt>|3g&=2QWgm~7UKakmJi^I;Ot|zuaZF_&q?7nC9pod1nJW$
z&*i9FdN?mvlN$@=SwzCbiF2!0;Fn$gSqWMo1LJoKMjJD}
zxna$0_?Z|1C(KGMrS-gx+(EqmLWlVFaZ?Xs~D@QMk!Z;3|&wUTdo!Ss#*afSm1&DbY;A(yK8q`G*FU;bA>%sH7W
zKCFMh04CMHah;wGXf|=eCr_|XfD}QmWaIRNH$DZ^-Z#y|Zn@OT0+~St8Ak}nPlqyn
zo`}2&YgRk=v7hwm>j*bvcpSD`!@)jWt(%OSnzwq|g;VoqmiYEXrbB!N;67cty6;nS
ziS`EOFdsXh>Zre}N#sN~Bq;=@JvDN)X95)O#blJf_bq?5?ex5hiw~$05|ljtU)H|r
zaKzv4Dz%n<&Kz8F`}&pUMh9Cve8(wKziCVjKn6UTmc_dlyJ
z#smK9{r|sBpa0K9Z2!)ve?85AP=3*!bPANmME-fu{*Oww|EELgcc;iJd|NK0>7nZC
SEI=_lbjr~5c!9p-t^WX#>7>5^
literal 0
HcmV?d00001
diff --git a/notebooks/21_reinforcement_learning/resources/3.png b/notebooks/21_reinforcement_learning/resources/3.png
new file mode 100644
index 0000000000000000000000000000000000000000..d390d5bbbc93c0bb05aac0927113c248555b5818
GIT binary patch
literal 12179
zcmeHtXH-*LxAu-8hzO`CRlo{T1VoC57DNRLO^_lWph%G_QbG@?fJhUS-b9oRA@m+V
z0Tlu1HHm;oFOeQda#v9AIp>b^zVCg{9d~@+_{R5x!N}fguf5jXbIm!QXFkDN8mdgY
zxpqSk#B}-61#Jl0F%AA&chP|>9&Hy+f`2e)ZB=C`zwOWz_(5xRPW>DNp(7Z#ZtVoW
z8SY)W;S51+C#Zk0%i6~mAxP}xyoIZ5G$
zpi37rX&@zKWgZCHzaQ-E=~FFw$nVi3SqM6Qyc-4`K9b4|Ho@QuhF^2GxDpBkqxde)U&Ra}ir#Q_dob)RZu!xka2r9%o~Mt=WN%J&)=KL{boX|K@xD
zute+GyQ*=U4t93AYPMu2vSpl~@8-}*jfG)XizzIGFG-Bmj>j3#C#vJim=5WP-H<<%G~=bBa@nzxde_#f3oi%(rYuyIBoyguOAk
zN?2)>LmT6!v8%BP%Ngs-q@<2`5yuYJROU4MqDPf^V$j=y`(hfcDIqMHiD4T;k+_wU
z`Nq`;oCVlM-)|w%#o9PJv*Qe|e&?oXi2-M@N|*!sCg{93f#a)5u?uvYaWfBi%r1cY
zu)BD&Xt>J`QVTbUahd%TzI>2j)!Au!)y9L~@7j@W7|}s0x-BuJLd?eAX&_V8X*{IJ
z>FcK&>lg&|UNe(sxyQyh7S6pyV|)=AHO?J#+ksqZK+myHndhZrvMMnq?qj)+ZZ1ve
zu2kmF0#~h`T_mC^(Z1hIJYY(CU<}J(Onc*2*!QDp+%co?Eo5SM+~Mzr*%(#uA~9#T
z6s8;ty$N=Sl{ZbGr{P-knKn|QTT9bVH4p>k!ZEY@^zW}^(p>MZ-Lh}5eav{rh
z(j7jL%B%{tlje0*g_
zZGM(^vtzwlJ6DY2-p3#g1%oxD>&-W@=tKMQ7K
z0<&pt_G;?3si!FT9!_tmJ}o!)m4LNy`1Vm5^Y6pKwF?b6vnVGCNbOkfvQjC@wXoQ2oc28kfC
zmcK$s5t`LS9)J(uCo(G0fye#dBplT6x?59Lt7C$*w)!+HY);d{lJbJV?-mG<3Pxe8
zZzZHZdk?k5_lEBtGM&|-zjO|GMs_E88G(1Gog3HBTf>KBNvyRgy87(_Z$t^rX(A``
zBznbv+qV#nMnsI(i-E;hgIf`UaJZ-r>(DD*3nUT6Kbo|A(|VpV8>sDfixs@f2A=EW
zOT17*nraVIwN>ThzIR@*ZV~u#a(6|fq+2)h$`zRxZpt+ct6mV-Uhsg&eimux52XRJ
zbp>G5k@&&W*U~)=9UX;rST%fM9MZtFAyEQOm{9d@`FOOb00!EVphZ7z$Vv*b7*9Ts^
zoisS&?ZKfnS4q~`Eq?u1I^izmnhUa^^;6vUb5y?z*rXmOjN`_yA-`vwy7Z;vyk{XE
zv#-Chja|QcvD=G%5b;-%@{1Wdy)WFHo`W})SzIzAtwvh~a`y*2IA`0uDgjeM0uO7o
z9KSp)X3nZ3$N!PGP8?j}Eym&W_@POb?T4t=SDVC^HMjh@Yf>$>Psc)SB1QmhtOBXT
zXwh@L+JvK`1rz27ZARcjodh3Bj9*|BD`DA1-f)VG?`O;VAhARI0B{xoR)R)A2J95v
zTRYuzkcBTQ%Rf4Z2exh157fn@gEcbCT4JAMJlnA+2`uzlR1j~8Z4l9>m@gU)v&y8|
zlR~|kRMgF<>h&g6S3a2j;75E%DD+mK8|JAy{dx4cduPWrYd~bILq6~0E6G9f!;g^x`B3p8SAtbL&h+@%3`-zDEXZPdgcEhgHi9p_cV
z#V(2!bf(E$yFEyD@jC!aa!k(L(v*|Cg4NX2p&;AWORzQ<-GO}=4t0WMGN~_MZ>C#%
zn(AWMGioyn;&G>|5LR)}w&z@bGbe`CUD?=};g*M4f`DtBT_97FCdHaEFL(wz4>rz~
zQ$l)^<|Fm(8^#DPC@#D<+PyUpi@Mi&wXd7L%FYW$EnBXOlVaj_YYLtkMBFC7Q$93B
zyBRQ)BbHl$Fu~X27?zz0Bf(dl-t@v|(Q;%Wy2}KT0B$IyHXniMHPjW0ma8X^vs;-v
zI2=*jqDDef<2hQ-0D5jxbkHH#YmnuIvMRF`wW`cVViPXG3c<4_%eQoW_NFtd8>!P%
ztD)>lbovRnd5
z3CvHU!6$#%+nmy<{Eald$=sx0S`EDhnSgkQ*x!+e#>EF;%m7z_A&8dM?>t}Gi(@aD
z_AfBJhK?zLkE$!v3%`oC^0C36i)0p81_u*zS<5w*Q(!GRpY?U_wAZAyx2s*oC3=n{
z$~hbr8Hia3EvTV2pk2i_Ob?JIqZZPl5J9TY_P#UHd$l=Caw+U
z6IY8iGnMBGX|}Zwo^o*djUyyu;_6GIi}V~8=9+RJhY{xH<`#xa5lb>Hw97+y=7m#T
z@h0Tm23wQfalM~>kmYKU9VQj*b4G}<>9@^Jlv`_Is_3n)7+i<1J{Ed?I|B{00S3vuVO1ml9?M1~x
zQ&ZE&;%|Htj?oO0Wf&fuQMy#k*1>}Hjtzee7M6Y(1v`)Ya~#PgC`@BSTMR>KX~hCd%v7s%%S6(_zl0+Rmv1pT
zft7Su$GIGn>8U?yL!7|1zJ?$r{I0)z;BQ&y-$h0Ln=bgrLLEYSN!+O5Rn;&K8~=VE
zYj_4-UH~^7Kfyi5^=M>jg=qBq;Vk$~isX#Bp}6o!_IyXJx9B%fMIn^5Ql`u(7`kto
zAum^D6QslCC9M16qSG+4-)HHXNLdPLny;j^`T81iq5sDTE<;?j1<2^N;@;_Y2G_tZ
zCW-M?2lPMYFzuTTE2mSvD8TCwBV!nQ(z_&hljauBu0#$MWmn>FJqI^=fO!9A=5}J@?`C+c
zL)p?d_Go3SSvDmhWdlq2(vBliVa
zJASO+S!Yp2d8aJLNLezj-A~B
z6}@I={z=*Tz+6vP#@7Ms5`)k)N33ig>|Pp5o>a_xTX6q4zfK2TXmPN|)Y#uldf^
ztC@L*1A9s=2_t#S)+p>vD&4!Za_t8rxOCV84fY*;@Wy+
z4R5X~WPZ8ky`eeO?yiukakHJ*TJq>7Z7NC@MriM-2qu0!QC~NC*Ocs5%pNePKgJC|
zzQm=RWK(HMKu)@_yIX$U@rrd1KsC{T*|0?^r%mhihe2%(Jphu-|rI-3nlE
z4YgDfF*SbjM&i6@*+MV;T5gG)BHTrsO_MLoHe?C2nX!p$AjHMm+lx^~t_;FGY{XH=
zyYGI*YZ??Ef+)SUTc~{s&z3bl$?Xi>hju9%#Vu>R6ib|^-DvP%|2hU>=^-&pLHZ_c
zIho{if263~jcDdwi_$BAKjCxOkPeAqonB6_dXhZ7ul4CgQqOYMybzv{hF-5aSlc$^
zn>&v!x7AJBcTX%|Pj}248RGuc&YXXAtoriSPm4ew(0`G~>9@Tvdu$)K*SnG7Nju
zVsTJmHP0=oDq5+n-y&{_}uwcGMSyboYizFN5f56bJUlUd+GvF
zzu7ICox6`%uAMtXZV=w(+B-KO9aB+s8rdNvTW#E!HcE{RINjP0Nl9jKt`z+zt~p+>
z*IJ%`OF56V%{}*;lD$GO6*BhP1}cy6d#9`GZ5Wn!4jF$@|A**p^!?dqwv?@MwVDKl
z@($ZsUD-ZdJ=S6#OG@#1-pI$`YLY*DZXy3Aql-bxYFj-W>gpETC9Le6sa1v*fntEaZ(0e=7V)-@07YY3J~rJxmH5`g>4IsooVpuLc1
zc27qzVW9S=0$F@ZXsl!Kno1@p%euAh4z++fk;|6{&fCS;eKvU1sTe!l#4lB1{z`47
z*iEu(KAKNakTsWD4|bPVoGNWj!BfUITDg0O>zW5wv*m_3rC+VAmP)Lo^TFlCiH~Gy
z3KbOLimW>02koSFd22g89nGzN!*sUA4(K=0r@FZKdho>Ai;3+q32Ot);*Z#E=G_8
zmV5;Ad%ZLJ=Bk|DgeqkD8Xr#x(8?(My@IrAu_iA@h4)?3mtXQ{{9Qs#O4JoX0vIylq{7I5N+2<+asdfF2zo&DG3y5(Z-gmKTyrkV&?zRYiG
zep-rZ!i&jJnlgPhhSm!^E^2QGX4wfTCkZ7xx}y1#@+L3K_kTJu#5R4W(i81V0jn`xYa)T`%pw;EfDt6L_P-PnXK&M{vJCcNSLP<)1b9RE?3`dnUf=ov&YbpxFQ4@*W&=c^
zEq48}D{pGdbb`X0E_-!((4@9iDlobX8Jo2@T{t$N&L_u}646PoGqJQRY-D0UpW4R#
z;*@B}iO=@V{`f5Cz{-sM%4fz19~wGrgl9<0b{#oO9=m6bbbD0HtG@!=j+BwHnTxK6
zY*tY>=fSblK6Z+1+{v_En%Ftf+9=Cok7SXO74X_=-0{lhzWa;(y1Z_**Sm=8+tKWk
zW%h`v=A|NR029D2b>Ac^&S;2!Y91wE&)237GTfB}xG2!UXnqqepY&p1&!F?+T0?m*
z#<%^0Jr%pYqVF#Z
zJPiIOlZcUufOGr0gyLn_9IoR)$0`^Tjp!cA|&@88H>
zvIMho+`>jJ+Zo-&*=@Wsm}mtgxA?2dNJSyxDP(upicVPzBRFHs1$PyK>5p-V7x2!t
z;CsWC=UFdQv7D6kk<35UMj@xPd4*6pGu<_{RbkV_5Jm^_@q>RZTuFPeAv(@6qgOnn
z4po`oA}FcF=$}X5R|romADrb+=~WXEon2de%xjOeEikpo`?fMe%w0xA%;Z1$S?k_y
z9bOA+DGumaSKH8C8@WKKfT#D7zMy*yWS8te;WJ$nt88ak;>i{GA8fg&G;bb{1ub3X
z()sUjP#|=tra9%Yewc|^&L_kwuHG;P&`@VQNd@H9h1Hy;{Zvz&DvQas^O#a0>hnd!
z=qtny^O+_oaukhQu(kky&HA(`oXT7`GZ(uUB_EvI6mS*sak#GbVe3(n{&TspJ5f<9
za)c6HSChlQ9p*uFRim3Jx-a9K>16sIon(=3b|1Obm-#qlW7rmTp=CzFlQJzv>
z-ijLYf@0RM`(_+}=k)DQAkls&r3&kz)=j
zyG)Fd)4=k+_QU8J;B^lEk?R^ncx8t~BsN2w!oKQ5ADb
zo{>Mzug^4>)+~Af%eyv#NW;P-JFwaWBy-eB)k=;JAKdSNW9gttx*`DmwyAn*&LIfY^x
z30Lbu{e{q{NlxG0lb%NhW=pYBEUGwBAaW7Smi0Dg)vag&MaOt-{X8LljkvW+jke*I%Ky9&;_(Nm`L=e?qN{Ut&38n59j9g}{^O0*)rqE`YL
zQlelP5#j>SAI&n>J$U5;Y$T4>PP?mFF{E*
zD)_+zwx##}%*81J8zbu*Qi!^+G`m<3EMCqhuh?c--F7<}RjpuI&ndm&r`&JX42}rithlaOO
zg#3N&N0iizE@t@uISjEF;ZELsNFia!X=dH-mOr%2+QPb?pmVg7eK9+j=$SCF_4)ZYNpZ)ktMFy7bNVUd=5s
ze;W6EPLg{$H{Q6COMh@YFWI%Md`nQ#2WeHD`|Zt&s561;ga=%OWIi8D4qsLu-6)
zc<%!(sB-2EL#*zGJz+jP*xhR3f*NLAc7`5ce665U$A_d0gxgFlUg!@E39r5KnQ<^C
zNgWzYy5~;5idh=6jvc5HHq=|{+&^95t4#Whx|(#`0(>Od=om&JQ2kuyTOz~C1fbmZ
z3?DvLWBVDy2X!Cr_V=HnTk=fvSfFgB%n+?=q`!O7P&-TR^k{zUrGaGA9T%kf%FsNFvb8}9;UbGkX9+Xi&
zr)Muw^3G9wA+3j6^w}*3rZAp83%0sER3chj{H@6V+YA)CKb86VO7>)HA?bJvk_B=H
zE!FGK6*`eCg%m}-9@u-8Od6wSQ9oXAFs(42EK9u#$aNnuDlePnY1~ha>~nuhX=v!Q
zP05@XMD4)+E(~}Xc@4k%y^xY$ih-VjUd+Ft9>}iRppZ@DAU_iZ*O!qL5|W}Br_$Bp
zo5`_4NrCzg{Zt|N*A|d=Lh<{imCQ?@KuSK!JYQtCKEssQlzAo-W`|7e+F^&!fQ_L!
z^x8`W#=*j~
zKXNM@kGGM9IKet9L(!t0Fb*j&d59WXaW={unhXFlTK1!HBn6YTO#^j$>g$$!5^mS<
z)z>B?QJ3fs%Yh5ngjeNy1*kRrG4TA(?GOEqfJ&{2w8l?T!Yy(ljng9A
z`(fr&!uC!c+;Uo$LpRz=2+tN%jS@lPns&;4ED5^LE0euFp&GP_3?Tis)D?XP@*
ziDcP{OvK#ae~Bl%xjV-7hK{yMi9eyrFzlk+<^EcQ@!!{;{F5dA4-_r`PtE!Vkn-nF
z1I?DOpDXBV&;6u)$HvO>9#$2k$qlg@v+RNpa{
zmo$FJIE?;@jN`NmwI4xc?WsDB;(w{*cn!*#Ta*7-$MFu_huQuU9mgzaVAbU|b;dcL
zUb~ogN|YOjMZBG)!xKS6>N_*!0n)d>K^P;bk$nAfCQ*gnW7?luP1$Fl57eyh#XDtc
zDKxgKi)MbB2=aK*n=>B(=)oGmol(`$VQx_5tIoAd+$O7dM0SK-b^2P!tp<|-SeaCK
zUxDoFwC>f3zq2tqaMra745x}2*mImlQuKpdDqEUvR@}kJQjaWE3W8~^5F_bQFT-AzRmz;m?Jx|42!?Sx+W`9wK3Y%kJ
z8?2n>mg_YNV#>Iys0dFsP=qANeb?E%T!+eEN823b8r
z%i%{$EmegPV?jazRdQ@>tC)fZWPU0iQ6GB>#dQ||xx^f1aTl=oXH=>$F+tI53t(Um
zbGN1sV^uA~pP~{RgmzL}!CH+rA|!(m{?1-C@5+o%HT`F9S}}$uT!95@{7&OZmyeHM
z96L=q7xvtYm)azU_JDYtIT9mqu(ZZ50*-;3e8ZI}l63-VrXxm!j5Lri4t_9qeP`?L
zjGAH`=pc(Z7N4fS{Xs~lxO
zPRr@tTxh%==Vg@wI-@dcRU{9FoOOjfXcM|L5cq)!zE_(-lH-GE%90Jv-5
zAd78cisChUB+;45QUV;KwOd>s7${O!=2=ttV5-SfV5df=e!nk^5(}8_nC57`rRYE@
zDBsvpnV;5mhX+gaba{Fi#l_$I&bU1%{}1x&+LCosW#ZNjaSw>!NQQY0AeSVPu&HY{q{Ll#68ETID(|m@hfHwDbF`(4mlva`!
z3Tgc+CLmOq`O!no%;N5#KRx;5n^JKjUYEeI7B;tyiuu7vCGcHUsKOgf8bjap*MJ@7QxwvnIR&&WRMUTc
z|6fHMprvY^`SENLz;}e?vARXDsr2ka
z9R?dBD?*wlTt^vyIuaEhKW_R@e7r3!3e%O{7;moXS%++{Wi9zW6Lg^0=ihfJXTz%vi3Q!LUObgqN|sU94?S5>y&cVu;%QFu0~uMCCQK{EDsAJsO_
za)rER!!f?I#q552{#hb3YcY^5(z18;`m)LsYTG=kxj_FjRkb~lRkG04@1eIrh&*=L
z4y5+aH^)#VtJy4{yHlrPMbB)DZir^Y{`xBdb$g!FOhE%+8@2tM0ug<`>HF-pjNmJX
zXMP$A(n5>^{cXbM)$d19S?5*9UORWeI)TsmbORdI;C;2J&kxw=_X1ck1rWONXL^m!
zpeJ9j`xS#L_^=oQ&0?l>?kUDi1^)wwPXjf^UaSE{q%nCV%ZL7=Y~3zk8Jccvk?-8H
z+9&2W&&rx-Q_~1N=%k}OX0%{U=Mskb1lZ}S52lyMv1uUL1h8j!i%Wo`ttF4&?L6m_
z@{n}nCFprE(%<$htfsHK1m=4ydEb=cR{*K!Ff4w#q7^Tqgs$BMZr0!e@g!Itmiy9a5mCYXf2gF3&0ssI2
literal 0
HcmV?d00001
diff --git a/notebooks/21_reinforcement_learning/resources/4.png b/notebooks/21_reinforcement_learning/resources/4.png
new file mode 100644
index 0000000000000000000000000000000000000000..7bcce4bcf8f27cb8ecb2c29a1f3497e4717db143
GIT binary patch
literal 12520
zcmeHucU05cwr&6cE2w}p0k
z6EU;^0Rlwnp$Y^D2?=kZXWxDHJ^Q|Q&V6^hamO9+4+ddn{nl@-xz?IZF$cEq
zGlRePJ-T_<9RfLadiRGx{r0Kv5Xi3q>envc@wQkT?Q-bH)qh{cpiyV!CnJ-mbpz{h*x5Xjl*daRH@#(n?SA7w4>^p3!zZU}2bAl-_}T4Rnyc}x)1peG6t2W)1Rek^}TQ%P3iTN&2ww8QhF{!#?=njVHcA@rG7QxEeCO
zHP~3o1@3GUDYPR~PxjW8@yf2AeWn-sp~~RO=m2*h!r0{5TCCB;_aQN7@>Cl#e;6z7OYIfaYfqSa^;DLg6
z7(!&MLAqr<4y}spA4d;~4wklVZN-t;Ai-Dim^?-`zL?kYad5jkxegUgy$e{;NO}ME
za54CLienC9UU?LErnLD1yHO9}!OEDHmev^8JcMcYL5xY3<@x+b(+Vr_roHgM^RMde
z7-u!~MXzffATmR;!5K5}u7%5rEN63{u9f3B<5Kg1GyTC8ffLo)bt^IaG(s%>e1u#{
z^VSxLF>oA=P-HkxL3VBsJrXc=%CH6n)%+(4syB*Fq4716G;NcL6?W5mkZ`c*H+P>2
zvE1wEC_Lu!rNu0X;cTcL>!&REJjzuO6FhZUOM#gs{StUl#C9Q37!$Cv{z_j;S((?1
zzMknDU@hL7tcZoiwWC7~`x*8N_c6F7A`@L?(fdjH2Vz%7`+CU!J9;u$yh8zw%HC3w
zUpn6T6lPGCTkYU?n1C;2)8ouw)5B|+p?scCCndq8%$X^(B_xWI=VJb2+8zJYa|7H+
zpV>V;yY42iHeRr`>+XIpOD$~y--paJ^N{hie3(?4MSpKnlu!y7?I0LUe;!A8laj20
zn6{0~@p0d1u92qpnsqHwokf|?KGkE*dw6;p(f)|T&Rx8I5H+o{AcU!4Zq?3)v>cv;
zz#j~6Tv%-%({}dp>DP-y{5Eka(Wb{lLl_cpx{qPbo1)ZIm5{np9TVC+QZxNRS9-y~
zG8%=1Qx2%^0|q|izA2ru(YtL6%YFYIo5rJ<*vfYdrKHONp#f7XR2?loV^@?xuS$Sp
zMtQ3m26Nf;_GGYi=W9{1Q2z$do{%JjQoCfnr{rnb63*PKaY%2Li{xvGq
zG`dX}yH}L}tie`^K*r}NcwUq4ppz0pCV~%AZ3SsdA-4k&;IIkUE4RBMD37XWM2C>q
zHHuNP?=m>kcfcIgR!45~)wzfI=J0D2>UEXT1OxeYo737Rip}0gfa$ng5lD}@;mMwR
z4mC|x?7b5jYYCxfgJWt%^ik4)3CPD-f9YtKgwRC5nqExVwx0KKxzu!b4nj)SWBpor
zyk~E>B^ZLqr&F(MyG$5T1va^?|ACJ}enlU{kAY5fG(|(cO8oBKNtkHcB@I}ZQYvS9
z>F!y0eXvW`G%ul>zwHpWXI7m~s5ZEXApmkY+9Qi^F~Q)T7x{@wdubUK?&&m{`ig<|)pLU+w9
zLm<2mfy}b)KGd;7w&OcM`bNEK$CxeztafIMKl+8tu^Hl;v6e6-STm1_)RRdt>b;Yc
zy7Ix#Zcn=8ZYR$}ZJ~DJ+Vh(>b_Fms;GB*G)`QIz;EiU#pmfM~rF~6i3hVnOC5JCb
z&~wiZ`wL&~8A)wbDA04sE$LEpa50&`+LzQ48drGObBcUyfrZzGQt
zXUkZxC<)q}MXE_Nl2Bk-cjATlq*2*vQ~6ms{tRb>cAIiyE!_lr1}W>+2W|2cSD4fH
zfB|5ybP!up_p1Wt?&|oj#XI-Hc^A*mmxx31@-SA>pU_2WFZm1opv0kPhyH!BBlG2cP
zUlx4W>GPNegUJ?l(JkCm>_JrmaMQzXPRRVX
z)-bEp=~kVfyFWa?)rBT?Nhe{{LggZJ`U&99f_oZb9(rro_4*!;waFcSz_oyzF|xRO
zztp0OP|QfPKdz$4QSC0!C`cXe5Z3WyI2#7s;(c5O`2w!M_l+?#bt=Nh%F0TVZD(F>
zoigsoHOnjLNKzAY+oN|pgYCKcfWdm&Ft>j0=(YWNe4s&-+pyp7{vP>`aj@yk
z7-7qnJMGkdcEN~cM<0f>NnmS#>%r!Jj$EHM)>52eP6xLj-67<55%=3eIRZreP^BQw
zdl0atCH8F;qw5QIh2*`pvX~AY2BR@8D$jeJk8JU_6!ifC6v{(qo%tH(HHb%H20x<)U*Oq+2@l{fYDi
zlcWLdZI6@MeLW^7&jw1_VH@og-lQyAtDATUtq&C`gD+fB9B&$JlJdOgBD2XpgiO07
zM~|d}+^>Dz2M4@X)sYM~e&N3RWlnK=)$1VP`cGtmzq^nBd>;A#c7*@Rto4n7b4k$U
zNswficHqui`P
zqtxa%L#3#bca?l^qx($|v`Xv_)3{)_>$NZ`TRPBfYn*Mhwf2Ftq
zZV)>oWn>PEv9y5QGKUrSqiBU5sn_AD;NeTHW2GgPw5uu!wdqSJYVoSinLc98Kyjgz
zD@-0eyeGpV+2Il7&bMDT=gs_gI)dfeK3W6UgloG2yQ4)))swvHKzIH!x+8I(F#pwk
zHmV>HO1|a4wTZ4*?Yx#(dSs3?cqIju7=|bI=j>e{HVdAzZ);+gu5eXUOKJG<=|j2b
zDK>*&UW}#ZKJgWjkZ4{P+Ic14NqF!RSX_G3He+z;N}2pB=1Z}3=0b@?60gkY+he9
zH2y>ZUEv`|?`ez(vhG
zEWbgb)|jTWBdkzPpPS=w+<+s19&foqBwmM;+QkCQt|Kcp=4vG)hr}~T87~pOxUefi;Qi$C7QQ9vl7fc7FzN|z%T0LVITHerlh94Vl&n9J00+*(V!lF
zZU8IBBfc0%dX+|4?s?%^glB1D*$1yZOKW*$@u5l_X`AD&^gn@{LhVZj8BF;#j`w7qk#<6S<{M`V1#CW$fUYg`r`
zO%7XIvkQ4`fRd9A_GRyLGo$MgK8!CuQSgyC)QC~)7Gv@s?pci3DAcbXC>CO+uI0$`
zL&>ibs}IjjVpT?`76M1UFqzAJ5({u*ANTXKJvV%1yNp@zwU=LOfr5)jU;FET*Dd#R
z>l$mfYJzq;d&i^X3+G-s(C^co9_~V*6_`}OHxTyHwy`lAe)=}8R_|-1X-%)*BHU#}
zT~77o&rHF(s~ZbgCS|;!YxcfetP`b&p6U1dwOP%P>v`07Y*h^V&lwFk6z$@q{&g9+
zj0*J4nWVaqUfm*x2D#cDy_95%=+YhB2#(dAnd-K^T{s_!M@r?f19Ml`@4(h;Klr@JGHf&A{d%MnI%sL_k8gT
zSU4LFF8wtL9bwQiwMbDQTWO
zR5<(HQ@;LmgJe!YVBY$WWeWxQfKs)cNmxW>ix(46+4=4hZ3^30$dUm5!Kl3>@;Kg7
z;Xa>;IL=x9s+z_d4+l3c`~ZJ7(Ey>rlrV5{6`wSbvsdTUN0;Z8BFbrYU+(YNh8V!f
zVi}#5*~+tvTpmGe$@+fTrSiut9~^k5EamxYRAX(hUxmNad*>f8f_ERBF0TAG56xs_&Jv|4QRm0Mv#Kw#*VUSIntFXNf`A9CW1}p~oK*<~R-!5Jx+r
zZJnpZXDsp=KW>gIr=FKMhPQ4=NzN1^m)tK=xaMF|K${#LaNM3Sq_u5SJJWa)
z9Jr9vxmj3r>}-Mn}x;7TA|_d
z_F5x(ThA*dGON(vZ(KHOFW?Kv)YEAy*FT7CwfR$A1vpA6Q
zTU!EL^fPM6`$(5`NnUdxmKH87Zcg{YJwf2WES`60jj2IJR6D%S%c_~@i@K3Cu>QT;
zMmT<{wY_4wmZ2EUO~aDMKZ!5~Ty0oVEJ`~yAz7(l7HryPe^V)d$9)qSfm)WZF}}PL
z9Nab@-A`Khb4Ye_Psq7265C@cyy#b3l+$o)6uN0TwbmJ~ko=|=^RO1pJu0d|x~+JP
zO}A(#@si(Zk@KCcQ|GTy-rlrnj7(GEvSjZKy$w$d^SF3r_ZkN|Cm$fGN3(J}sZ#3<
zOI%rc_54-fqB(3i3CyN6rS+Cy`A>n{x^XOFW54fy54nhs@l#u@2kSF<0Ih_T)qPCN
z>0dlw&tuK;L;=}fuTJ`s{|**$fOFhOsdMc?JT9;PL}`iFN~2+q(yGBo%_H$vO$rlj
z^*b}(eDG98Cu&iFlb|gS)wN0O*b`n(QWDITDop5tK0;UYFAQ_c&42JUp2I>ed`-@-ga4|-to=2NL<5tmXuEc-1hMYHKPO=wIY>~sYhH#&PY}Y
zE&YaItP0P5TFDw{o)U!*D4v{pXKwDov3L))Gk^8vS98D=d^6Flt>gokX|v#&E%e1o
zuL$5Op4yMr@}BAHB*}MM8Zc8^#9X~sbLJzpkIX@?&trJ!A~t)4`|x55L}eavF;_G?
z95Hcu+E3wvy>ecObm6>X3Rsv@lhGBx3tIvu0&t|6PP#?{rHRme=
zfl$pt1g=GaDFgkR#js(5m+AXo27Gn__Z#Qb47yH->)j_nukopEW3!l`z`2zb`{+
zkM}KgcbqI=luYsbHy2%~#E+D5ev#iX+Naq^yW|6d3{slS(qE6ZMc|#NgNBzgHTC$r
zT(ZJ-qe`D<$Jr_e@Ib$8AFDQ%cTpMT*2k*BWYDkOQ!S8Kl({ChHVmbcW#{KG3OH5y
z7z;<}Wl}6!isA}S1xX~zW04QfYiUXT%=E(I)@B#YaO*wC94SOk^0C9COrMdX|3%c#GR=QaAbtk3|-S$0ySFXBXB
zb;7_lhTPVobmm?pz}^G_pNCIh0F%p4;a6|ZcX_r-9y5a>Mfz}qnQUd(i&a!;AB`ik
zaB%T4at{LjSqh5Vx#@z3=he1Gw|M*BZ5xlONz{~n4_LseH=Z|Bh7v6L%8w@?&2CDQ
zus`WfJ~qLQipOXN-IR`1=9OUb*tuWd18&$6PK{~p`yhjJUC!WwQe24lx6S-y11z;t
z$6}`s7DztZnILzPUb{y$D3Kba=zuJ9WXeQVkGBnU1D0pl@1t6DjCF%cq0i*<3Vbv<
z4^Y3s6MS1tfHD(UP~fz*Ha4gpS3~YbG`OzG5Vxi0R1PMChn7VzpkxvPq`oMCp4yRu
zpURBT>t=aasC*Ur_z1pIzdK1yr8oeVFB!5W47!*v5R$)ngo9vjx<=~2&Qw;TBZWS_
zlcPxjYH}EF>Hl*irw7lTg}Y#|l0PE3zTyNRHwU?XkX_Nr5%HON>HsGFde}sFZb_GL
z1alr*&`?viq_RHcy)6^v=yyfn9{yALFn2k5(kCfB!LUG27nYmUWR9$buh9hT9#g>6
zs+S!lCr$$<;%YHCVi?Khk?qlU2%>dLgNLk+jzv@xhrVn5yqul2@rx(*K7ui$58E#S
z=K4+k6*BuTh
gC`9KS2kk8DvYeg$F;O9QldIG1)CTBI?4x4@LmpFhXSF1ft3G3
zD7x}K{5$jsdC~Kk(GmKEWlYx%$aHgt=KcG`tBm^5u7^PbbK?&gIeUpvde9qY-@TFKd~Bfz^>B?5ajuOEb!!K
z>6WXXP!$y-nnyjSCmR%J1%mX1I{Y?V8Q5cVkfo>svKRrYsh7bSBtW?K!Ac
zUT?8O(Qh-r`AZD|Z>Y7ZG@Hd=ICsF6J{~NC>kDRWZA_l5ZAH*4`Ok<;G_%WH@W;)4
z9plj!1`L(i#&Xn(%z&GV$o#80_XO*4&!l<2tp}xgxyTFVun0NCm->RS+Vw7R+|Z|`
zZ!&<@{A8;eDWC<@u9mdG9G!PhC>c-*vKs9qU6*3GUX?Cl?P>wJp@-kaw5T^=^6kdk`!%Rs`hasLpQ
zu8}M>{1%P}c>)wUhM|0xHhBKt=yfsI)!tEU0sNpnlncJ9F?O>wQfM$%WEM5uCe(5>
z$w9MxY^6M}uuG(@GI9xc?7N3rX4>3}3rg?4m*d}M9=}gw-l;SkF3vVG8Zs)_JZ7~t
zfVm_u#l^9h`+8dCWC_Sf9oJaP^qv7Ku0zZ8i=jceijU`Foe_V7qhZ1PqW$sC%HQhk
zE&Yj?t=@l(RVNfH
zlzeWgE7UqdBaFLEFB=vsc#AKkY?v1&D82HC(#w^5;(Np_Sb5ij&nO%%C+xX)%=DMO
z9l{PBYUf*r+*#hy(BGW&zJrwaY{hZcyRm-~+Pf}WRS)rKqGw9Qt>M=KDkO8RQL6gc
z)}ZvkJe?ghhDSL-yvsF@nMa=P^`g@O9FFt6+s|!FR|VF;$2sNj2aQHqRsm<=r=Od>SGhVk9tvW${u2R!C~Sbb+oR&yOVlY+K_TlOcc%(Gl1`NJl482u}DIz1!c
zI9`5G6!Lu_JJ7Wt=FvszgX?tXNNMYhR?yPlm0xCn^}PayOSqUsMCN0lnS4
zV(DW!kEGpHX^mvgd4H~%jfl|2ai?1;V%?UNFD$DOMb(={+!SUI0yi&_~0hNpM
zTnuy9&)sjlh3B1Uivd2PhX;NYU$s(lbGgck*)Q7IlJ?k}a>w8N(;Vf~hsyacQJIeM
zz-$p2{c#P7g@yCzn6MYYFo(ifwT(-B&iq8Ltesn%$c7B^wk6Kkw-9I8U@FT?NcBwZ
z;RT*k7d+u0)f3U*&>x5He{|^qS3vT)bG??n*yUuH0nO5|w7fmd9Ofi^u6vp6%q9B=
z;WzI~#$$Cgy(T7yhK#i&6?*~ydyYLMuSxdOUc=Y2JHH|lUq#`INru{swq$&H4|!y4
zuzrq2N&bWX<=@O9FB)a-^w4&;P%^ZkhMi4h_PNr5?a{=#9aFRV>XfnkbQ<>8JM52dH_Fwi6%eVeB}S=-lZD^
zJco9nPs^-LUczyrB>Mb79oHQ4jc=54ObtP(bz+Xs50K8&Zt&D%`hw-VPT#A$#L-V)
zkis<=e3&S?E-`Q}TF(*|qX5x+aH@a;
z-)XmC+Y!w+@Ew4ffCl;)#&$2$ihWQN40;EKduU*?yN!W7+HWz3zV29
z>`)teWfCnmueYS{ARp`ePT>YReca&Y5eI`;oiM$A?>o?Jq}$fd-mBZ4deqB$TFb_9
z<1>8>_acV(i}e^{hdRN;9i_8NOiZ;yp#Z3sJw@AHuYG$KF}fpINOT1CC`eWQZL3iY
zh4*Q@Z(ec|!_uNBwNGdto$+f^wgI`nF2Br#hvKMyUzVKF#V=B;9dqAp!eDYvGId71tNWTd*xxZ
z{0$#ZZyvd@Bs;m{kD`P*JJN4(!z{QP`lk$ZUsX;ma;f_BOe)tur*m=d
zyU*_mi04TqHGwMO8Bo52Tn9z5e^iN#^F5)v(k?H*d&LQX%z`q}U&%@SPp#?ywt#&Q
zh;MFadverY&vTt--TNnDOR%^5p@S!P)hR#m-k+s5|1((a|CX!xyDA02q7EgJj<1(@
zJ)sSJrxAo7hy=D~q&BHd-rF4&-gnRygC-W#e4|JiFTfkT9U
zB>zqM^rQG?QI;K-NQIxTZ&*2T3~LHXvJ-94C`pwu4vaZbg3<|uOPv5FRjf0oqvDT^
zZ%_7L^4|a`HjGpS)G9C$kxreK_V#3{>X&0I_tHV&L*y~iDA}IrUCoPfjt=ej%hykc
z;6EUUCHGi)eg$!<&16k(;sMue!ZKVEgfDNGs{I>+*
zC16aV@!X}&?V1OAG?pQH=Rug^0EOXI4XuqwTr_RKoi{g6nwPWbT>{_09bO=6vMxAd
z!U3AzO%sy;EQkS_lzHDj2x6{*DVc)Y8tP0!>Qd^&$KD;CnZX`_g#XgVP-}I%WM{?^
zrQlqV
z95(cu$Mo}8SGM}fd2~zd?J%wEod&G4!2Lc~|KBKLc5CX-TQt7Ve)}fUJ8kS=4#lJn
zLX&I`L?&x601}UB^0^(pGxtggdgR4>$_2}4pY1bXjML|frIrKN;+CjX9XDdNP
zP{Vx&KtfEJQ2@{d^b>(T1fbmuiFK%NZxg1~_!Xz7Qa73hKgbM}?SePREb9;H572m&
zhfweq1F~K=gLEsDf~qCh@7fECx;x)#pxweAv3)BttHymJpNSL;!~tShI*ENu6UJ5-
zGSMXC=`1e@)fR>*12=DPPos-#!xx@e$J2~?nYZL{I(e)os!L_+5p1u*ykfr5E?jw_MhBCe5XSN7|KqAV^%EM
zfmKh!}3up+=8KD{KvsA>%&;E&3Jy`+U*Gh{6Sy$2wGYj@;pJ0b+8~ltu0e#KsZT>~(vT7XF}
z$4{(aWsc1@ZHD&lQI!PSrtoJu3`lu7zxFJP*@vQ}~bmOsiu3~{qh;kiF=>jKa)UGIV|GFVx#CU!d
zBme+*`_HT0nTAVxmSF?2G6P=^!UO4GOJt|yHEDO>#`h++ta+zoYztuac0<8iOT^V=
ztW`9XJM`64$&B5)yG>eW8YY!qm#@#e@OJ8XjUP~jf;9|ZNZnrTXmX_XRiu#J9=Sy3
z9Qk3P%KtzOLj$?Sg-`BRiJ(g35V9z_#y;W6r&ujb-zeu-XP~HqD^!MbiOemJ{^3ui
zI{}7cEfQv2`nQ7{%6Yv@5RrQE{kLdbaG+Z)|`lKpk+tu~0jzm2Oz*WYK&8B|Yv;JP%W7$bpowumFEWG_1#YMaZX-d%
z;i8C$i}M0!V)oNZP@mHx(kZ0gi>vJxcnipW0BA4p(Wu5Dxi3YES*3hDMRg26AOa9m
zaa)WHE-KDMHDx_T%aSRLPlJ!dCZ@B}H%KV#PcfKC&3Ds7u4sah=-KC~W1D#+?
zy?3hm7t<7X5oHw=&F{~0W!W*VOHphN^6H_v!E%0p)ahP^UAOFp2bk?kr4c&xMs!T8
zZ4$#Q6u5up%-ooV3Jf_$3)-w_!6IhEm0g##(%gZBX8X&nLL%fT*y+(XJGNRvgjW_X
z9VW^7mu5yhekdQ$+Y4@q=4gzSBEB2@+jJaIeF4r^LR6ZHfpw8wo7M0+Nk|LGdClcs
zTA|aOeqFqYQkdMA1z`r4K)EM+y)=BVLdETS;!Rc_2sk65!+rzHW;_uq^>N4GY~LKhdMSOX%uvMYPS@Tf55UuUM`yPAczdz+zzJqm&|6}%1T)%`R}CRGB>=k
zHz7QG!FVXo?0YB`F0754Q<$BLtNbx6w}^ha>2+3~YyXjwLwQ>)P}^2Q;@Uj!Z#DN`
z1Wf)I4l$>DBWbapEU@xz=^Kw?nOqVzbO9>)vP*R~^+)CWKL3%qZm0@VhF29xUW+CF
zT?YEQU-)xg`u|(O;rOpJ$sf!g5RrKKCcYmszV%miycz`CBK
bLvz$GSVQej`hfoffT&;BxmI%J-sArT%L>Js
literal 0
HcmV?d00001
diff --git a/notebooks/21_reinforcement_learning/resources/agent.jpg b/notebooks/21_reinforcement_learning/resources/agent.jpg
new file mode 100644
index 0000000000000000000000000000000000000000..c3e0d9e648d5986b9fdf743d878e0bc5a511e0e0
GIT binary patch
literal 28238
zcmdqI1yo&4(kQwQc7Wg^xVr}k8r*^e2ol^~5AN>n5F`+SL$Kf;+%0%;cekJc?#VYZ
z|HwD@zIp%p*Lv%|K5L&{tGi3Oc1v}2_0#Or8W1VrW?>2da&iCz004*p0tg*|`2zv~
zRS*^c3xz?@9{?f%;QxRDKo>;#U$7~N?yq&ASRl4Pa9=>{*+3sa)Z~q`jg|XvG^XEw
ze|!HmpG?8r*2$Jk!q~#e)&Y201u$gZNWTF<(8Tl`;b|R)PFhS%PeJM38);byXaNAg
z;Qb~H#|i)t8z%=PNl`L&4NbD&d4Ed=MviuWga6GNO54TCp#x-o&MkT
zjqDtL!;4UuA7btNo7eC3TVfOw8x>{fTOIlk0giwYAPI;<;s4R}cYWIC007TV0Dz7A
zd!At$0MrKq0N(uH^Qf`_06ho*>b~0@?y#uVDI&IIO)mpAQ88N2R~102u@iMnpn@gMobptzd-#
z0PwIdu&{6lnCM9Gu&4kC1{Mw;Kma3RB4LpuKWD?nr%-%>gUc?GOh8FZ%gN27s;*(+
z7XP7k>IWVb4HvJdlA67~*;RFY>od-~0LC|KvVgf?I=N=e;;VcC@Y-ITV%K?#Fwg}PiC^98K8w$z_
z03M9tJl^{P1viz4-%X!iAjitwz}u$CcTn
zz9zQ5);CN~8K0#SjJdBlguT~Vo2ziMjbkTOpznT;o`3ag0sVD)NH3W+As#>n)LSgr
zEibMu+>Zp={tB%?+a?e4aa}kMm;4N=zs{ZNr~q(}_^d%kwxQ-=;X8d}=Brs5!q~>E
z#h!(Xbx?(0$a3-}_DQq+Mg^47ECPwnb=x^99iEh7stbx2WxoQjOBl=R*U(LuQeF0p*f@iysaQijGun+K9lW!kG(g}eqo4@E9W?ioeHc@
z*MisCLjx6vVMOBIyC=fGsWd=&E!1t#!jf&Wk|@Gth}rio-^daLkpiggE}-qKtgrK~
zv%W@NeBv&w2y?LH)-)hkY@d|nOV=$7;vg4prBJ#swYDR^kae6Dv~hqnW_cI?nLWU@
z4cx8(HU`{Te10HgGCo9vwQbs)2>Xv?2aYQqcVoT$1M}X&iqy8+U-L=^Wlqzo4Vv4*
z_1HbxOyf}{VMXhROnYxCr9N;VJHJp5Puih#J&;Cw)>&nHh)798{Ltl5?d{A7b
z@5@N#dk{mJP7WW#$c02iu36L%dR&_|YAl5;>V5p>|A7%;6|Hm*^(
zl08fBV7*NvBB9Kb1zEAFQc9fd}s
z23;dp44zcqHwkWLk}8+@IEU#WXBp!e5hr=4X*to!9!*!d@u%WnE?!
zW{`?~@*VsRIAMUGB?53>u)9zUJsy@&k0;7v}AkjWG}JL2w44*k_YW
z;Ciu5j!|;Oqcy^93^t4VNh
zjLw}B>w5yQZ@3SH(a7eBi0h^^fy;W}HwSTkVL?kTY;+#3Q1Bn5Qc=h+#%nzMUALjM
zRteD_%R;AxF9OXS3vV@6(XuE=1qdaJ5bk5g-vP7cBz6!qrze1sA2bEm0Yz&{;$HNL
za6{;7c=)@-El+?Pq2%%t@NBm*RN)uLZ(f2!{Gc)3(4AcKOLV<)8Q=^muWf5-1$0F<
z_^m2rNTaxqr4wiyHFRx}4z`<6%)j-nx*arB_XG$qN)~d5JOP}7qyc{XA~igOCnfv<
zQM2vEM2&}|o9%`2+6@#1SImzBwuA|QL4b+5eKWCZu%b`;@e>%TeRK>68p8cPRQrE?
z7Xk3}rE;(T4#L>>y-x`fsEwOud7mrh*Gk77{l+!jKeSOi8`?(iQ%Hk1_8DrI-F*$-
z_g8216R4lV9D?UeEz{qfb351S>primajz06z4>~K7Y06A$B(2ki72VCd%)R^G>%fi
zfxR|n#cSulG#S@}Jj2s|#$Mw*>7RE<57MCt8RUUT*gVrwe8HF7lw56sgVazH`Ivg}
zWOWa=NxaQ&ehjKWyFOg;vnB)=aiMRbBct6a
zxLeiTR9n++=AU>Luhba+a7w2#w&8VQ9sb}}gg3w`H}rE%{1}@a&5%ktd>Ee9UmP$O
zc&;XwS!Z*W9;CzJ5+P5>Ym<#5(&Ze6Oi{&9KtmDEkQU)IMIKa0=N~}<`we{kmn9n>
z5ccr%3k_K%6Y$odV-&$f6$%Trcp1iB-~itDn#_Ed1yUCzhmoWNgxeKaaZ{xm>lV3Y
z#f7O!|Kl|uW9%rgO5RG|gQA{M~yk7TeO~)=s7Ep5nle)$q`5+
z=M4&gp=_|O0UJjK83$p%y2eR1e6b=S$!9DoT|D&?Lt#*M&x^mv>#^dDE$X|I2hW{z}BiM2~k+|ZFyFn5w3
zyk&Pgxtwe8^7gs!JH8nc6xRw8nn^Z3TiL6&bGv)Ej$Jy-@(Qa*^etl=*l3$HT|4`E
ze?r>4A&k~vk7&1G^_CrB;iyid;kE5Xeg33xi2pgoB6*Wfw3wFOV$H3jN6F08J51#+UZmkwpN!
zQ*HMXI0ap{!0OKmq87o3Xxt#s|NLbb4Uo3{5W&v+NnC?kd%A)PnsdTzKO&I7goV`ZUQOXE`x;+-^Te
zdP>!fK7GQRjXnmSmeAO5YY+z}tx@9Z4)=y#!WR<$plJ68`1ch*+6f$X}(
zVg@^8Cr8vX_YhEs!sKJpCYK+sDm#RaH24P=e@d&0%OK;}7jG9~f4kWk2PZC$sC90Q
z3`27nB%B3M2uOB&jJb{mkn=m9JzV#2YvJ_;A5%8wt?Jv+JnPr27)h)se@~mBOB?Kp
z8wOF+_m8ks*@%CRFC?{n<((5uhuPF6GA$3fD@mOuMe|a6yQZt%7(gt$ch6Mue*LB9
zQbWgn987h063I_$(!P50Lu7|g3D_iAlDN>isS{jlmB8wuZ%peb|ifc1E
zaMhG6&Y@A5qtn$#)=Z@DO~c<^j2$hk!wvUFr!55g2-#f37?$|9kK(;X(a_`ug%PZ<
zoK@T^quz_>Y4^x>q|``ue}*~upmjH0m210b@X?+!NYf+=HuWgKw57~|WmOh6B{9G*
zr`s9tu!oc?$*^kV16`clQuv@EDq`PKmNsviuE{e-cnc%t+euEt;dl>qH3YIAt`f!s
z&DXDoT>|01dd6(1nBr_z+Kvb+y2L?N_;gi8Ok*Zem|f47%YM^
z^DZ*e^Ac*$eG*NxG_)E!NxbiZM4FgWTO7=)b2U_I*
z`Ewrz_JK@5>=s=1AMdfm9P21AW5*;{M0d`;+eBCyC9g^#zN})8?`%bO|2L*sVCFA5
zUSFkTIau3@d0hD=K+G=KiE>QRfY{Fi0M%iV=%coaNqD}>Z9^pHLu)4-HXfhNu0VjU
zUc%^p0wVia$DFqw3pdOC%~hx~4~WgC{dEP$jkmZAFR1Xlh5L9C%iww5Z*O52Zn^jvMfH%Mo>2TwlMo
zMB+O{W**Ey?r|5>=8l!P+OX@xxsdS0O<-0KE67;jwPq(JLG|yNvL?z=l(ZG8Hhzw{@0l)7
z$PSgHXz@m|C(S@UG*1keHRa}ESau3ZjT5a(nZ{AY;JNH58&Z9lqDN-H6X!mW(X^jl
zuz98AkugmpHo;jPU*OdL*x$n13T`x=Xx&q{3z7GS1HA0PiD7DX<}m
z2rdA=g2PiNudKjfA3B%W5|k4@J^{nBdV_(+k#u`e9{u=2`sxxWFUBF8Bv&sFvkpJqE0>q
zW0Dabz$|KK~Ku(UtrBpQD`7W7c3tL|QJtB?ZDnR&hd-Uj?cU_6lnabBY
zzAX3ZX5|$~`S)Kc);=-2B-Y1M*9%tFPrwT;(%dF+I$y28)%@WT(4*S7d+h%g)MhAZ
z<^S(cQNhzJQb&MY{ZvTkdaXdW0{e7fQPlKvOHZA;HKw2HyirjZ^$Hm#T*w-1$0E(!
zi4t;@uN)Hcu`)zN!{EVp_>!frOn41y28u3_WCdX^(CebOo!QriM7_j;tH|h8f7;o?
zaq}4pNj<=MU$XepE?ho`sqwX}`*8s$w0U(&5)vp`OWExx`C6tmhpoVUSkXo2niq6b
z@Ol=paUe}apTB50K;g)MG7noK`x8e02MIqvS^!wF%W4<&^dS^71_@H?t*jM;Ggm^s
z&jIhkeq7&J`tQ`QRbL4|U*~*yB*=06-85nybI}j!uf@x>s*rj*fQO4zH6-F1fgMPl$7I|KwsshuJobpf
z{%m9#`h=`~TT|;EA@IX%qRi9tTh{Iy@8^1nhPurN4{(COqws0gzGFTZan
zoZIH66Bvgd(!F+)+~9A?{toRRk{SY1My4cv??P1gt~gJ&DEoMso;%B$Sy_!gh*cB(ee|1zHqzDP$+FU2bT=aur5
zyws;O%UnbhZB~~H=#g|74@>)M_a|Y=+}c_V@t7TX93KlSIX#3=Q?6pbBd5twvzkGW?Z!eT
zOfHf*!ATd)d&SIJ+nWSz1HAi>M#NZ+`njWyZ#20izTz=d)IiKvSzqv6#k@F)5?0tI
zLw8cR))sX0n#&L@%UFSEq#Onm#+g*yrZxUyEvV(#2h$H*3-sY3sfEbmlpP
z(mmZ}^V+&2uSw=vVDU%PCAw8)|Mue2svw3WV%4lj9}p(mgdT>DU)STiCdWk?1tI!3
zuZ>X*K#Sr4oVYNu0jrIU9|{efjOd@Hpx-u7@DpI!ydSr^ay|IgN2kx}`ODC|=eWNX
zhwj*AD;sw27=#DFKVa$U7jpic>g;~t6
zJN7fhs-pa^M`H1b1k>q#3p!r$Ra!Y(=^_Q5P!YDb4d5_81sBSB=J=nVEPv^S{TlT@
z!s)*ot@U{+?qeo=6>02krrCy@BlIfEQ=&?+x=b%}oHT7rAZLzLW9DnmZtU4Dn%rR25~x_O=CjC0AE2j#_UaE%t_!RzUpZ)|$4$LTMdmWZm-FZlJg#NiH{lHu7M8UGq4Ujs
zaGCp#sEv^5jO-|>gKx?_CoGnRTX~D--kgL67$M0g2U_iZ}`k>{&HYa3wgi6yR
zTniy3cGUdpy3cI7iYca|RLi?L6Z7O|JFtq
zdG8x)PT8yKJP9F>e6Vl;cUGH%y@!ym5OL2F&~;Ae
zE9rZc=eq*g{dvQU`UK#=R)GfM(%5snupVq$=%pk
zku3Mo8+WvhyXwo0h74ukcgYQ;(I?z3WUcLdjDN?AAL^O?Tu72bjh;f^V*+lDIQOJUy*TfAsGGNcQwoV2^04?(O6VJg{o!p0C@%<&a
zI_sJRe%?w#SqH_6RbUPM{=4M5la!+sJp@Yc#w}*SN=RNnhV~Cm`IsP9uZdF--La6J
zIf-K{iIAyJhW82gr0V2#Yg>*WHumo(p%a^1^etaJa6*JUTNhWvTe$Z`@;y2>Zo3xb
zxRf_$-6NK#d<>H;*UE*uc~r)htLvF=`;9ji8eFj?Ha0eT#uLFdQw|0aB#zgfKq{>I
z4c=ZChd}!1{NB{ibIF}DJw5ra2RX#gY2F_;L%sy(XB!>+IF?I=)vBu4N_B%}8t77t
z!s)2$|E)7m^a-fd`6*>*d%s29#xEc&k#>)(4WqUBYGPl`5g
z>m9w&VO=y_7sPjd`rEI87Z1V6&yK*jxle#=%=#0sc8hwyiVp3T%SVlESz@-K-n)Ea
z|5gBn9{P3}L;nc!illZJ7*Z#2K0=_St4KqMBBoO5_R$qOf&?2yKfle0vC8v;ETT?{
zafJZIoz2;v=Y$!*Rt^2(vU_+RM`)7ugdC2Dv`mh3g4L1Li{wuS=^JAx;G|)YKCl5B
zqg3>gsc=~cyEx>BLwixVgR;tI*1%`_5E49LSu%LmRJdOYiE$ZSQ!XTi$f3N0AiemM
zNwT9>hH7Eq-i2)r9uHv!c2i8)n1Smrw2gi!*%3zAigZZb?YHNtZB8t(j(=J1!Y!8da+gQXPkQU8$%->Kv0zChm=w`0Jh1xAci`QXCKwPRG_N!cq%%V2b%1sfx3NW#uZ{L)eA}73h2(
z-{{F|caldOuELV~RE#Yo1we)oX16cvFt@NSaNBd>U;6n7LhFP8D=#$K`O_Z|LU#EV
z!o@-IAzO;V^fENEvZASjNpSge@A=r93_zf3#_DfpDCCD6LMG`P*o1YAVtPp~t1GXR!PiA34hZZaXA!-fS{vpdtj*6v&@`CL
z@3u;YC74$@&!Z4WLQoiMzg#D8Y%hOqZXzW&h_JbXg}omh*-u4>%@=C9fTnF_n@baT(4Ts39EsL!{&PB5-Z|0{u7n2bT*A5s955ADY7&H@E(=43H!0pC
zTM<&g!D)0rNVci16GMsP7o*0QaZPRn;rDcHwuuZra>Yu4*&`i#PFWBv0$W}kip=uS
z0w#heFKg5#vz5a?-u_B@0B=R1q3N=&Qv6wDz`f@)hgvr)RZb2SMeyV>7OC+ptDRGn
zxDLbEmzRTf5@JY_`ag7uTjcHKt81dkzj(Hu_j1vf4n(GfD9Pn_nsR*$nGjshFtBM%
zO>WS}i0ZXOqsLb``4lcb6fOl1OCy8Y+wrk#Qnvucj+Ml^Pea4XLOfh>bhOcE%h(XP
zA(fEjk&{~F-o>Un6e0G#4Jz@qrTgpsYqKrHQkV8o<=G5w^(6OC{L@4gT;hQH2Q02?
z8GX8=K^T}Xum%__=>-qwH00JmaJ{WxNGC?6KKVX|u^*y>9v#m&()*Sa+^-|I==Hg6
z_ew%1^_RZD$=c)^TC?_#t#}h=-dOiB8H{Q+?s1xIJHLw0qEnpdtu2j-!2#GQ~h`Lqtz7mKJK=`ik76k%xvt@_9S
zhm{x6Jp<+&?GW)Qm>&ZYN@=WW8E^p8MvA>Oq_Mesj-iV=A0j;=MX*$AGe@NvXp7D)8@ci#
zzHLHK(3-EZNK-$Ew=;h5*+;rFVUOB+H`p_POe5i#1ZU21!E%P$eZ}=|!iK+3DcQ!%
z_Djpv27J@!O50ibkQul{&(~|s@2F{{90xH6{nTMAIUFL{)r)x;0t&jZ!kj4ntzG|r
z3MBcDhjl`Rk#)%{N{a)Gxq5j`nim=F#Zly(k+vE`S_?MZYD^2tFHK54&6s+N)=Gpc
z%IT?s^8#X3?lqa}(D7#Y
zT8Uru3vV8}T?W6wm-yUjx;^bnQJPiSsGCpJ1O()>>Jt+Ds)CH-c;TLrHf#$Hq@lmo
zyD3J0QIXd}Xx-<6MjC3_a1>E@*3vlB3?k^^
zM(l=8T$xw3_bTJ|;f15Nn4_3}vlO!%-3#f&969`F&Xvy8G{>qYG%}xtB=RC6%I3?7U((>xx=;LP+jy@R_-G#e*c%PM9T+)&e|vK*MBK6xw{WKgo|L+`(OdEU
zt`?2(YC}5gNcJr4Mq&N!ZWY7akMdf*-o