Skip to content
16 changes: 15 additions & 1 deletion units/en/_toctree.yml
@@ -122,6 +122,8 @@
title: Diving deeper into policy-gradient
- local: unit4/pg-theorem
title: (Optional) the Policy Gradient Theorem
- local: unit4/glossary
title: Glossary
- local: unit4/hands-on
title: Hands-on
- local: unit4/quiz
@@ -146,6 +148,8 @@
title: Hands-on
- local: unit5/bonus
title: Bonus. Learn to create your own environments with Unity and MLAgents
- local: unit5/quiz
title: Quiz
- local: unit5/conclusion
title: Conclusion
- title: Unit 6. Actor Critic methods with Robotics environments
@@ -157,7 +161,9 @@
- local: unit6/advantage-actor-critic
title: Advantage Actor Critic (A2C)
- local: unit6/hands-on
title: Advantage Actor Critic (A2C) using Robotics Simulations with PyBullet and Panda-Gym 🤖
title: Advantage Actor Critic (A2C) using Robotics Simulations with Panda-Gym 🤖
- local: unit6/quiz
title: Quiz
- local: unit6/conclusion
title: Conclusion
- local: unit6/additional-readings
@@ -174,6 +180,8 @@
title: Self-Play
- local: unit7/hands-on
title: Let's train our soccer team to beat your classmates' teams (AI vs. AI)
- local: unit7/quiz
title: Quiz
- local: unit7/conclusion
title: Conclusion
- local: unit7/additional-readings
@@ -210,6 +218,8 @@
title: Model-Based Reinforcement Learning
- local: unitbonus3/offline-online
title: Offline vs. Online Reinforcement Learning
- local: unitbonus3/generalisation
title: Generalisation in Reinforcement Learning
- local: unitbonus3/rlhf
title: Reinforcement Learning from Human Feedback
- local: unitbonus3/decision-transformers
@@ -220,8 +230,12 @@
title: (Automatic) Curriculum Learning for RL
- local: unitbonus3/envs-to-try
title: Interesting environments to try
- local: unitbonus3/learning-agents
title: An introduction to Unreal Learning Agents
- local: unitbonus3/godotrl
title: An Introduction to Godot RL
- local: unitbonus3/student-works
title: Student projects
- local: unitbonus3/rl-documentation
title: Brief introduction to RL documentation
- title: Certification and congratulations
6 changes: 4 additions & 2 deletions units/en/communication/certification.mdx
@@ -3,8 +3,10 @@

The certification process is **completely free**:

- To get a *certificate of completion*: you need **to pass 80% of the assignments** before the end of July 2023.
- To get a *certificate of excellence*: you need **to pass 100% of the assignments** before the end of July 2023.
- To get a *certificate of completion*: you need **to pass 80% of the assignments**.
- To get a *certificate of excellence*: you need **to pass 100% of the assignments**.

There are **no deadlines; the course is self-paced**.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit0/certification.jpg" alt="Course certification" width="100%"/>

8 changes: 3 additions & 5 deletions units/en/unit0/discord101.mdx
@@ -5,20 +5,18 @@ Although I don't know much about fetching sticks (yet), I know one or two things

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit0/huggy-logo.jpg" alt="Huggy Logo"/>

Discord is a free chat platform. If you've used Slack, **it's quite similar**. There is a Hugging Face Community Discord server with 36000 members you can <a href="https://discord.gg/ydHrjt3WP5">join with a single click here</a>. So many humans to play with!
Discord is a free chat platform. If you've used Slack, **it's quite similar**. There is a Hugging Face Community Discord server with 50000 members you can <a href="https://discord.gg/ydHrjt3WP5">join with a single click here</a>. So many humans to play with!

Starting in Discord can be a bit intimidating, so let me take you through it.

When you [sign-up to our Discord server](http://hf.co/join/discord), you'll choose your interests. Make sure to **click "Reinforcement Learning"**.
When you [sign-up to our Discord server](http://hf.co/join/discord), you'll choose your interests. Make sure to **click "Reinforcement Learning,"** and you'll get access to the Reinforcement Learning Category containing all the course-related channels. If you feel like joining even more channels, go for it! 🚀

Then click next, and you'll get to **introduce yourself in the `#introduce-yourself` channel**.


<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit0/discord2.jpg" alt="Discord"/>

## So which channels are interesting to me? [[channels]]

They are in the reinforcement learning lounge. **Don't forget to sign up to these channels** by clicking on 🤖 Reinforcement Learning in `role-assigment`.
They are in the reinforcement learning category. **Don't forget to sign up to these channels** by clicking on 🤖 Reinforcement Learning in `role-assigment`.
- `rl-announcements`: where we give the **latest information about the course**.
- `rl-discussions`: where you can **discuss RL and share information**.
- `rl-study-group`: where you can **ask questions and exchange ideas with your classmates**.
22 changes: 9 additions & 13 deletions units/en/unit0/introduction.mdx
@@ -59,10 +59,11 @@ This is the course's syllabus:

You can choose to follow this course either:

- *To get a certificate of completion*: you need to complete 80% of the assignments before the end of July 2023.
- *To get a certificate of honors*: you need to complete 100% of the assignments before the end of July 2023.
- *As a simple audit*: you can participate in all challenges and do assignments if you want, but you have no deadlines.
- *To get a certificate of completion*: you need to complete 80% of the assignments.
- *To get a certificate of honors*: you need to complete 100% of the assignments.
- *As a simple audit*: you can participate in all challenges and do assignments if you want.

There are **no deadlines; the course is self-paced**.
All of these options **are completely free**.
Whatever path you choose, we advise you **to follow the recommended pace to enjoy the course and challenges with your fellow classmates.**

@@ -72,8 +73,10 @@ You don't need to tell us which path you choose. **If you get more than 80% of t

The certification process is **completely free**:

- *To get a certificate of completion*: you need to complete 80% of the assignments before the end of July 2023.
- *To get a certificate of honors*: you need to complete 100% of the assignments before the end of July 2023.
- *To get a certificate of completion*: you need to complete 80% of the assignments.
- *To get a certificate of honors*: you need to complete 100% of the assignments.

Again, there's **no deadline** since the course is self-paced. But we advise you **to follow the recommended pace** described below.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit0/certification.jpg" alt="Course certification" width="100%"/>

@@ -100,15 +103,8 @@ You need only 3 things:

## What is the recommended pace? [[recommended-pace]]

We defined a plan that you can follow to keep up the pace of the course.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit0/pace1.jpg" alt="Course advice" width="100%"/>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit0/pace2.jpg" alt="Course advice" width="100%"/>


Each chapter in this course is designed **to be completed in 1 week, with approximately 3-4 hours of work per week**. However, you can take as much time as necessary to complete the course. If you want to dive into a topic more in-depth, we'll provide additional resources to help you achieve that.


## Who are we [[who-are-we]]
About the author:

@@ -120,7 +116,7 @@ About the team:
- <a href="https://twitter.com/RisingSayak"> Sayak Paul</a> is a Developer Advocate Engineer at Hugging Face. He's interested in the area of representation learning (self-supervision, semi-supervision, model robustness). And he loves watching crime and action thrillers 🔪.


## When do the challenges start? [[challenges]]
## What are the challenges in this course? [[challenges]]

In this new version of the course, you have two types of challenges:
- [A leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) to compare your agent's performance to other classmates'.
2 changes: 1 addition & 1 deletion units/en/unit0/setup.mdx
@@ -15,7 +15,7 @@ You can now sign up for our Discord Server. This is the place where you **can ch

👉🏻 Join our discord server <a href="https://discord.gg/ydHrjt3WP5">here.</a>

When you join, remember to introduce yourself in #introduce-yourself and sign-up for reinforcement channels in #role-assignments.
When you join, remember to introduce yourself in #introduce-yourself and sign up for the reinforcement learning channels in #channels-and-roles.

We have multiple RL-related channels:
- `rl-announcements`: where we give the latest information about the course.
10 changes: 5 additions & 5 deletions units/en/unit1/hands-on.mdx
@@ -5,7 +5,7 @@

<CourseFloatingBanner classNames="absolute z-10 right-0 top-0"
notebooks={[
{label: "Google Colab", value: "https://colab.research.google.com/github/huggingface/deep-rl-class/blob/master/notebooks/unit1/unit1.ipynb"}
{label: "Google Colab", value: "https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/notebooks/unit1/unit1.ipynb"}
]}
askForHelpUrl="http://hf.co/join/discord" />

@@ -282,7 +282,7 @@ env.close()

## Create the LunarLander environment 🌛 and understand how it works

### [The environment 🎮](https://gymnasium.farama.org/environments/box2d/lunar_lander/)
### The environment 🎮

In this first tutorial, we’re going to train our agent, a [Lunar Lander](https://gymnasium.farama.org/environments/box2d/lunar_lander/), **to land correctly on the moon**. To do that, the agent needs to learn **to adapt its speed and position (horizontal, vertical, and angular) to land correctly.**
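
If you want to poke at the environment before opening the notebook, a minimal interaction loop looks roughly like this (a sketch assuming `gymnasium` with the Box2D extra installed; the notebook itself walks through the same steps in more detail):

```python
import gymnasium as gym

# Create the LunarLander environment used in this unit
env = gym.make("LunarLander-v2")

# Reset the environment to get the first observation
observation, info = env.reset()

for _ in range(20):
    # Sample a random action (0: do nothing, 1: left engine, 2: main engine, 3: right engine)
    action = env.action_space.sample()
    observation, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        observation, info = env.reset()

env.close()
```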

@@ -315,8 +315,8 @@ We see with `Observation Space Shape (8,)` that the observation is a vector of s
- Vertical speed (y)
- Angle
- Angular speed
- If the left leg contact point has touched the land
- If the right leg contact point has touched the land
- If the left leg contact point has touched the land (boolean)
- If the right leg contact point has touched the land (boolean)


```python
@@ -433,7 +433,7 @@ model = PPO(
# TODO: Train it for 1,000,000 timesteps

# TODO: Specify file name for model and save the model to file
model_name = ""
model_name = "ppo-LunarLander-v2"
```
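
Before checking the official solution below, here is one rough way the TODOs could be filled in (a sketch only; the real solution may use different hyperparameters and a vectorized environment):

```python
import gymnasium as gym
from stable_baselines3 import PPO

# Sketch only: the notebook builds `env` earlier, recreated here so the snippet stands alone
env = gym.make("LunarLander-v2")
model = PPO(policy="MlpPolicy", env=env, verbose=1)

# Train it for 1,000,000 timesteps
model.learn(total_timesteps=1_000_000)

# Specify a file name and save the model to file
model_name = "ppo-LunarLander-v2"
model.save(model_name)
```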

#### Solution
4 changes: 2 additions & 2 deletions units/en/unit1/rl-framework.mdx
@@ -83,11 +83,11 @@ The actions can come from a *discrete* or *continuous space*:

<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/mario.jpg" alt="Mario">
<figcaption>Again, in Super Mario Bros, we have only 5 possible actions: 4 directions and jumping</figcaption>
<figcaption>In Super Mario Bros, we have only 4 possible actions: left, right, up (jumping) and down (crouching).</figcaption>

</figure>

In Super Mario Bros, we have a finite set of actions since we have only 4 directions and jump.
Again, in Super Mario Bros, we have a finite set of actions since we have only 4 directions.

- *Continuous space*: the number of possible actions is **infinite**.
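
As an added illustration (not part of the original text), this is how the two kinds of spaces can be described with `gymnasium` spaces: a `Discrete` space enumerates a finite set of actions, while a `Box` space allows any value in a range.

```python
import numpy as np
import gymnasium as gym

# Discrete space: a finite set of 4 possible actions, like the Super Mario Bros example
mario_like_actions = gym.spaces.Discrete(4)

# Continuous space: a steering command that can take any value in [-1.0, 1.0]
steering_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)

print(mario_like_actions.sample())  # e.g. 2
print(steering_space.sample())      # e.g. [0.31847]
```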

2 changes: 1 addition & 1 deletion units/en/unit1/two-methods.mdx
@@ -54,7 +54,7 @@ We have two types of policies:
</figure>

<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/mario.jpg" alt="Mario"/>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/policy-based.png" alt="Policy Based"/>
<figcaption>Given an initial state, our stochastic policy will output probability distributions over the possible actions at that state.</figcaption>
</figure>

8 changes: 7 additions & 1 deletion units/en/unit2/glossary.mdx
@@ -11,7 +11,7 @@ This is a community-created glossary. Contributions are welcomed!
### Among the value-based methods, we can find two main strategies

- **The state-value function.** For each state, the state-value function is the expected return if the agent starts in that state and follows the policy until the end.
- **The action-value function.** In contrast to the state-value function, the action-value calculates for each state and action pair the expected return if the agent starts in that state and takes an action. Then it follows the policy forever after.
- **The action-value function.** In contrast to the state-value function, the action-value calculates for each state and action pair the expected return if the agent starts in that state, takes that action, and then follows the policy forever after.
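
A compact way to write these two definitions (an added note; \\(G_t\\) is the return from time step \\(t\\) and \\(\pi\\) is the policy being followed):

\\(V_{\pi}(s) = E_{\pi}[G_t \mid S_t = s]\\)

\\(Q_{\pi}(s, a) = E_{\pi}[G_t \mid S_t = s, A_t = a]\\)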

### Epsilon-greedy strategy:

@@ -32,6 +32,12 @@ This is a community-created glossary. Contributions are welcomed!
- **Off-policy algorithms:** A different policy is used at training time and inference time
- **On-policy algorithms:** The same policy is used during training and inference

### Monte Carlo and Temporal Difference learning strategies

- **Monte Carlo (MC):** Learning at the end of the episode. With Monte Carlo, we wait until the episode ends and then we update the value function (or policy function) from a complete episode.

- **Temporal Difference (TD):** Learning at each step. With Temporal Difference Learning, we update the value function (or policy function) at each step without requiring a complete episode.
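
To make the contrast concrete, here is a rough sketch of the two update rules for a state-value estimate (an illustrative addition; `V` is assumed to be a table of value estimates, `lr` a learning rate, and `gamma` a discount factor):

```python
# Monte Carlo: update only once the episode is over, using the full return G
def mc_update(V, state, G, lr):
    V[state] = V[state] + lr * (G - V[state])

# Temporal Difference (TD(0)): update at every step, using the one-step TD target
def td_update(V, state, reward, next_state, lr, gamma):
    td_target = reward + gamma * V[next_state]   # bootstrapped estimate of the return
    V[state] = V[state] + lr * (td_target - V[state])
```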

If you want to improve the course, you can [open a Pull Request.](https://github.com/huggingface/deep-rl-class/pulls)

This glossary was made possible thanks to:
24 changes: 11 additions & 13 deletions units/en/unit2/hands-on.mdx
@@ -2,7 +2,7 @@

<CourseFloatingBanner classNames="absolute z-10 right-0 top-0"
notebooks={[
{label: "Google Colab", value: "https://colab.research.google.com/github/huggingface/deep-rl-class/blob/master/notebooks/unit2/unit2.ipynb"}
{label: "Google Colab", value: "https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/notebooks/unit2/unit2.ipynb"}
]}
askForHelpUrl="http://hf.co/join/discord" />

@@ -93,16 +93,16 @@ Before diving into the notebook, you need to:

*Q-Learning* **is the RL algorithm that**:

- Trains *Q-Function*, an **action-value function** that encoded, in internal memory, by a *Q-table* **that contains all the state-action pair values.**
- Trains *Q-Function*, an **action-value function** that is encoded, in internal memory, by a *Q-table* **that contains all the state-action pair values.**

- Given a state and action, our Q-Function **will search the Q-table for the corresponding value.**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-function-2.jpg" alt="Q function" width="100%"/>

- When the training is done,**we have an optimal Q-Function, so an optimal Q-Table.**
- When the training is done, **we have an optimal Q-Function, so an optimal Q-Table.**

- And if we **have an optimal Q-function**, we
have an optimal policy, since we **know for, each state, the best action to take.**
have an optimal policy, since we **know, for each state, the best action to take.**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link value policy" width="100%"/>

@@ -146,7 +146,8 @@ pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/

```bash
sudo apt-get update
apt install python-opengl ffmpeg xvfb
sudo apt-get install -y python3-opengl
apt install ffmpeg xvfb
pip3 install pyvirtualdisplay
```
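
These packages let the headless Colab machine render frames. The notebook then typically starts a virtual display along these lines (a sketch; the exact cell may differ):

```python
# Start a virtual display so environments can be rendered without a physical screen
from pyvirtualdisplay import Display

virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()
```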

@@ -246,7 +247,7 @@ print("Observation Space", env.observation_space)
print("Sample observation", env.observation_space.sample()) # Get a random observation
```

We see with `Observation Space Shape Discrete(16)` that the observation is an integer representing the **agent’s current position as current_row * nrows + current_col (where both the row and col start at 0)**.
We see with `Observation Space Shape Discrete(16)` that the observation is an integer representing the **agent’s current position as current_row * ncols + current_col (where both the row and col start at 0)**.

For example, the goal position in the 4x4 map can be calculated as follows: 3 * 4 + 3 = 15. The number of possible observations is dependent on the size of the map. **For example, the 4x4 map has 16 possible observations.**
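
A quick way to check that formula (an illustrative helper, not part of the notebook):

```python
def to_state_index(row: int, col: int, ncols: int = 4) -> int:
    """Convert a (row, col) grid position into the Discrete observation index."""
    return row * ncols + col

print(to_state_index(3, 3))  # 15: the goal tile of the 4x4 map
```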

@@ -352,7 +353,7 @@ def greedy_policy(Qtable, state):
return action
```

##Define the epsilon-greedy policy 🤖
## Define the epsilon-greedy policy 🤖

Epsilon-greedy is the training policy that handles the exploration/exploitation trade-off.

@@ -388,9 +389,9 @@
```python
def epsilon_greedy_policy(Qtable, state, epsilon):
# Randomly generate a number between 0 and 1
random_int = random.uniform(0, 1)
# if random_int > greater than epsilon --> exploitation
if random_int > epsilon:
random_num = random.uniform(0, 1)
# if random_num is greater than epsilon --> exploitation
if random_num > epsilon:
# Take the action with the highest value given a state
# np.argmax can be useful here
action = greedy_policy(Qtable, state)
@@ -716,13 +717,10 @@ def push_to_hub(repo_id, model, env, video_fps=1, local_repo_path="hub"):

## Usage

```python

model = load_from_hub(repo_id="{repo_id}", filename="q-learning.pkl")

# Don't forget to check if you need to add additional attributes (is_slippery=False etc)
env = gym.make(model["env_id"])
```
"""

evaluate_agent(env, model["max_steps"], model["n_eval_episodes"], model["qtable"], model["eval_seed"])
26 changes: 17 additions & 9 deletions units/en/unit2/mc-vs-td.mdx
@@ -57,18 +57,26 @@ For instance, if we train a state-value function using Monte Carlo:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/MC-4p.jpg" alt="Monte Carlo"/>


- We have a list of state, action, rewards, next_state, **we need to calculate the return \\(G{t}\\)**
- \\(G_t = R_{t+1} + R_{t+2} + R_{t+3} ...\\)
- \\(G_t = R_{t+1} + R_{t+2} + R_{t+3}…\\) (for simplicity we don’t discount the rewards).
- \\(G_t = 1 + 0 + 0 + 0+ 0 + 0 + 1 + 1 + 0 + 0\\)
- \\(G_t= 3\\)
- We can now update \\(V(S_0)\\):

- We have a list of state, action, rewards, next_state, **we need to calculate the return \\(G_{t=0}\\)**

\\(G_t = R_{t+1} + R_{t+2} + R_{t+3} ...\\) (for simplicity, we don't discount the rewards)

\\(G_0 = R_{1} + R_{2} + R_{3}…\\)

\\(G_0 = 1 + 0 + 0 + 0 + 0 + 0 + 1 + 1 + 0 + 0\\)

\\(G_0 = 3\\)

- We can now compute the **new** \\(V(S_0)\\):

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/MC-5.jpg" alt="Monte Carlo"/>

- New \\(V(S_0) = V(S_0) + lr * [G_t — V(S_0)]\\)
- New \\(V(S_0) = 0 + 0.1 * [3 – 0]\\)
- New \\(V(S_0) = 0.3\\)
\\(V(S_0) = V(S_0) + lr * [G_0 - V(S_0)]\\)

\\(V(S_0) = 0 + 0.1 * [3 - 0]\\)

\\(V(S_0) = 0.3\\)


<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/MC-5p.jpg" alt="Monte Carlo"/>
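
The same calculation in a few lines of code (an illustrative check of the numbers above, with a learning rate of 0.1 and no discounting):

```python
# Rewards collected during the episode shown above
rewards = [1, 0, 0, 0, 0, 0, 1, 1, 0, 0]

G_0 = sum(rewards)               # undiscounted return from the first state: 3
lr = 0.1
V_s0 = 0.0                       # initial value estimate of S_0

V_s0 = V_s0 + lr * (G_0 - V_s0)  # Monte Carlo update
print(G_0, V_s0)                 # 3 0.3
```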