Skip to content
16 changes: 15 additions & 1 deletion units/en/_toctree.yml
@@ -122,6 +122,8 @@
title: Diving deeper into policy-gradient
- local: unit4/pg-theorem
title: (Optional) the Policy Gradient Theorem
- local: unit4/glossary
title: Glossary
- local: unit4/hands-on
title: Hands-on
- local: unit4/quiz
@@ -146,6 +148,8 @@
title: Hands-on
- local: unit5/bonus
title: Bonus. Learn to create your own environments with Unity and MLAgents
- local: unit5/quiz
title: Quiz
- local: unit5/conclusion
title: Conclusion
- title: Unit 6. Actor Critic methods with Robotics environments
@@ -157,7 +161,9 @@
- local: unit6/advantage-actor-critic
title: Advantage Actor Critic (A2C)
- local: unit6/hands-on
title: Advantage Actor Critic (A2C) using Robotics Simulations with PyBullet and Panda-Gym 🤖
title: Advantage Actor Critic (A2C) using Robotics Simulations with Panda-Gym 🤖
- local: unit6/quiz
title: Quiz
- local: unit6/conclusion
title: Conclusion
- local: unit6/additional-readings
@@ -174,6 +180,8 @@
title: Self-Play
- local: unit7/hands-on
title: Let's train our soccer team to beat your classmates' teams (AI vs. AI)
- local: unit7/quiz
title: Quiz
- local: unit7/conclusion
title: Conclusion
- local: unit7/additional-readings
@@ -210,6 +218,8 @@
title: Model-Based Reinforcement Learning
- local: unitbonus3/offline-online
title: Offline vs. Online Reinforcement Learning
- local: unitbonus3/generalisation
title: Generalisation in Reinforcement Learning
- local: unitbonus3/rlhf
title: Reinforcement Learning from Human Feedback
- local: unitbonus3/decision-transformers
@@ -220,8 +230,12 @@
title: (Automatic) Curriculum Learning for RL
- local: unitbonus3/envs-to-try
title: Interesting environments to try
- local: unitbonus3/learning-agents
title: An introduction to Unreal Learning Agents
- local: unitbonus3/godotrl
title: An Introduction to Godot RL
- local: unitbonus3/student-works
title: Student projects
- local: unitbonus3/rl-documentation
title: Brief introduction to RL documentation
- title: Certification and congratulations
6 changes: 4 additions & 2 deletions units/en/communication/certification.mdx
@@ -3,8 +3,10 @@

The certification process is **completely free**:

- To get a *certificate of completion*: you need **to pass 80% of the assignments** before the end of July 2023.
- To get a *certificate of excellence*: you need **to pass 100% of the assignments** before the end of July 2023.
- To get a *certificate of completion*: you need **to pass 80% of the assignments**.
- To get a *certificate of excellence*: you need **to pass 100% of the assignments**.

There are **no deadlines; the course is self-paced**.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit0/certification.jpg" alt="Course certification" width="100%"/>

8 changes: 3 additions & 5 deletions units/en/unit0/discord101.mdx
@@ -5,20 +5,18 @@ Although I don't know much about fetching sticks (yet), I know one or two things

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit0/huggy-logo.jpg" alt="Huggy Logo"/>

Discord is a free chat platform. If you've used Slack, **it's quite similar**. There is a Hugging Face Community Discord server with 36000 members you can <a href="https://discord.gg/ydHrjt3WP5">join with a single click here</a>. So many humans to play with!
Discord is a free chat platform. If you've used Slack, **it's quite similar**. There is a Hugging Face Community Discord server with 50000 members you can <a href="https://discord.gg/ydHrjt3WP5">join with a single click here</a>. So many humans to play with!

Starting in Discord can be a bit intimidating, so let me take you through it.

When you [sign-up to our Discord server](http://hf.co/join/discord), you'll choose your interests. Make sure to **click "Reinforcement Learning"**.
When you [sign-up to our Discord server](http://hf.co/join/discord), you'll choose your interests. Make sure to **click "Reinforcement Learning,"** and you'll get access to the Reinforcement Learning Category containing all the course-related channels. If you feel like joining even more channels, go for it! 🚀

Then click next, and you'll get to **introduce yourself in the `#introduce-yourself` channel**.


<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit0/discord2.jpg" alt="Discord"/>

## So which channels are interesting to me? [[channels]]

They are in the reinforcement learning lounge. **Don't forget to sign up to these channels** by clicking on 🤖 Reinforcement Learning in `role-assigment`.
They are in the reinforcement learning category. **Don't forget to sign up to these channels** by clicking on 🤖 Reinforcement Learning in `role-assigment`.
- `rl-announcements`: where we give the **latest information about the course**.
- `rl-discussions`: where you can **discuss RL and share information**.
- `rl-study-group`: where you can **ask questions and exchange ideas with your classmates**.
22 changes: 9 additions & 13 deletions units/en/unit0/introduction.mdx
@@ -59,10 +59,11 @@ This is the course's syllabus:

You can choose to follow this course either:

- *To get a certificate of completion*: you need to complete 80% of the assignments before the end of July 2023.
- *To get a certificate of honors*: you need to complete 100% of the assignments before the end of July 2023.
- *As a simple audit*: you can participate in all challenges and do assignments if you want, but you have no deadlines.
- *To get a certificate of completion*: you need to complete 80% of the assignments.
- *To get a certificate of honors*: you need to complete 100% of the assignments.
- *As a simple audit*: you can participate in all challenges and do assignments if you want.

There are **no deadlines; the course is self-paced**.
All of these options **are completely free**.
Whatever path you choose, we advise you **to follow the recommended pace to enjoy the course and challenges with your fellow classmates.**

@@ -72,8 +73,10 @@ You don't need to tell us which path you choose. **If you get more than 80% of t

The certification process is **completely free**:

- *To get a certificate of completion*: you need to complete 80% of the assignments before the end of July 2023.
- *To get a certificate of honors*: you need to complete 100% of the assignments before the end of July 2023.
- *To get a certificate of completion*: you need to complete 80% of the assignments.
- *To get a certificate of honors*: you need to complete 100% of the assignments.

Again, there's **no deadline** since the course is self-paced. But we advise you **to follow the recommended pace** described below.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit0/certification.jpg" alt="Course certification" width="100%"/>

@@ -100,15 +103,8 @@ You need only 3 things:

## What is the recommended pace? [[recommended-pace]]

We defined a plan that you can follow to keep up the pace of the course.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit0/pace1.jpg" alt="Course advice" width="100%"/>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit0/pace2.jpg" alt="Course advice" width="100%"/>


Each chapter in this course is designed **to be completed in 1 week, with approximately 3-4 hours of work per week**. However, you can take as much time as necessary to complete the course. If you want to dive into a topic more in-depth, we'll provide additional resources to help you achieve that.


## Who are we [[who-are-we]]
About the author:

@@ -120,7 +116,7 @@ About the team:
- <a href="https://twitter.com/RisingSayak"> Sayak Paul</a> is a Developer Advocate Engineer at Hugging Face. He's interested in the area of representation learning (self-supervision, semi-supervision, model robustness). And he loves watching crime and action thrillers 🔪.


## When do the challenges start? [[challenges]]
## What are the challenges in this course? [[challenges]]

In this new version of the course, you have two types of challenges:
- [A leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) to compare your agent's performance to other classmates'.
2 changes: 1 addition & 1 deletion units/en/unit0/setup.mdx
@@ -15,7 +15,7 @@ You can now sign up for our Discord Server. This is the place where you **can ch

👉🏻 Join our discord server <a href="https://discord.gg/ydHrjt3WP5">here.</a>

When you join, remember to introduce yourself in #introduce-yourself and sign-up for reinforcement channels in #role-assignments.
When you join, remember to introduce yourself in #introduce-yourself and sign up for the reinforcement learning channels in #channels-and-roles.

We have multiple RL-related channels:
- `rl-announcements`: where we give the latest information about the course.
10 changes: 5 additions & 5 deletions units/en/unit1/hands-on.mdx
@@ -5,7 +5,7 @@

<CourseFloatingBanner classNames="absolute z-10 right-0 top-0"
notebooks={[
{label: "Google Colab", value: "https://colab.research.google.com/github/huggingface/deep-rl-class/blob/master/notebooks/unit1/unit1.ipynb"}
{label: "Google Colab", value: "https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/notebooks/unit1/unit1.ipynb"}
]}
askForHelpUrl="http://hf.co/join/discord" />

@@ -282,7 +282,7 @@ env.close()

## Create the LunarLander environment 🌛 and understand how it works

### [The environment 🎮](https://gymnasium.farama.org/environments/box2d/lunar_lander/)
### The environment 🎮

In this first tutorial, we’re going to train our agent, a [Lunar Lander](https://gymnasium.farama.org/environments/box2d/lunar_lander/), **to land correctly on the moon**. To do that, the agent needs to learn **to adapt its speed and position (horizontal, vertical, and angular) to land correctly.**
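
If you want to poke at the environment before opening the notebook, a minimal interaction loop looks roughly like this (a sketch assuming `gymnasium` with the Box2D extra installed; the notebook itself walks through the same steps in more detail):

```python
import gymnasium as gym

# Create the LunarLander environment used in this unit
env = gym.make("LunarLander-v2")

# Reset the environment to get the first observation
observation, info = env.reset()

for _ in range(20):
    # Sample a random action (0: do nothing, 1: left engine, 2: main engine, 3: right engine)
    action = env.action_space.sample()
    observation, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        observation, info = env.reset()

env.close()
```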

@@ -315,8 +315,8 @@ We see with `Observation Space Shape (8,)` that the observation is a vector of s
- Vertical speed (y)
- Angle
- Angular speed
- If the left leg contact point has touched the land
- If the right leg contact point has touched the land
- If the left leg contact point has touched the land (boolean)
- If the right leg contact point has touched the land (boolean)


```python
@@ -433,7 +433,7 @@ model = PPO(
# TODO: Train it for 1,000,000 timesteps

# TODO: Specify file name for model and save the model to file
model_name = ""
model_name = "ppo-LunarLander-v2"
```
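
Before checking the official solution below, here is one rough way the TODOs could be filled in (a sketch only; the real solution may use different hyperparameters and a vectorized environment):

```python
import gymnasium as gym
from stable_baselines3 import PPO

# Sketch only: the notebook builds `env` earlier, recreated here so the snippet stands alone
env = gym.make("LunarLander-v2")
model = PPO(policy="MlpPolicy", env=env, verbose=1)

# Train it for 1,000,000 timesteps
model.learn(total_timesteps=1_000_000)

# Specify a file name and save the model to file
model_name = "ppo-LunarLander-v2"
model.save(model_name)
```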

#### Solution
4 changes: 2 additions & 2 deletions units/en/unit1/rl-framework.mdx
@@ -83,11 +83,11 @@ The actions can come from a *discrete* or *continuous space*:

<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/mario.jpg" alt="Mario">
<figcaption>Again, in Super Mario Bros, we have only 5 possible actions: 4 directions and jumping</figcaption>
<figcaption>In Super Mario Bros, we have only 4 possible actions: left, right, up (jumping) and down (crouching).</figcaption>

</figure>

In Super Mario Bros, we have a finite set of actions since we have only 4 directions and jump.
Again, in Super Mario Bros, we have a finite set of actions since we have only 4 directions.

- *Continuous space*: the number of possible actions is **infinite**.
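
As an added illustration (not part of the original text), this is how the two kinds of spaces can be described with `gymnasium` spaces: a `Discrete` space enumerates a finite set of actions, while a `Box` space allows any value in a range.

```python
import numpy as np
import gymnasium as gym

# Discrete space: a finite set of 4 possible actions, like the Super Mario Bros example
mario_like_actions = gym.spaces.Discrete(4)

# Continuous space: a steering command that can take any value in [-1.0, 1.0]
steering_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)

print(mario_like_actions.sample())  # e.g. 2
print(steering_space.sample())      # e.g. [0.31847]
```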

2 changes: 1 addition & 1 deletion units/en/unit1/two-methods.mdx
@@ -54,7 +54,7 @@ We have two types of policies:
</figure>

<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/mario.jpg" alt="Mario"/>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/policy-based.png" alt="Policy Based"/>
<figcaption>Given an initial state, our stochastic policy will output probability distributions over the possible actions at that state.</figcaption>
</figure>

8 changes: 7 additions & 1 deletion units/en/unit2/glossary.mdx
@@ -11,7 +11,7 @@ This is a community-created glossary. Contributions are welcomed!
### Among the value-based methods, we can find two main strategies

- **The state-value function.** For each state, the state-value function is the expected return if the agent starts in that state and follows the policy until the end.
- **The action-value function.** In contrast to the state-value function, the action-value calculates for each state and action pair the expected return if the agent starts in that state and takes an action. Then it follows the policy forever after.
- **The action-value function.** In contrast to the state-value function, the action-value calculates for each state and action pair the expected return if the agent starts in that state, takes that action, and then follows the policy forever after.
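
A compact way to write these two definitions (an added note; \\(G_t\\) is the return from time step \\(t\\) and \\(\pi\\) is the policy being followed):

\\(V_{\pi}(s) = E_{\pi}[G_t \mid S_t = s]\\)

\\(Q_{\pi}(s, a) = E_{\pi}[G_t \mid S_t = s, A_t = a]\\)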

### Epsilon-greedy strategy:

@@ -32,6 +32,12 @@ This is a community-created glossary. Contributions are welcomed!
- **Off-policy algorithms:** A different policy is used at training time and inference time
- **On-policy algorithms:** The same policy is used during training and inference

### Monte Carlo and Temporal Difference learning strategies

- **Monte Carlo (MC):** Learning at the end of the episode. With Monte Carlo, we wait until the episode ends and then we update the value function (or policy function) from a complete episode.

- **Temporal Difference (TD):** Learning at each step. With Temporal Difference Learning, we update the value function (or policy function) at each step without requiring a complete episode.
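
To make the contrast concrete, here is a rough sketch of the two update rules for a state-value estimate (an illustrative addition; `V` is assumed to be a table of value estimates, `lr` a learning rate, and `gamma` a discount factor):

```python
# Monte Carlo: update only once the episode is over, using the full return G
def mc_update(V, state, G, lr):
    V[state] = V[state] + lr * (G - V[state])

# Temporal Difference (TD(0)): update at every step, using the one-step TD target
def td_update(V, state, reward, next_state, lr, gamma):
    td_target = reward + gamma * V[next_state]   # bootstrapped estimate of the return
    V[state] = V[state] + lr * (td_target - V[state])
```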

If you want to improve the course, you can [open a Pull Request.](https://github.com/huggingface/deep-rl-class/pulls)

This glossary was made possible thanks to:
24 changes: 11 additions & 13 deletions units/en/unit2/hands-on.mdx
@@ -2,7 +2,7 @@

<CourseFloatingBanner classNames="absolute z-10 right-0 top-0"
notebooks={[
{label: "Google Colab", value: "https://colab.research.google.com/github/huggingface/deep-rl-class/blob/master/notebooks/unit2/unit2.ipynb"}
{label: "Google Colab", value: "https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/notebooks/unit2/unit2.ipynb"}
]}
askForHelpUrl="http://hf.co/join/discord" />

@@ -93,16 +93,16 @@ Before diving into the notebook, you need to:

*Q-Learning* **is the RL algorithm that**:

- Trains *Q-Function*, an **action-value function** that encoded, in internal memory, by a *Q-table* **that contains all the state-action pair values.**
- Trains *Q-Function*, an **action-value function** that is encoded, in internal memory, by a *Q-table* **that contains all the state-action pair values.**

- Given a state and action, our Q-Function **will search the Q-table for the corresponding value.**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-function-2.jpg" alt="Q function" width="100%"/>

- When the training is done,**we have an optimal Q-Function, so an optimal Q-Table.**
- When the training is done, **we have an optimal Q-Function, so an optimal Q-Table.**

- And if we **have an optimal Q-function**, we
have an optimal policy, since we **know for, each state, the best action to take.**
have an optimal policy, since we **know, for each state, the best action to take.**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link value policy" width="100%"/>

@@ -146,7 +146,8 @@ pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/

```bash
sudo apt-get update
apt install python-opengl ffmpeg xvfb
sudo apt-get install -y python3-opengl
apt install ffmpeg xvfb
pip3 install pyvirtualdisplay
```
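
These packages let the headless Colab machine render frames. The notebook then typically starts a virtual display along these lines (a sketch; the exact cell may differ):

```python
# Start a virtual display so environments can be rendered without a physical screen
from pyvirtualdisplay import Display

virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()
```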

@@ -246,7 +247,7 @@ print("Observation Space", env.observation_space)
print("Sample observation", env.observation_space.sample()) # Get a random observation
```

We see with `Observation Space Shape Discrete(16)` that the observation is an integer representing the **agent’s current position as current_row * nrows + current_col (where both the row and col start at 0)**.
We see with `Observation Space Shape Discrete(16)` that the observation is an integer representing the **agent’s current position as current_row * ncols + current_col (where both the row and col start at 0)**.

For example, the goal position in the 4x4 map can be calculated as follows: 3 * 4 + 3 = 15. The number of possible observations is dependent on the size of the map. **For example, the 4x4 map has 16 possible observations.**
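
A quick way to check that formula (an illustrative helper, not part of the notebook):

```python
def to_state_index(row: int, col: int, ncols: int = 4) -> int:
    """Convert a (row, col) grid position into the Discrete observation index."""
    return row * ncols + col

print(to_state_index(3, 3))  # 15: the goal tile of the 4x4 map
```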

@@ -352,7 +353,7 @@ def greedy_policy(Qtable, state):
return action
```

##Define the epsilon-greedy policy 🤖
## Define the epsilon-greedy policy 🤖

Epsilon-greedy is the training policy that handles the exploration/exploitation trade-off.

@@ -388,9 +389,9 @@
```python
def epsilon_greedy_policy(Qtable, state, epsilon):
# Randomly generate a number between 0 and 1
random_int = random.uniform(0, 1)
# if random_int > greater than epsilon --> exploitation
if random_int > epsilon:
random_num = random.uniform(0, 1)
# if random_num is greater than epsilon --> exploitation
if random_num > epsilon:
# Take the action with the highest value given a state
# np.argmax can be useful here
action = greedy_policy(Qtable, state)
@@ -716,13 +717,10 @@ def push_to_hub(repo_id, model, env, video_fps=1, local_repo_path="hub"):

## Usage

```python

model = load_from_hub(repo_id="{repo_id}", filename="q-learning.pkl")

# Don't forget to check if you need to add additional attributes (is_slippery=False etc)
env = gym.make(model["env_id"])
```
"""

evaluate_agent(env, model["max_steps"], model["n_eval_episodes"], model["qtable"], model["eval_seed"])
26 changes: 17 additions & 9 deletions units/en/unit2/mc-vs-td.mdx
@@ -57,18 +57,26 @@ For instance, if we train a state-value function using Monte Carlo:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/MC-4p.jpg" alt="Monte Carlo"/>


- We have a list of state, action, rewards, next_state, **we need to calculate the return \\(G{t}\\)**
- \\(G_t = R_{t+1} + R_{t+2} + R_{t+3} ...\\)
- \\(G_t = R_{t+1} + R_{t+2} + R_{t+3}…\\) (for simplicity we don’t discount the rewards).
- \\(G_t = 1 + 0 + 0 + 0+ 0 + 0 + 1 + 1 + 0 + 0\\)
- \\(G_t= 3\\)
- We can now update \\(V(S_0)\\):

- We have a list of state, action, rewards, next_state, **we need to calculate the return \\(G_{t=0}\\)**

\\(G_t = R_{t+1} + R_{t+2} + R_{t+3} ...\\) (for simplicity, we don't discount the rewards)

\\(G_0 = R_{1} + R_{2} + R_{3}…\\)

\\(G_0 = 1 + 0 + 0 + 0 + 0 + 0 + 1 + 1 + 0 + 0\\)

\\(G_0 = 3\\)

- We can now compute the **new** \\(V(S_0)\\):

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/MC-5.jpg" alt="Monte Carlo"/>

- New \\(V(S_0) = V(S_0) + lr * [G_t — V(S_0)]\\)
- New \\(V(S_0) = 0 + 0.1 * [3 – 0]\\)
- New \\(V(S_0) = 0.3\\)
\\(V(S_0) = V(S_0) + lr * [G_0 - V(S_0)]\\)

\\(V(S_0) = 0 + 0.1 * [3 - 0]\\)

\\(V(S_0) = 0.3\\)


<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/MC-5p.jpg" alt="Monte Carlo"/>
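
The same calculation in a few lines of code (an illustrative check of the numbers above, with a learning rate of 0.1 and no discounting):

```python
# Rewards collected during the episode shown above
rewards = [1, 0, 0, 0, 0, 0, 1, 1, 0, 0]

G_0 = sum(rewards)               # undiscounted return from the first state: 3
lr = 0.1
V_s0 = 0.0                       # initial value estimate of S_0

V_s0 = V_s0 + lr * (G_0 - V_s0)  # Monte Carlo update
print(G_0, V_s0)                 # 3 0.3
```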