diff --git a/docs/course_notes/policy-based.md b/docs/course_notes/policy-based.md
index 9a59f96a..2b4d9111 100644
--- a/docs/course_notes/policy-based.md
+++ b/docs/course_notes/policy-based.md
@@ -1 +1,251 @@
-# Policy-Based Methods

# Introduction

Reinforcement Learning (RL) focuses on training an agent to interact with an environment by learning a policy $\pi_{\theta}(a | s)$ that maximizes the cumulative reward. Policy gradient methods are a class of algorithms that directly optimize the policy by adjusting the parameters $\theta$ via gradient ascent.

## Why Policy Gradient Methods?

Unlike value-based methods (e.g., Q-learning), which rely on estimating value functions, policy gradient methods:

- Can naturally handle stochastic policies, which are crucial in environments requiring exploration.

- Work well in continuous action spaces, where discrete-action methods become infeasible.

- Can directly optimize differentiable policy representations, such as neural networks.

# Deriving the Policy Gradient Theorem

The Policy Gradient Theorem is a fundamental result in RL that expresses the gradient of the expected return $J(\theta)$ in terms of the policy function.

## Expected Return and Gradient

The objective in RL is to maximize the expected return:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \sum_{t=0}^{T} \gamma^t R_t \right],$$

where $\tau = (s_0, a_0, s_1, a_1, \dots)$ represents a trajectory sampled from the policy.

The gradient of $J(\theta)$ is:

$$\nabla_{\theta} J(\theta) = \nabla_{\theta} \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \sum_{t=0}^{T} \gamma^t R_t \right].$$

## Likelihood Ratio Trick

Since the expectation is taken over trajectories sampled from $\pi_{\theta}$, we apply the likelihood ratio trick:

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \sum_{t=0}^{T} \gamma^t R_t \, \nabla_{\theta} \log P(\tau) \right].$$

Using the probability of a trajectory:

$$P(\tau) = P(s_0) \prod_{t=0}^{T} \pi_{\theta}(a_t | s_t) P(s_{t+1} | s_t, a_t),$$

the log-derivative simplifies, because the initial-state distribution and the transition probabilities do not depend on $\theta$:

$$\nabla_{\theta} \log P(\tau) = \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t | s_t).$$

Since an action $a_t$ cannot influence rewards received before time $t$ (causality), the terms pairing $\nabla_{\theta} \log \pi_{\theta}(a_t | s_t)$ with earlier rewards vanish in expectation, and the policy gradient reduces to:

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}} \left[ \sum_{t=0}^{T} G_t \, \nabla_{\theta} \log \pi_{\theta}(a_t | s_t) \right],$$

where:

$$G_t = \sum_{k=t}^{T} \gamma^{k-t} R_k.$$

This is the policy gradient theorem, which forms the basis for REINFORCE.

# Continuous Action Spaces

For continuous action spaces, we typically use a Gaussian distribution:

$$\pi_{\theta}(a | s) = \mathcal{N}(\mu_{\theta}(s), \sigma_{\theta}^2).$$

The log-likelihood of the Gaussian policy is:

$$\log \pi_{\theta}(a | s) = -\frac{(a - \mu_{\theta}(s))^2}{2\sigma_{\theta}^2} - \log (\sqrt{2\pi} \sigma_{\theta}).$$

Thus, when only the mean is parameterized (fixed $\sigma$), the policy gradient update is:

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}} \left[ \left( \frac{a - \mu_{\theta}(s)}{\sigma_{\theta}^2} \right) \nabla_{\theta} \mu_{\theta}(s) \, G_t \right].$$
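For concreteness, here is a minimal NumPy sketch of this score function for a policy with a linear mean $\mu_{\theta}(s) = \theta^{\top} s$ and a fixed $\sigma$; the linear parameterization and all numeric values are illustrative assumptions, not part of the notes.

```python
import numpy as np

def gaussian_score(theta, s, a, sigma):
    """Score function grad_theta log pi(a|s) for a Gaussian policy
    with linear mean mu_theta(s) = theta . s and fixed sigma."""
    mu = theta @ s
    # d/dtheta log N(a; mu, sigma^2) = ((a - mu) / sigma^2) * dmu/dtheta,
    # and dmu/dtheta = s for a linear mean.
    return ((a - mu) / sigma**2) * s

# REINFORCE-style update for a single (s, a, G_t) sample.
theta = np.zeros(4)
s = np.array([0.1, -0.2, 0.3, 0.05])   # example state features
a, G_t, alpha, sigma = 0.7, 1.5, 0.01, 0.5
theta = theta + alpha * G_t * gaussian_score(theta, s, a, sigma)
```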
# The REINFORCE Algorithm

## Algorithm Overview

The REINFORCE algorithm is a Monte Carlo policy gradient method that uses complete episode returns to estimate the policy gradient.

**Steps of REINFORCE:**

1. **Initialize** policy parameters $\theta$.

2. **Collect an episode**: Run the policy $\pi_{\theta}$ and store $(s_t, a_t, r_t)$ for all time steps $t$.

3. **Compute returns**: For each time step, compute:

    $$G_t = \sum_{k=t}^{T} \gamma^{k-t} R_k.$$

4. **Policy Update**: Update the parameters:

    $$\theta \leftarrow \theta + \alpha \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t | s_t) \, G_t.$$

5. **Repeat** for multiple episodes.

## Challenges and Variance Reduction

**Baseline Subtraction:** Subtracting a baseline $b(s_t)$ reduces variance without changing the expected gradient:

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}} \left[ \nabla_{\theta} \log \pi_{\theta}(a_t | s_t) \, (G_t - b(s_t)) \right].$$

A common choice is the state-value function, which yields the advantage:

$$b(s_t) = V^{\pi}(s_t), \quad A(s_t, a_t) = G_t - V^{\pi}(s_t).$$

A code sketch combining REINFORCE with a simple baseline follows the discussion of bias and variance below.

## Entropy Regularization

To encourage exploration, we introduce entropy regularization:

$$J_{\text{entropy}}(\theta) = J(\theta) + \beta H(\pi_{\theta}),$$

where:

$$H(\pi_{\theta}) = - \sum_{a} \pi_{\theta}(a | s) \log \pi_{\theta}(a | s).$$

## Natural Policy Gradient

Instead of using vanilla gradient ascent, we can use the natural gradient:

$$\nabla_{\theta}^{\text{nat}} J(\theta) = F^{-1} \nabla_{\theta} J(\theta),$$

where $F$ is the Fisher Information Matrix of the policy.

# Bias in Policy Gradient Methods

Bias in RL occurs when an estimator systematically deviates from the true value. In policy gradient methods, bias arises from function approximation, reward estimation, or gradient computation.

## Sources of Bias

- **Function Approximation Bias:** When neural networks or linear functions approximate the policy or value function, they may introduce systematic errors. For instance, underestimating a value function can lead to suboptimal policy updates.

- **Reward Clipping or Discounting:** Some algorithms clip rewards or discount them with $\gamma < 1$, which biases estimates of long-term returns toward short-term outcomes.

- **Baseline Approximation:** Using an estimated baseline (e.g., $V^{\pi}(s)$) for variance reduction can introduce bias if the baseline is poorly estimated.

## Example of Bias

Consider a self-driving car learning to optimize fuel efficiency. If the reward function overemphasizes immediate fuel consumption rather than long-term efficiency, the learned policy may prioritize short-term gains while missing globally optimal strategies, leading to biased learning.

# Variance in Policy Gradient Methods

Variance in policy gradient estimates refers to the fluctuation of gradient estimates across training episodes. High variance can lead to instability and slow convergence.

## Sources of Variance

- **Monte Carlo Estimation:** The REINFORCE algorithm computes gradients from entire episodes, so random sampling of trajectories produces high variance.

- **Stochastic Policy Outputs:** Policies represented as probability distributions (e.g., Gaussian policies) introduce randomness into gradient updates.

- **Exploration Strategies:** Random action selection, such as softmax or epsilon-greedy exploration, increases the variability of learning updates.

## Example of Variance

Consider a robotic arm learning to pick up objects. Due to high variance, in some training episodes it may happen to grasp the object correctly, while in others it fails because of slight variations in the initial positioning. These fluctuations in the learning updates slow down convergence.
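To connect the pieces above, here is a minimal sketch of REINFORCE with a simple baseline for a discrete action space and a linear softmax policy. It is not from the original notes: it assumes a Gym-style environment whose `reset()` returns a feature vector and whose `step(action)` returns `(state, reward, done, info)`, and all hyperparameters are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_episode(env, theta, gamma=0.99, alpha=1e-2):
    """One episode of REINFORCE with a simple baseline.

    Assumes a Gym-style interface: env.reset() -> state features (1-D array),
    env.step(a) -> (state, reward, done, info). theta has shape
    (n_actions, n_features) and defines a linear softmax policy.
    """
    states, actions, rewards = [], [], []
    s, done = env.reset(), False
    while not done:
        probs = softmax(theta @ s)                 # pi_theta(. | s)
        a = np.random.choice(len(probs), p=probs)  # sample an action
        s_next, r, done, _ = env.step(a)
        states.append(s)
        actions.append(a)
        rewards.append(r)
        s = s_next

    # Reward-to-go returns G_t, computed backwards through the episode.
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns = np.array(returns[::-1])

    # Simple baseline: the episode's mean return (a learned V(s) is common).
    baseline = returns.mean()
    for s_t, a_t, G_t in zip(states, actions, returns):
        probs = softmax(theta @ s_t)
        # grad_theta log softmax = outer(one_hot(a_t) - probs, s_t)
        grad_log_pi = np.outer(np.eye(len(probs))[a_t] - probs, s_t)
        theta = theta + alpha * (G_t - baseline) * grad_log_pi
    return theta
```

Subtracting the episode's mean return is the simplest baseline; replacing it with a learned $V^{\pi}(s_t)$ gives the advantage form discussed above.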
# Monte Carlo Estimators in Reinforcement Learning

A Monte Carlo estimator approximates the expected value of a function $f(X)$ of a random variable $X$ with probability distribution $p(X)$. The true expectation is:

$$\mathbb{E}[f(X)] = \int f(x) p(x) \, dx.$$

However, directly computing this integral may be intractable. Instead, we use Monte Carlo estimation by drawing $N$ independent samples $X_1, X_2, \dots, X_N$ from $p(X)$ and computing:

$$\hat{\mu}_{MC} = \frac{1}{N} \sum_{i=1}^{N} f(X_i).$$

This estimator provides an approximation to the true expectation $\mathbb{E}[f(X)]$.

By the law of large numbers (LLN), as $N \to \infty$, we have:

$$\hat{\mu}_{MC} \to \mathbb{E}[f(X)] \quad \text{(almost surely)}.$$

Monte Carlo methods are commonly used in RL for estimating expected rewards, state-value functions, and action-value functions.

# Biased vs. Unbiased Estimation

The biased formula for the sample variance $S^2$ is given by:

$$S^2_{\text{biased}} = \frac{1}{n} \sum_{i=1}^{n} (X_i - \overline{X})^2.$$

This underestimates the true population variance $\sigma^2$ because it ignores the degree of freedom used to estimate $\overline{X}$; in fact, $\mathbb{E}[S^2_{\text{biased}}] = \frac{n-1}{n} \sigma^2$. The unbiased estimator is instead:

$$S^2_{\text{unbiased}} = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \overline{X})^2.$$

This unbiased estimator correctly accounts for the lost degree of freedom, ensuring $\mathbb{E}[S^2_{\text{unbiased}}] = \sigma^2$ even for small sample sizes.

# Balancing Bias and Variance

Reducing bias often increases variance, and vice versa. The goal is to find a balance between the two.

## Strategies for Bias Reduction

- Using more expressive function approximators (e.g., deeper neural networks).

- Improving reward estimation techniques (e.g., using learned value functions).

## Strategies for Variance Reduction

- **Baseline Subtraction:** Introducing a baseline function $b(s_t)$ reduces variance without adding bias.

- **Reward-to-Go:** Using the reward-to-go estimator instead of the full episode return reduces variance (see the sketch below).

- **Actor-Critic Methods:** Combining value function estimation with policy updates stabilizes learning.
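As a small illustration of the reward-to-go idea, the sketch below (illustrative values, not from the notes) contrasts the single full-episode return with the per-step reward-to-go weights $G_t$:

```python
import numpy as np

def reward_to_go(rewards, gamma=0.99):
    """Per-step discounted returns G_t = sum_{k >= t} gamma^(k-t) * r_k."""
    G, out = 0.0, np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        out[t] = G
    return out

rewards = [1.0, 0.0, 2.0, 1.0]   # one toy episode
G = reward_to_go(rewards)
full_return = G[0]               # total discounted return G_0
print(G)             # per-step weights G_t (reward-to-go)
print(full_return)   # using this single value for every step raises variance
```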