From fd0a7da6c3650ff9f1cf14d002f52926075a9fda Mon Sep 17 00:00:00 2001
From: Boris Yangel
Date: Thu, 13 Feb 2025 18:29:06 +0000
Subject: [PATCH 1/3] Create one-step-lookahead.md

---
 one-step-lookahead.md | 13 +++++++++++++
 1 file changed, 13 insertions(+)
 create mode 100644 one-step-lookahead.md

diff --git a/one-step-lookahead.md b/one-step-lookahead.md
new file mode 100644
index 0000000..37b71eb
--- /dev/null
+++ b/one-step-lookahead.md
@@ -0,0 +1,13 @@
+# Some Interesting Properties of the 1-step K-sample Lookahead Operator
+
+The 1-step K-sample lookahead operator (which we will simply call the lookahead operator from now on) is a sample-based policy improvement operator in reinforcement learning. It works by first sampling $K$ action candidates from the base policy and then proceeding with the candidate that has the largest q-value:
+
+$$L_{K, \pi}(a \mid s) = E_{a_1, \ldots, a_K \sim \pi(a \mid s)}\Bbb{1}\left[a = \arg \max_{a' \in \\{a_1, \ldots, a_K\\}} Q(s, a')\right].$$
+
+It can be thought of as an imperfect approximation of the optimal max-q operator. It is very useful when dealing with large action spaces, where the exact maximum of the q-function over all actions cannot be computed. In this note, we will list various properties of this operator that might be useful when using it in practice.
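The sampled operator is easy to sketch in code. Below is a toy Python sketch (not part of the patch itself); `lookahead_sample`, the uniform base policy, and the index-valued q-function are all hypothetical names chosen purely for illustration.

```python
import random

def lookahead_sample(state, base_policy, q_fn, k, rng):
    """Draw one action from the 1-step k-sample lookahead policy L_{k, pi}:
    sample k candidate actions from the base policy, keep the best under q."""
    candidates = [base_policy(state, rng) for _ in range(k)]
    return max(candidates, key=lambda a: q_fn(state, a))

# Toy setup: 10 discrete actions, a uniform base policy,
# and a q-function that prefers larger action indices.
rng = random.Random(0)
uniform = lambda state, r: r.randrange(10)
q = lambda state, a: float(a)
action = lookahead_sample("s0", uniform, q, k=4, rng=rng)
```

Averaged over many draws, the chosen action's q-value sits well above the uniform policy's average, which is exactly the policy improvement property discussed next.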
+
+### Lookahead operator is indeed a policy improvement operator
+
+This is trivial to show using the policy improvement theorem:
+
+$$E_{a \sim L_{K, \pi}(a \mid s)} Q(s, a} = E_{a_1, \ldots, a_K \sim \pi(a \mid s)}\Bbb{1}\left[a = \arg \max_{a' \in \\{a_1, \ldots, a_K\\}} Q(s, a')\right] Q(s, a)$$

From 464e145a71ec6935fc4fc99a22c93ae23753d0f5 Mon Sep 17 00:00:00 2001
From: Boris Yangel
Date: Thu, 13 Feb 2025 19:13:32 +0000
Subject: [PATCH 2/3] Update one-step-lookahead.md

WIP
---
 one-step-lookahead.md | 35 +++++++++++++++++++++++++++++++++--
 1 file changed, 33 insertions(+), 2 deletions(-)

diff --git a/one-step-lookahead.md b/one-step-lookahead.md
index 37b71eb..138ac65 100644
--- a/one-step-lookahead.md
+++ b/one-step-lookahead.md
@@ -8,6 +8,37 @@ It can be thought of as an imperfect approximation of the optimal max-q operator
 
 ### Lookahead operator is indeed a policy improvement operator
 
-This is trivial to show using the policy improvement theorem:
+Follows from the policy improvement theorem:
+
+$$E_{a \sim L_{K, \pi}(a \mid s)} Q(s, a) = E_{a_1, \ldots, a_K \sim \pi(a \mid s)} \max_{a' \in \\{a_1, \ldots, a_K\\}} Q(s, a') \geq E_{a \sim \pi(a \mid s)} Q(s, a) = V_{\pi}(s).$$
+
+
+### Repeated application of lookahead increases the effective number of candidates ###
+
+Suppose we define a new policy
+
+$$\pi^{\ast}(a \mid s) = L_{K, \pi}(a \mid s).$$
+
+What is then $L_{K, \pi^{\ast}}(a \mid s)$, i.e. the result of applying the lookahead operator twice? One way to think about it is this: one invocation of $\pi^{\ast}$ is effectively samping $K$ action candidates and returning the best of them. Lookahead operator $L_{K, \pi^{\ast}}(a \mid s)$ invokes $\pi^{\ast}$ $K$ times, so it effectively samples $K$ groups of $K$ candidates each, selects the best candidate in each group and then selects the best action across all $K$ groups. Since the same q-function is used to select both inside groups and across groups, this procedure is equivalent to simply doing lookahead with $K^2$ candidates.
+
+Applying another lookahead on top of this policy will be equivalent to doing a single lookahead with $K^3$ candidates and so on. In other words, reaching the effective power of max-q improvement operator in an action space with $N$ actions requires just $O(log N)$ lookahead compositions, which is a very reasonable number even for large action spaces.
+
+
+### Lookahead-based Reinforcement Learning
+
+The above property suggests a straightforward RL algorithm:
+
+1. Evaluate the current policy $\pi$ to get $q_{\pi}(s, a)$, e.g. by sampling some trajectories from $\pi$ and then training a model to predict return-to-go.
+2. Distill $L_{K, \pi}(a \mid s)$ into a new policy $\pi^{\ast}$, e.g. by sampling some trajectories from $L$ and then applying supervised training on this data.
+3. Set $\pi$ to $\pi^{\ast}$ and go to the first step unless done.
+
+This algorithm
+* Converges to the optimal policy in MDPs and does it fast even in large action spaces.
+* Does not require to explicitly maximize the Q-function over actions, as this is covered by the repeated application of lookahead.
+* Is very robust, as it only uses supervised learning as a learning subroutine. No sketchy non-stationary loss functions!
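The three-step loop above can be sketched end-to-end on a single-state (bandit) toy problem. This is only an illustrative sketch with hypothetical names (`true_reward`, `sample`): the policy evaluation step degenerates to reading off the per-action reward, and distillation is done by fitting empirical action frequencies.

```python
import random
from collections import Counter

N, K = 20, 4
true_reward = lambda a: a / N        # toy reward; in a one-state bandit q_pi(s, a) equals it
rng = random.Random(0)

def sample(policy_probs, r):
    """Draw one action index from a categorical policy."""
    return r.choices(range(N), weights=policy_probs)[0]

policy = [1.0 / N] * N               # start from the uniform policy
for _ in range(3):                   # a few improvement rounds
    # 1. Evaluate pi: here the Monte-Carlo return estimate collapses to the reward itself.
    q = {a: true_reward(a) for a in range(N)}
    # 2. Distill L_{K, pi}: sample best-of-K actions and fit the next policy
    #    to their empirical frequencies (the supervised-learning step).
    data = [max((sample(policy, rng) for _ in range(K)), key=q.get) for _ in range(5000)]
    counts = Counter(data)
    # 3. Replace pi with the distilled policy and repeat.
    policy = [counts[a] / len(data) for a in range(N)]
```

After a few rounds the policy mass concentrates on the highest-reward actions, mirroring the $K, K^2, K^3, \ldots$ effective-candidate growth described above.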
+
+#### Repeated Policy Evaluation is Not Necessary
+
+Perhaps the most surprising property of the algorithm is that instead of re-evaluating the policy
+
 
-$$E_{a \sim L_{K, \pi}(a \mid s)} Q(s, a} = E_{a_1, \ldots, a_K \sim \pi(a \mid s)}\Bbb{1}\left[a = \arg \max_{a' \in \\{a_1, \ldots, a_K\\}} Q(s, a')\right] Q(s, a)$$

From d85eea292448445967bbc228ea48276fd09502fa Mon Sep 17 00:00:00 2001
From: Boris Yangel
Date: Thu, 13 Feb 2025 19:32:16 +0000
Subject: [PATCH 3/3] Update one-step-lookahead.md

WIP
---
 one-step-lookahead.md | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/one-step-lookahead.md b/one-step-lookahead.md
index 138ac65..4baa8a6 100644
--- a/one-step-lookahead.md
+++ b/one-step-lookahead.md
@@ -19,26 +19,28 @@ Suppose we define a new policy
 
 $$\pi^{\ast}(a \mid s) = L_{K, \pi}(a \mid s).$$
 
-What is then $L_{K, \pi^{\ast}}(a \mid s)$, i.e. the result of applying the lookahead operator twice? One way to think about it is this: one invocation of $\pi^{\ast}$ is effectively samping $K$ action candidates and returning the best of them. Lookahead operator $L_{K, \pi^{\ast}}(a \mid s)$ invokes $\pi^{\ast}$ $K$ times, so it effectively samples $K$ groups of $K$ candidates each, selects the best candidate in each group and then selects the best action across all $K$ groups. Since the same q-function is used to select both inside groups and across groups, this procedure is equivalent to simply doing lookahead with $K^2$ candidates.
+What is then $L_{K, \pi^{\ast}}(a \mid s)$, i.e. the result of chaining two applications of the lookahead operator? Here is one way to think about it: one invocation of $\pi^{\ast}$ is effectively sampling $K$ action candidates and returning the best of them. Lookahead operator $L_{K, \pi^{\ast}}(a \mid s)$ invokes $\pi^{\ast}$ $K$ times, so it effectively samples $K$ groups of $K$ candidates each, selects the best candidate in each group and then selects the best action across all $K$ groups. Since the same q-function is used to select both inside groups and across groups, this procedure is equivalent to simply doing lookahead with $K^2$ candidates.
 
-Applying another lookahead on top of this policy will be equivalent to doing a single lookahead with $K^3$ candidates and so on. In other words, reaching the effective power of max-q improvement operator in an action space with $N$ actions requires just $O(log N)$ lookahead compositions, which is a very reasonable number even for large action spaces.
+Stacking another lookahead on top will be equivalent to doing a single lookahead with $K^3$ candidates and so on. In other words, reaching the effective power of the max-q improvement operator in an action space with $N$ actions requires just $O(\log_K N)$ lookahead compositions, which is a very reasonable number even for very large action spaces.
 
 
 ### Lookahead-based Reinforcement Learning
 
 The above property suggests a straightforward RL algorithm:
 
-1. Evaluate the current policy $\pi$ to get $q_{\pi}(s, a)$, e.g. by sampling some trajectories from $\pi$ and then training a model to predict return-to-go.
-2. Distill $L_{K, \pi}(a \mid s)$ into a new policy $\pi^{\ast}$, e.g. by sampling some trajectories from $L$ and then applying supervised training on this data.
+1. Evaluate the current policy $\pi$ to get $q_{\pi}(s, a)$,
+   * e.g. by sampling some trajectories from $\pi$ and then training a model to predict return-to-go.
+2. Distill $L_{K, \pi}(a \mid s)$ into a new policy $\pi^{\ast}$,
+   * e.g. by sampling some trajectories from $L$ and then applying supervised training on this data.
 3. Set $\pi$ to $\pi^{\ast}$ and go to the first step unless done.
 
 This algorithm
-* Converges to the optimal policy in MDPs and does it fast even in large action spaces.
-* Does not require to explicitly maximize the Q-function over actions, as this is covered by the repeated application of lookahead.
+* Converges to the optimal policy in MDPs and does it fast even in large action spaces (assuming that Q can be approximated well).
+* Does not require explicitly maximizing the Q-function over all actions, as this is covered by the repeated application of lookahead.
 * Is very robust, as it only uses supervised learning as a learning subroutine. No sketchy non-stationary loss functions!
 
 #### Repeated Policy Evaluation is Not Necessary
 
-Perhaps the most surprising property of the algorithm is that instead of re-evaluating the policy
-
+Perhaps the most surprising property of the above algorithm is that re-evaluating the policy each time it is updated is not necessary: while $Q_{\pi}$ is not equal to $Q_{\pi^{\ast}}$ (which is expected: $\pi^{\ast}$ is an improvement over $\pi$), these q-functions rank actions identically, due to the constraints imposed on $\pi^{\ast}$ by the fact that it is not an arbitrary policy, but a lookahead policy built from $\pi$ and $Q_{\pi}$. In practice, however, it would still make sense to re-train $Q$ occasionally, because as $\pi$ evolves, the optimization emphasis shifts to regions of the state space where the original approximation of $Q$ was likely poor.
+Let us now prove that $Q_{\pi}$ and $Q_{\pi^{\ast}}$ rank actions identically.
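For intuition on why the q-estimate can be reused, here is a toy Python sketch (hypothetical names; a single abstract state): the current policy is repeatedly wrapped in a lookahead that scores candidates with one fixed q-function, and each wrapping behaves like squaring the effective number of candidates.

```python
import random

def make_lookahead(policy, q, k):
    """Wrap `policy` in a 1-step k-sample lookahead reusing a FIXED q-estimate."""
    def wrapped(rng):
        return max((policy(rng) for _ in range(k)), key=q)
    return wrapped

rng = random.Random(2)
N = 100
q = lambda a: float(a)               # fixed q-estimate: larger index is better
policy = lambda r: r.randrange(N)    # uniform base policy over N actions

def mean_q(pol, n=3000):
    """Monte-Carlo estimate of the expected q-value of actions chosen by `pol`."""
    return sum(q(pol(rng)) for _ in range(n)) / n

values = [mean_q(policy)]
for _ in range(3):                   # effective candidate counts: 4, 16, 64
    policy = make_lookahead(policy, q, k=4)
    values.append(mean_q(policy))
```

`values` climbs toward the maximal q-value even though the q-estimate is never refreshed; in a real MDP, occasional re-training of $Q$ would only matter once the approximation error under the shifted state distribution starts to dominate.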