From fd0a7da6c3650ff9f1cf14d002f52926075a9fda Mon Sep 17 00:00:00 2001
From: Boris Yangel
Date: Thu, 13 Feb 2025 18:29:06 +0000
Subject: [PATCH 1/3] Create one-step-lookahead.md

---
 one-step-lookahead.md | 13 +++++++++++++
 1 file changed, 13 insertions(+)
 create mode 100644 one-step-lookahead.md

diff --git a/one-step-lookahead.md b/one-step-lookahead.md
new file mode 100644
index 0000000..37b71eb
--- /dev/null
+++ b/one-step-lookahead.md
@@ -0,0 +1,13 @@
+# Some Interesting Properties of the 1-step K-sample Lookahead Operator
+
+The 1-step K-sample lookahead operator (which we will simply call the lookahead operator from now on) is a sample-based policy improvement operator in reinforcement learning. It works by first sampling $K$ action candidates from the base policy and then proceeding with the candidate that has the largest q-value:
+
+$$L_{K, \pi}(a \mid s) = E_{a_1, \ldots, a_K \sim \pi(a \mid s)}\Bbb{1}\left[a = \arg \max_{a' \in \\{a_1, \ldots, a_K\\}} Q(s, a')\right].$$
+
+It can be thought of as an imperfect approximation of the optimal max-q operator. It is very useful when dealing with large action spaces, where the exact maximum of the q-function over all actions cannot be computed. In this note, we will list various properties of this operator that might be useful when using it in practice.
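The sampled operator is easy to sketch in code. Below is a toy Python sketch (not part of the patch itself); `lookahead_sample`, the uniform base policy, and the index-valued q-function are all hypothetical names chosen purely for illustration.

```python
import random

def lookahead_sample(state, base_policy, q_fn, k, rng):
    """Draw one action from the 1-step k-sample lookahead policy L_{k, pi}:
    sample k candidate actions from the base policy, keep the best under q."""
    candidates = [base_policy(state, rng) for _ in range(k)]
    return max(candidates, key=lambda a: q_fn(state, a))

# Toy setup: 10 discrete actions, a uniform base policy,
# and a q-function that prefers larger action indices.
rng = random.Random(0)
uniform = lambda state, r: r.randrange(10)
q = lambda state, a: float(a)
action = lookahead_sample("s0", uniform, q, k=4, rng=rng)
```

Averaged over many draws, the chosen action's q-value sits well above the uniform policy's average, which is exactly the policy improvement property discussed next.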
+
+### Lookahead operator is indeed a policy improvement operator
+
+This is trivial to show using the policy improvement theorem:
+
+$$E_{a \sim L_{K, \pi}(a \mid s)} Q(s, a} = E_{a_1, \ldots, a_K \sim \pi(a \mid s)}\Bbb{1}\left[a = \arg \max_{a' \in \\{a_1, \ldots, a_K\\}} Q(s, a')\right] Q(s, a)$$

From 464e145a71ec6935fc4fc99a22c93ae23753d0f5 Mon Sep 17 00:00:00 2001
From: Boris Yangel
Date: Thu, 13 Feb 2025 19:13:32 +0000
Subject: [PATCH 2/3] Update one-step-lookahead.md

WIP
---
 one-step-lookahead.md | 35 +++++++++++++++++++++++++++++++++--
 1 file changed, 33 insertions(+), 2 deletions(-)

diff --git a/one-step-lookahead.md b/one-step-lookahead.md
index 37b71eb..138ac65 100644
--- a/one-step-lookahead.md
+++ b/one-step-lookahead.md
@@ -8,6 +8,37 @@ It can be thought of as an imperfect approximation of the optimal max-q operator
 
 ### Lookahead operator is indeed a policy improvement operator
 
-This is trivial to show using the policy improvement theorem:
+Follows from the policy improvement theorem:
+
+$$E_{a \sim L_{K, \pi}(a \mid s)} Q(s, a) = E_{a_1, \ldots, a_K \sim \pi(a \mid s)} \max_{a' \in \\{a_1, \ldots, a_K\\}} Q(s, a') \geq E_{a \sim \pi(a \mid s)} Q(s, a) = V_{\pi}(s).$$
+
+
+### Repeated application of lookahead increases the effective number of candidates ###
+
+Suppose we define a new policy
+
+$$\pi^{\ast}(a \mid s) = L_{K, \pi}(a \mid s).$$
+
+What is then $L_{K, \pi^{\ast}}(a \mid s)$, i.e. the result of applying the lookahead operator twice? One way to think about it is this: one invocation of $\pi^{\ast}$ is effectively samping $K$ action candidates and returning the best of them. Lookahead operator $L_{K, \pi^{\ast}}(a \mid s)$ invokes $\pi^{\ast}$ $K$ times, so it effectively samples $K$ groups of $K$ candidates each, selects the best candidate in each group and then selects the best action across all $K$ groups. Since the same q-function is used to select both inside groups and across groups, this procedure is equivalent to simply doing lookahead with $K^2$ candidates.
+
+Applying another lookahead on top of this policy will be equivalent to doing a single lookahead with $K^3$ candidates and so on. In other words, reaching the effective power of max-q improvement operator in an action space with $N$ actions requires just $O(log N)$ lookahead compositions, which is a very reasonable number even for large action spaces.
+
+
+### Lookahead-based Reinforcement Learning
+
+The above property suggests a straightforward RL algorithm:
+
+1. Evaluate the current policy $\pi$ to get $q_{\pi}(s, a)$, e.g. by sampling some trajectories from $\pi$ and then training a model to predict return-to-go.
+2. Distill $L_{K, \pi}(a \mid s)$ into a new policy $\pi^{\ast}$, e.g. by sampling some trajectories from $L$ and then applying supervised training on this data.
+3. Set $\pi$ to $\pi^{\ast}$ and go to the first step unless done.
+
+This algorithm
+* Converges to the optimal policy in MDPs and does it fast even in large action spaces.
+* Does not require to explicitly maximize the Q-function over actions, as this is covered by the repeated application of lookahead.
+* Is very robust, as it only uses supervised learning as a learning subroutine. No sketchy non-stationary loss functions!
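The three-step loop above can be sketched end-to-end on a single-state (bandit) toy problem. This is only an illustrative sketch with hypothetical names (`true_reward`, `sample`): the policy evaluation step degenerates to reading off the per-action reward, and distillation is done by fitting empirical action frequencies.

```python
import random
from collections import Counter

N, K = 20, 4
true_reward = lambda a: a / N        # toy reward; in a one-state bandit q_pi(s, a) equals it
rng = random.Random(0)

def sample(policy_probs, r):
    """Draw one action index from a categorical policy."""
    return r.choices(range(N), weights=policy_probs)[0]

policy = [1.0 / N] * N               # start from the uniform policy
for _ in range(3):                   # a few improvement rounds
    # 1. Evaluate pi: here the Monte-Carlo return estimate collapses to the reward itself.
    q = {a: true_reward(a) for a in range(N)}
    # 2. Distill L_{K, pi}: sample best-of-K actions and fit the next policy
    #    to their empirical frequencies (the supervised-learning step).
    data = [max((sample(policy, rng) for _ in range(K)), key=q.get) for _ in range(5000)]
    counts = Counter(data)
    # 3. Replace pi with the distilled policy and repeat.
    policy = [counts[a] / len(data) for a in range(N)]
```

After a few rounds the policy mass concentrates on the highest-reward actions, mirroring the $K, K^2, K^3, \ldots$ effective-candidate growth described above.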
+
+#### Repeated Policy Evaluation is Not Necessary
+
+Perhaps the most surprising property of the algorithm is that instead of re-evaluating the policy
+
 
-$$E_{a \sim L_{K, \pi}(a \mid s)} Q(s, a} = E_{a_1, \ldots, a_K \sim \pi(a \mid s)}\Bbb{1}\left[a = \arg \max_{a' \in \\{a_1, \ldots, a_K\\}} Q(s, a')\right] Q(s, a)$$

From d85eea292448445967bbc228ea48276fd09502fa Mon Sep 17 00:00:00 2001
From: Boris Yangel
Date: Thu, 13 Feb 2025 19:32:16 +0000
Subject: [PATCH 3/3] Update one-step-lookahead.md

WIP
---
 one-step-lookahead.md | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/one-step-lookahead.md b/one-step-lookahead.md
index 138ac65..4baa8a6 100644
--- a/one-step-lookahead.md
+++ b/one-step-lookahead.md
@@ -19,26 +19,28 @@ Suppose we define a new policy
 
 $$\pi^{\ast}(a \mid s) = L_{K, \pi}(a \mid s).$$
 
-What is then $L_{K, \pi^{\ast}}(a \mid s)$, i.e. the result of applying the lookahead operator twice? One way to think about it is this: one invocation of $\pi^{\ast}$ is effectively samping $K$ action candidates and returning the best of them. Lookahead operator $L_{K, \pi^{\ast}}(a \mid s)$ invokes $\pi^{\ast}$ $K$ times, so it effectively samples $K$ groups of $K$ candidates each, selects the best candidate in each group and then selects the best action across all $K$ groups. Since the same q-function is used to select both inside groups and across groups, this procedure is equivalent to simply doing lookahead with $K^2$ candidates.
+What is then $L_{K, \pi^{\ast}}(a \mid s)$, i.e. the result of chaining two applications of the lookahead operator? Here is one way to think about it: one invocation of $\pi^{\ast}$ is effectively sampling $K$ action candidates and returning the best of them. Lookahead operator $L_{K, \pi^{\ast}}(a \mid s)$ invokes $\pi^{\ast}$ $K$ times, so it effectively samples $K$ groups of $K$ candidates each, selects the best candidate in each group and then selects the best action across all $K$ groups. Since the same q-function is used to select both inside groups and across groups, this procedure is equivalent to simply doing lookahead with $K^2$ candidates.
 
-Applying another lookahead on top of this policy will be equivalent to doing a single lookahead with $K^3$ candidates and so on. In other words, reaching the effective power of max-q improvement operator in an action space with $N$ actions requires just $O(log N)$ lookahead compositions, which is a very reasonable number even for large action spaces.
+Stacking another lookahead on top will be equivalent to doing a single lookahead with $K^3$ candidates and so on. In other words, reaching the effective power of the max-q improvement operator in an action space with $N$ actions requires just $O(\log_K N)$ lookahead compositions, which is a very reasonable number even for very large action spaces.
 
 
 ### Lookahead-based Reinforcement Learning
 
 The above property suggests a straightforward RL algorithm:
 
-1. Evaluate the current policy $\pi$ to get $q_{\pi}(s, a)$, e.g. by sampling some trajectories from $\pi$ and then training a model to predict return-to-go.
-2. Distill $L_{K, \pi}(a \mid s)$ into a new policy $\pi^{\ast}$, e.g. by sampling some trajectories from $L$ and then applying supervised training on this data.
+1. Evaluate the current policy $\pi$ to get $q_{\pi}(s, a)$,
+   * e.g. by sampling some trajectories from $\pi$ and then training a model to predict return-to-go.
+2. Distill $L_{K, \pi}(a \mid s)$ into a new policy $\pi^{\ast}$,
+   * e.g. by sampling some trajectories from $L$ and then applying supervised training on this data.
 3. Set $\pi$ to $\pi^{\ast}$ and go to the first step unless done.
 
 This algorithm
-* Converges to the optimal policy in MDPs and does it fast even in large action spaces.
-* Does not require to explicitly maximize the Q-function over actions, as this is covered by the repeated application of lookahead.
+* Converges to the optimal policy in MDPs and does it fast even in large action spaces (assuming that Q can be approximated well).
+* Does not require explicitly maximizing the Q-function over all actions, as this is covered by the repeated application of lookahead.
 * Is very robust, as it only uses supervised learning as a learning subroutine. No sketchy non-stationary loss functions!
 
 #### Repeated Policy Evaluation is Not Necessary
 
-Perhaps the most surprising property of the algorithm is that instead of re-evaluating the policy
-
+Perhaps the most surprising property of the above algorithm is that re-evaluating the policy each time it is updated is not necessary: while $Q_{\pi}$ is not equal to $Q_{\pi^{\ast}}$ (which is expected: $\pi^{\ast}$ is an improvement over $\pi$), these q-functions rank actions identically, due to the constraints imposed on $\pi^{\ast}$ by the fact that it is not an arbitrary policy, but a lookahead policy built from $\pi$ and $Q_{\pi}$. In practice, however, it would still make sense to re-train $Q$ occasionally, because as $\pi$ evolves, the optimization emphasis shifts to regions of the state space where the original approximation of $Q$ was likely poor.
+Let us now prove that $Q_{\pi}$ and $Q_{\pi^{\ast}}$ rank actions identically.
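For intuition on why the q-estimate can be reused, here is a toy Python sketch (hypothetical names; a single abstract state): the current policy is repeatedly wrapped in a lookahead that scores candidates with one fixed q-function, and each wrapping behaves like squaring the effective number of candidates.

```python
import random

def make_lookahead(policy, q, k):
    """Wrap `policy` in a 1-step k-sample lookahead reusing a FIXED q-estimate."""
    def wrapped(rng):
        return max((policy(rng) for _ in range(k)), key=q)
    return wrapped

rng = random.Random(2)
N = 100
q = lambda a: float(a)               # fixed q-estimate: larger index is better
policy = lambda r: r.randrange(N)    # uniform base policy over N actions

def mean_q(pol, n=3000):
    """Monte-Carlo estimate of the expected q-value of actions chosen by `pol`."""
    return sum(q(pol(rng)) for _ in range(n)) / n

values = [mean_q(policy)]
for _ in range(3):                   # effective candidate counts: 4, 16, 64
    policy = make_lookahead(policy, q, k=4)
    values.append(mean_q(policy))
```

`values` climbs toward the maximal q-value even though the q-estimate is never refreshed; in a real MDP, occasional re-training of $Q$ would only matter once the approximation error under the shifted state distribution starts to dominate.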