Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
46 changes: 46 additions & 0 deletions one-step-lookahead.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,46 @@
# Some Interesting Properties of 1-step K-sample Lookahead Operator

1-step K-sample lookahead operator (which we will simply call lookahead operator from now on) in reinforcement learning is a sample-based policy improvement operator that works by first sampling $K$ action candidates from the base policy and then proceeding with the action with largest q-value:

$$L_{K, \pi}(a \mid s) = E_{a_1, \ldots, a_K \sim \pi(a \mid s)}\Bbb{1}\left[a = \arg \max_{a' \in \\{a_1, \ldots, a_K\\}} Q(s, a')\right].$$

It can be thought of as an imperfect approximation of the optimal max-q operator. It is very useful when dealing with large action spaces, where the exact maximum of the q-function over all actions cannot be computed. In this note, we will list various properties of this operator that might be useful when using it in practice.

### Lookahead operator is indeed a policy improvement operator

Follows from the policy improvement theorem:

$$E_{a \sim L_{K, \pi}(a \mid s)} Q(s, a) = E_{a_1, \ldots, a_K \sim \pi(a \mid s)} \max_{a' \in \\{a_1, \ldots, a_K\\}} Q(s, a') \geq E_{a \sim \pi(a \mid s)} Q(s, a) = V_{\pi}.$$


### Repeated application of lookahead increases the effective number of candidates ###

Suppose we define a new policy

$$\pi^{\ast}(a \mid s) = L_{K, \pi}(a \mid s).$$

What is then $L_{K, \pi^{\ast}}(a \mid s)$, i.e. the result of chaining two applications of the lookahead operator? Here is one way to think about it: one invocation of $\pi^{\ast}$ is effectively samping $K$ action candidates and returning the best of them. Lookahead operator $L_{K, \pi^{\ast}}(a \mid s)$ invokes $\pi^{\ast}$ $K$ times, so it effectively samples $K$ groups of $K$ candidates each, selects the best candidate in each group and then selects the best action across all $K$ groups. Since the same q-function is used to select both inside groups and across groups, this procedure is equivalent to simply doing lookahead with $K^2$ candidates.

Stacking another lookahead on top will be equivalent to doing a single lookahead with $K^3$ candidates and so on. In other words, reaching the effective power of max-q improvement operator in an action space with $N$ actions requires just $O(log_K N)$ lookahead compositions, which is a very reasonable number even for very large action spaces.


### Lookahead-based Reinforcement Learning

The above property suggests a straightforward RL algorithm:

1. Evaluate the current policy $\pi$ to get $q_{\pi}(s, a)$,
* e.g. by sampling some trajectories from $\pi$ and then training a model to predict return-to-go.
2. Distill $L_{K, \pi}(a \mid s)$ into a new policy $\pi^{\ast}$,
* e.g. by sampling some trajectories from $L$ and then applying supervised training on this data.
3. Set $\pi$ to $\pi^{\ast}$ and go to the first step unless done.

This algorithm
* Converges to the optimal policy in MDPs and does it fast even in large action spaces (assuming that Q can be approximated well).
* Does not require to explicitly maximize the Q-function over all actions, as this is covered by the repeated application of lookahead.
* Is very robust, as it only uses supervised learning as a learning subroutine. No sketchy non-stationary loss functions!

#### Repeated Policy Evaluation is Not Necessary

Perhaps the most surprising property of the above algorithm is that re-evaluating the policy each time it was updated is not necessary: while $Q_{\pi}$ is not equal to $Q_{\pi^{\ast}}$ (which is expected: $\pi^{\ast}$ is an improvement over $\pi$), these q-functions rank actions identically due to constraints imposed on $\pi^{\ast}$ by the fact that it's not an arbitrary policy, but a lookahead policy made using $\pi$ and $Q_{\pi}$. In practice, however, it would still make sense to re-train $Q$ occasionally because as $\pi$ evolves, optimization emphasis shifts to the regions of state space where the original approximation of $Q$ was likely poor.

Let us now prove that $Q_{\pi}$ and $Q_{\pi^{\ast}}$ rank actions identically.