We update Q at the parent to be the P-weighted average of the Q of the children during MCTS backpropagation.
This suggests that we could add a loss term based on the difference between V and V*, where:
- `V` is the output of the value head
- `V*` is the implied value, obtained by taking the `P`-weighted average of the `AV` head
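As a concrete sketch, the implied value and the proposed loss term could look like the following (numpy; the function names and shapes are hypothetical — `P` and `AV` are per-action vectors for a single position, and squared error is just one plausible choice of penalty):

```python
import numpy as np

def implied_value(P, AV):
    """V*: the P-weighted average of the AV head outputs for one position."""
    return float(np.dot(P, AV))

def consistency_loss(V, P, AV):
    """Penalize disagreement between the value head output V and the implied V*."""
    return (V - implied_value(P, AV)) ** 2
```

For example, with `P = [0.5, 0.3, 0.2]` and `AV = [0.1, -0.2, 0.4]`, the implied value is `V* = 0.07`; a value head outputting `V = 0.05` would incur a small loss.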
For betazero, we could add a similar loss term based on U (value-uncertainty) and AU (action-value-uncertainty).
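The uncertainty analogue could mirror the value version directly. This is a sketch under the assumption that a `P`-weighted average is the right way to combine `AU` into an implied `U*`; a variance-style combination (weighting squared uncertainties) would be another defensible choice:

```python
import numpy as np

def implied_uncertainty(P, AU):
    """U*: the P-weighted average of the AU head outputs (assumed combination rule)."""
    return float(np.dot(P, AU))

def uncertainty_consistency_loss(U, P, AU):
    """Penalize disagreement between the U head output and the implied U*."""
    return (U - implied_uncertainty(P, AU)) ** 2
```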
This raises the question: if, with or without such a loss term, there are positions where there is a large gap between these two values...
- What does that mean?
- How does that impact MCTS mechanics?
- Can we mitigate/prevent such occurrences?
Without going into too much detail, here are some of my tentative answers:
- It means that there is a generational lag between the `V` head and the `AV` head.
- If `AV` is an overestimate, that's ok, because it merely causes an extra initial visit to that child, and is quickly overridden with the `V` output of that child. If `AV` is an underestimate, it can result in the child never getting a visit when one is perhaps warranted. This is not as ok.
- One idea is to have the `V` head loop back as an input into the `AV` head, and for the `AV` head to predict a delta between `V` and `AV`, rather than `AV` directly. This can be thought of as the difference between "predict the `V` of each child state" vs "predict the relative differences between the `V`'s of the child states". Similarly to how a softmaxed output head is really tasked with relative predictions rather than absolute predictions. I find it plausible that such a wiring could be more robust.
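At the interface level, the delta wiring amounts to reconstructing action-values as an offset from the parent's value. A minimal sketch (hypothetical function; in a real network `V` would also be fed back as an input feature to the `AV` head rather than just added at the end):

```python
import numpy as np

def av_from_delta(V, delta):
    """Reconstruct action-values from the parent value V and the per-child
    offsets predicted by the delta-style AV head: AV = V + delta.
    The head is only responsible for relative differences between children."""
    return V + np.asarray(delta, dtype=float)
```

With `V = 0.1` and predicted deltas `[0.0, -0.3]`, the reconstructed action-values are `[0.1, -0.2]`: the head never has to get the absolute level right, only the spread.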
In the betazero context, I find it further plausible that we can have a gap between V and V* impact U (value-uncertainty), either through a loss term, or by some sort of adjustment/smoothening performed on the fly.
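One possible on-the-fly adjustment along those lines (a hypothetical form, with `alpha` as an assumed tuning knob, not anything proposed in the issue): inflate the value-uncertainty by the size of the V/V* gap, so that positions where the two heads disagree are treated as more uncertain during search.

```python
def adjusted_uncertainty(U, V, V_star, alpha=1.0):
    """Smooth U on the fly: positions with a large gap between the value head
    output V and the implied value V* get their uncertainty inflated."""
    return U + alpha * abs(V - V_star)
```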
This warrants a lot of experimentation, performed across multiple games.