We update Q at the parent to be the P-weighted average of the Q of the children during MCTS backpropagation.
This suggests that we could add a loss term based on the difference between V and V*, where:
- `V` is the output of the value head
- `V*` is the implied value, obtained by taking the `P`-weighted average of the `AV` head
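As a concrete sketch, the implied value and the proposed loss term could look like the following (numpy; the function names and shapes are hypothetical — `P` and `AV` are per-action vectors for a single position, and squared error is just one plausible choice of penalty):

```python
import numpy as np

def implied_value(P, AV):
    """V*: the P-weighted average of the AV head outputs for one position."""
    return float(np.dot(P, AV))

def consistency_loss(V, P, AV):
    """Penalize disagreement between the value head output V and the implied V*."""
    return (V - implied_value(P, AV)) ** 2
```

For example, with `P = [0.5, 0.3, 0.2]` and `AV = [0.1, -0.2, 0.4]`, the implied value is `V* = 0.07`; a value head outputting `V = 0.05` would incur a small loss.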
For betazero, we could add a similar loss term based on U (value-uncertainty) and AU (action-value-uncertainty).
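The uncertainty analogue could mirror the value version directly. This is a sketch under the assumption that a `P`-weighted average is the right way to combine `AU` into an implied `U*`; a variance-style combination (weighting squared uncertainties) would be another defensible choice:

```python
import numpy as np

def implied_uncertainty(P, AU):
    """U*: the P-weighted average of the AU head outputs (assumed combination rule)."""
    return float(np.dot(P, AU))

def uncertainty_consistency_loss(U, P, AU):
    """Penalize disagreement between the U head output and the implied U*."""
    return (U - implied_uncertainty(P, AU)) ** 2
```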
This raises the question: if, with or without such a loss term, there are positions where there is a large gap between these two values...
- What does that mean?
- How does that impact MCTS mechanics?
- Can we mitigate/prevent such occurrences?
Without going into too much detail, here are some of my tentative answers:
- It means that there is a generational lag between the `V` head and the `AV` head.
- If `AV` is an overestimate, that's ok, because it merely causes an extra initial visit to that child, and is quickly overridden with the `V` output of that child. If `AV` is an underestimate, it can result in the child never getting a visit when one is perhaps warranted. This is not as ok.
- One idea is to have the `V` head loop back as an input into the `AV` head, and for the `AV` head to predict a delta between `V` and `AV`, rather than `AV` directly. This can be thought of as the difference between "predict the `V` of each child state" vs "predict the relative differences between the `V`'s of the child states". Similarly to how a softmaxed output head is really tasked with relative predictions rather than absolute predictions. I find it plausible that such a wiring could be more robust.
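At the interface level, the delta wiring amounts to reconstructing action-values as an offset from the parent's value. A minimal sketch (hypothetical function; in a real network `V` would also be fed back as an input feature to the `AV` head rather than just added at the end):

```python
import numpy as np

def av_from_delta(V, delta):
    """Reconstruct action-values from the parent value V and the per-child
    offsets predicted by the delta-style AV head: AV = V + delta.
    The head is only responsible for relative differences between children."""
    return V + np.asarray(delta, dtype=float)
```

With `V = 0.1` and predicted deltas `[0.0, -0.3]`, the reconstructed action-values are `[0.1, -0.2]`: the head never has to get the absolute level right, only the spread.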
In the betazero context, I find it further plausible that we can have a gap between V and V* impact U (value-uncertainty), either through a loss term, or by some sort of adjustment/smoothening performed on the fly.
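One possible on-the-fly adjustment along those lines (a hypothetical form, with `alpha` as an assumed tuning knob, not anything proposed in the issue): inflate the value-uncertainty by the size of the V/V* gap, so that positions where the two heads disagree are treated as more uncertain during search.

```python
def adjusted_uncertainty(U, V, V_star, alpha=1.0):
    """Smooth U on the fly: positions with a large gap between the value head
    output V and the implied value V* get their uncertainty inflated."""
    return U + alpha * abs(V - V_star)
```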
This warrants a lot of experimentation, performed across multiple games.