Description
This issue is about the train method in pufferl.py.
All the data (observations, rewards, actions, terminals, etc.) have the shape (segments, horizon, ...).
The horizon is usually 64. Each segment stores 64 states, but that only amounts to 63 transitions.
This means rewards[:, 0] and terminals[:, 0] are never used, which is a tiny waste, but more importantly advantages[:, -1] is always 0 (before the normalisation). mb_returns[:, -1] is also always equal to mb_values[:, -1].
So for the last sample of every segment, the pg_loss is ~0 (not exactly 0 because of the normalisation of the advantage), the entropy loss makes the policy more random, and the value function loss is conservative - i.e. it pushes the value function towards its own previous prediction.
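Here is a minimal sketch of the indexing issue, assuming GAE is computed backwards over the stored states using the reward/terminal/value at index t + 1 and with no bootstrap value beyond the last state. The shapes and names are illustrative, not copied from pufferl.py:

```python
import numpy as np

# Illustrative shapes matching the issue: (segments, horizon)
segments, horizon = 4, 64
gamma, gae_lambda = 0.99, 0.95
rng = np.random.default_rng(0)

values = rng.normal(size=(segments, horizon)).astype(np.float32)
rewards = rng.normal(size=(segments, horizon)).astype(np.float32)
terminals = np.zeros((segments, horizon), dtype=np.float32)

# GAE computed backwards over the 64 stored states. The recursion for step t
# reads reward/terminal/value at t + 1, so index 0 of rewards/terminals is
# never read and nothing ever writes advantages[:, -1].
advantages = np.zeros_like(values)
lastgaelam = np.zeros(segments, dtype=np.float32)
for t in reversed(range(horizon - 1)):
    nextnonterminal = 1.0 - terminals[:, t + 1]
    delta = rewards[:, t + 1] + gamma * values[:, t + 1] * nextnonterminal - values[:, t]
    lastgaelam = delta + gamma * gae_lambda * nextnonterminal * lastgaelam
    advantages[:, t] = lastgaelam

returns = advantages + values
assert np.all(advantages[:, -1] == 0.0)            # last column never updated
assert np.allclose(returns[:, -1], values[:, -1])  # so return == value there
```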
I think the typical way to fix this is to store 64 transitions instead of 64 states.
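A rough sketch of what that could look like, assuming each segment stores 64 transitions plus one extra bootstrap value prediction for the state after the last action (hypothetical names, not the actual pufferl.py buffers):

```python
import numpy as np

segments, horizon = 4, 64
gamma, gae_lambda = 0.99, 0.95
rng = np.random.default_rng(0)

# Per segment: `horizon` transitions (r_t, done_t) and `horizon + 1` value
# predictions, the last one being the bootstrap value for the final state.
rewards = rng.normal(size=(segments, horizon)).astype(np.float32)
dones = np.zeros((segments, horizon), dtype=np.float32)
values = rng.normal(size=(segments, horizon + 1)).astype(np.float32)

advantages = np.zeros((segments, horizon), dtype=np.float32)
lastgaelam = np.zeros(segments, dtype=np.float32)
for t in reversed(range(horizon)):
    nextnonterminal = 1.0 - dones[:, t]
    delta = rewards[:, t] + gamma * values[:, t + 1] * nextnonterminal - values[:, t]
    lastgaelam = delta + gamma * gae_lambda * nextnonterminal * lastgaelam
    advantages[:, t] = lastgaelam

returns = advantages + values[:, :horizon]
# Every column of `advantages` is now filled, so the last sample of each
# segment contributes a real policy-gradient and value-loss signal.
```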