
Training signal could be improved #363

@jpiabrantes

Description


This issue is about the train method in pufferl.py.

All of the data (observations, rewards, actions, terminals, etc.) have the shape (segments, horizon, ...).

Horizon is usually 64. The buffer stores 64 states, but that only amounts to 63 transitions.

This means rewards[:, 0] and terminals[:, 0] are never used, which is a tiny waste, but more importantly advantages[:, -1] is always 0 (before normalisation), and mb_returns[:, -1] is always equal to mb_values[:, -1].

So for the last sample of every segment, the pg_loss is ~0 (not exactly 0 because of the advantage normalisation), the entropy loss pushes the policy towards more randomness, and the value function loss is conservative, i.e. it pushes the value function towards its own previous prediction.
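
To make the indexing concrete, here is a minimal sketch of GAE over such a buffer. This is not PufferLib's actual code; the shapes and the names rewards/values/terminals follow the description above, and gamma/gae_lambda are illustrative. Since reward and terminal at index t describe the arrival at state t, the backward recursion can only cover t = 0..horizon-2, so advantages[:, -1] is never written and stays 0:

```python
# Sketch only: GAE over a buffer that stores `horizon` states but only
# `horizon - 1` transitions. Shapes: (segments, horizon).
import numpy as np

def gae_over_states(rewards, values, terminals, gamma=0.99, gae_lambda=0.95):
    # rewards[:, t] / terminals[:, t] describe arriving at state t, so
    # index 0 is never consumed and index horizon-1 has no successor.
    segments, horizon = rewards.shape
    advantages = np.zeros_like(values)
    last_adv = np.zeros(segments)
    for t in reversed(range(horizon - 1)):
        nonterminal = 1.0 - terminals[:, t + 1]
        delta = rewards[:, t + 1] + gamma * values[:, t + 1] * nonterminal - values[:, t]
        last_adv = delta + gamma * gae_lambda * nonterminal * last_adv
        advantages[:, t] = last_adv
    returns = advantages + values
    # advantages[:, -1] stays 0, so returns[:, -1] == values[:, -1].
    return advantages, returns
```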

I think the typical way to fix this is to store 64 transitions instead of 64 states.
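
Under the same assumptions, here is a sketch of what storing 64 transitions could look like: each slot holds the outcome of acting in state t, and a separately stored bootstrap value (the hypothetical bootstrap_value argument below) gives V of the state following the last stored step, so every index gets a real advantage target:

```python
# Sketch only: GAE when every one of the `horizon` slots is a full transition
# and `bootstrap_value` (shape (segments,)) is V(s_horizon).
def gae_over_transitions(rewards, values, terminals, bootstrap_value,
                         gamma=0.99, gae_lambda=0.95):
    segments, horizon = rewards.shape
    advantages = np.zeros_like(values)
    last_adv = np.zeros(segments)
    next_value = bootstrap_value
    for t in reversed(range(horizon)):
        nonterminal = 1.0 - terminals[:, t]
        delta = rewards[:, t] + gamma * next_value * nonterminal - values[:, t]
        last_adv = delta + gamma * gae_lambda * nonterminal * last_adv
        advantages[:, t] = last_adv
        next_value = values[:, t]
    return advantages, advantages + values
```

In this sketch the only extra bookkeeping is carrying one bootstrap value per segment; the rest of the loss computation is unchanged.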
