
Critic q loss in PPO agent seems to be wrong #6

@Asad-Shahid

Description

In line 282 of ppo_agent.py, the critic is trained using:

```python
value_loss = self._config.value_loss_coeff * (ret - value_pred).pow(2).mean()
```

where `ret` is computed as `ret = adv + vpred[:-1]`.

This way of calculating the return actually gives a q-loss: since A(s, a) = Q(s, a) - V(s), the target `ret = adv + vpred[:-1]` is an estimate of Q(s, a). However, the critic network only predicts the state value v.

So it seems like the critic is trained with a q-loss while it is used to predict only state values. Could you clarify this?
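To make the question concrete, here is a minimal sketch of the computation I am referring to (the function name and signature are mine for illustration, not from the repo), assuming the advantage comes from standard GAE:

```python
import torch

def gae_return_target(rewards, vpred, gamma=0.99, lam=0.95):
    """Illustrative sketch, not the repo's code.

    rewards: [T] tensor of rewards.
    vpred:   [T+1] tensor of value predictions (last entry is the
             bootstrap value for the final state).
    """
    T = rewards.shape[0]
    adv = torch.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        # TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * vpred[t + 1] - vpred[t]
        last = delta + gamma * lam * last
        adv[t] = last
    # The line I am asking about: adding V back onto the advantage,
    # so ret_t = adv_t + V(s_t), which reads like an estimate of Q(s_t, a_t).
    ret = adv + vpred[:-1]
    return adv, ret
```

Under this computation, ret_t = adv_t + V(s_t), which is what leads to my reading of the value loss above.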

Thanks
