Description
The computation of the target Q value in critic_loss_fn() of the SERL SAC code has a potential bug.
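In this file, if you set config['backup_entropy']=True, the term temperature * next_actions_log_probs is subtracted from target_q after the discounted next-state value has already been added. This is mathematically equivalent to

$$y(r, s', d) = r + \gamma (1 - d) \min_{i} Q_{\text{targ},i}(s', a') - \alpha \log \pi(a' \mid s')$$

where y(r,s',d) = target_q, r = batch['rewards'], \gamma = config['discount'], (1-d) = batch['masks'], \min_i Q_{\text{targ},i}(s', a') = target_next_min_q, \alpha = temperature, and \log \pi(a' \mid s') = next_actions_log_probs.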
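But the formula for target_q should be

$$y(r, s', d) = r + \gamma (1 - d) \left( \min_{i} Q_{\text{targ},i}(s', a') - \alpha \log \pi(a' \mid s') \right)$$

i.e. the entropy term should also be multiplied by \gamma*(1-d). The entropy bonus belongs to the next state's soft value, so it must be discounted by \gamma and masked out on terminal transitions just like the next-state Q value; otherwise the value estimates are biased.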
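To make the difference concrete, here is a minimal sketch of both target computations. It is not the SERL source: the function names and the flat argument list are hypothetical, chosen to mirror the variables named in this issue (rewards, masks, discount, target_next_min_q, temperature, next_actions_log_probs). The functions use only arithmetic, so they should work equally on Python floats, NumPy arrays, or JAX arrays.

```python
def target_q_as_reported(rewards, masks, discount, target_next_min_q,
                         temperature, next_actions_log_probs,
                         backup_entropy=True):
    """Behavior described in this issue: the entropy term is subtracted
    after discounting, so it is neither discounted nor masked."""
    target_q = rewards + discount * masks * target_next_min_q
    if backup_entropy:
        target_q = target_q - temperature * next_actions_log_probs
    return target_q


def target_q_discounted_entropy(rewards, masks, discount, target_next_min_q,
                                temperature, next_actions_log_probs,
                                backup_entropy=True):
    """Proposed form: the entropy term is part of the next-state soft value,
    so it is scaled by discount * masks together with the Q value."""
    next_value = target_next_min_q
    if backup_entropy:
        next_value = next_value - temperature * next_actions_log_probs
    return rewards + discount * masks * next_value


# Illustrative numbers only (not taken from any experiment).
args = dict(rewards=1.0, discount=0.5, target_next_min_q=2.0,
            temperature=0.5, next_actions_log_probs=-2.0)

# Terminal transition (masks = 0): the proposed target reduces to the reward,
# while the reported computation still adds the entropy bonus.
reported_terminal = target_q_as_reported(masks=0.0, **args)
proposed_terminal = target_q_discounted_entropy(masks=0.0, **args)
```

On the terminal transition the two variants differ by exactly temperature * next_actions_log_probs, which is the clearest symptom: a terminal target should not depend on an action sampled in a state that is never acted in.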
Sources:
[1] SAC paper, see eq 3 for Value function
[2] Spinning up RL by OpenAI, SAC pseudocode, see line 12 for computing target q values.
The fix is quite simple. You subtract the