Fix policy loss gradient in TD3 #249
Open
ethanluoyc wants to merge 1 commit into google-deepmind:master from
Conversation
The gradient `dq_da` is currently incorrect: the gradients for each action dimension should be summed rather than averaged, as in https://github.com/deepmind/rlax/blob/master/rlax/_src/policy_gradients_test.py#L55.

Note that the D4PG agent also doesn't sum over the action dimension, so if someone wishes to write a D4PG-BC agent, this may cause a similar issue.

The current policy loss is fine for running online TD3 with this version: with an optimizer such as Adam, which normalizes gradients by their magnitude, the constant factor does not affect the computed policy updates.

However, the current computation is problematic if a user wants to use Acme's TD3 to reproduce results from the TD3-BC paper with its default `bc_alpha` hyperparameter of 2.5. Without the sum, the relative magnitude of the critic gradient and the BC loss differs from the original implementation. I have noticed that this version of TD3 performs badly on some of the D4RL locomotion datasets (e.g., hopper-medium-replay-v2): without summing over the action dimension, the evaluation return is very unstable.
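For concreteness, here is a minimal JAX sketch of the intended computation. This is not Acme's actual TD3 loss; `policy_network` and `critic_network` are hypothetical placeholders for pure functions mapping `(params, observation[, action])` to an action / a scalar Q-value. The point is only that the per-dimension gradients `dq_da` enter the surrogate loss through a sum, matching the semantics of rlax's `dpg_loss`:

```python
import jax
import jax.numpy as jnp


def policy_loss(policy_params, critic_params, policy_network, critic_network,
                observations):
  """DPG-style policy loss with dq_da summed over action dimensions."""

  def per_example_loss(observation):
    action = policy_network(policy_params, observation)            # shape [A]
    # Gradient of Q(s, a) with respect to the action, shape [A].
    dq_da = jax.grad(
        lambda a: critic_network(critic_params, observation, a))(action)
    # Chain-rule surrogate: the contributions from each action dimension are
    # *summed* (as in rlax.dpg_loss). Taking a mean here instead rescales the
    # critic term by 1 / action_dim.
    return -jnp.sum(jax.lax.stop_gradient(dq_da) * action)

  return jnp.mean(jax.vmap(per_example_loss)(observations))
```

With the sum in place, the critic term keeps the same scale as in the official TD3-BC implementation, so the weighting between the Q term (scaled by `bc_alpha`) and the BC regression term carries over; with an average, the critic gradient is effectively scaled down by the action dimensionality relative to the BC loss.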
ethanluoyc (Author)
@alexis-jacq FYI since we had a discussion last time about some implementation differences between this TD3 and the official PyTorch implementation.