[feat] add keep_terminal flag to support keeping terminal observation #6
Open
adzcai wants to merge 1 commit into EdanToledo:main
Owner
I have seen this; I'm just unable to review this week. Will try to handle it ASAP.
Owner
So, I've been thinking about this more and I'm very much in favor of making this a choice. I will try to review the PR this week.
See discussion in EdanToledo/Stoix#181. The issue is basically how to handle episode boundaries. Suppose `base_env.step` enters a state with `timestep.done() == True` (either terminated or truncated). The question is what an auto-reset wrapper should return:

1. Return the observation from `base_env.reset` and keep the other timestep properties from `base_env.step`. For proper bootstrapping in algorithms, though, this requires doubling the number of critic evaluations.
2. Return the timestep from `base_env.step` and return `base_env.reset`, with a dummy reward and discount, on the next call to `wrapped_env.step`. This might require masking out any losses computed based on the policy's actions in the final state.

Both of these are valid choices, and we should let the user decide which they prefer. For example, in settings where evaluating the critic involves some form of search, option two would incur half as many critic invocations as option one.
This PR also fixes #5 by re-implementing the optimistic auto reset wrapper.
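
To make the two options concrete, here is a minimal, framework-free sketch in plain Python. It is not the code in this PR: `TimeStep`, `CountingEnv`, `AutoResetWrapper`, and the `next_obs` extras key are hypothetical names used only for illustration, and the dummy reward of `0.0` and discount of `1.0` are assumed values. With `keep_terminal=False` the wrapper behaves like option one (reset observation returned immediately, true terminal observation stashed in `extras`); with `keep_terminal=True` it behaves like option two (terminal timestep returned as-is, reset on the next call).

```python
from dataclasses import dataclass, field, replace
from typing import Any, Dict


@dataclass(frozen=True)
class TimeStep:
    observation: int
    reward: float
    discount: float
    last: bool
    extras: Dict[str, Any] = field(default_factory=dict)

    def done(self) -> bool:
        return self.last


class CountingEnv:
    """Toy env: the observation is the step count; episodes last `horizon` steps."""

    def __init__(self, horizon: int = 2):
        self.horizon = horizon
        self.t = 0

    def reset(self) -> TimeStep:
        self.t = 0
        return TimeStep(observation=0, reward=0.0, discount=1.0, last=False)

    def step(self, action: int) -> TimeStep:
        self.t += 1
        last = self.t >= self.horizon
        return TimeStep(self.t, 1.0, 0.0 if last else 1.0, last)


class AutoResetWrapper:
    """Hypothetical auto-reset wrapper illustrating both options."""

    def __init__(self, env: CountingEnv, keep_terminal: bool):
        self.env = env
        self.keep_terminal = keep_terminal
        self._needs_reset = False

    def reset(self) -> TimeStep:
        self._needs_reset = False
        return self.env.reset()

    def step(self, action: int) -> TimeStep:
        if self._needs_reset:
            # Option two, second half: the action is ignored and the reset
            # timestep is returned with a dummy reward and discount (assumed values).
            self._needs_reset = False
            return replace(self.env.reset(), reward=0.0, discount=1.0)

        ts = self.env.step(action)
        if not ts.done():
            return ts

        if self.keep_terminal:
            # Option two, first half: return the terminal timestep unchanged.
            self._needs_reset = True
            return ts

        # Option one: swap in the reset observation, keep reward/discount/last,
        # and stash the true terminal observation for bootstrapping.
        first = self.env.reset()
        return replace(
            ts,
            observation=first.observation,
            extras={"next_obs": ts.observation},
        )


def rollout(keep_terminal: bool, num_steps: int) -> list:
    env = AutoResetWrapper(CountingEnv(horizon=2), keep_terminal=keep_terminal)
    env.reset()
    return [env.step(0) for _ in range(num_steps)]
```

Under option one, a learner that bootstraps correctly has to evaluate the critic on both `observation` and `extras["next_obs"]` at every step, which is where the doubling comes from. Under option two, the transition produced by the ignored action (the step that returns the reset timestep) is the one whose loss terms may need masking.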