Skip to content

[feat] add keep_terminal flag to support keeping terminal observation#6

Open
adzcai wants to merge 1 commit intoEdanToledo:mainfrom
adzcai:feat/keep-terminal-flag
Open

[feat] add keep_terminal flag to support keeping terminal observation#6
adzcai wants to merge 1 commit intoEdanToledo:mainfrom
adzcai:feat/keep-terminal-flag

Conversation

@adzcai
Copy link

@adzcai adzcai commented Oct 10, 2025

See discussion in EdanToledo/Stoix#181. The issue is basically how to handle episode boundaries. Suppose base_env.step enters a state with timestep.done() == True (either terminated or truncated). So the question is what an auto-reset wrapper should return:

  1. Return the state and observation from base_env.reset and keep the other timestep properties from base_env.step. For proper bootstrapping in algorithms, though, this requires doubling the number of critic evaluations.
  2. Just return everything from base_env.step and return base_env.reset, with a dummy reward and discount, on the next call to wrapped_env.step. This might require masking out any losses computed based on the policy's actions in the final state.

Both of these are valid choices and we should enable the user to decide which they prefer. For example, in settings where evaluating the critic involves some form of search, option two would incur half the number of critic invocations as option one.

This PR also fixes #5 by re-implementing the optimistic auto reset wrapper.

@EdanToledo
Copy link
Owner

I have seen this, i'm just unable to review this week. Will try handle asap

@EdanToledo
Copy link
Owner

So, i've been thinking about this more and im very PRO making this a choice. I will try and review the PR this week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[bug] Confusing optimistic auto reset implementation

2 participants