Problem
In PolicyRunner (or wherever on_frame is dispatched), exceptions from the
hook are caught and logged at WARN level, then iteration continues:
try:
on_frame(step_count, observation, action_dict)
except CooperativeStop:
raise
except Exception as e:
logger.warning("on_frame hook raised: %s", e)
Failure mode: if the recording hook has a real bug (e.g. obs["missing_key"]
or a dead mount point on the video writer), it throws on every step. A
500-step episode produces 500 log lines, then completes "successfully" with
zero frames written. The dataset is silently empty. The agent learns nothing.
Fix
Count consecutive on_frame failures. After N in a row (5? 10?), raise and
fail the episode with the underlying exception chained.
_consecutive_onframe_failures = 0
_MAX_CONSECUTIVE_FAILURES = 5
try:
on_frame(step_count, observation, action_dict)
_consecutive_onframe_failures = 0
except CooperativeStop:
raise
except Exception as e:
_consecutive_onframe_failures += 1
logger.warning("on_frame hook failed (%d/%d): %s",
_consecutive_onframe_failures, _MAX_CONSECUTIVE_FAILURES, e)
if _consecutive_onframe_failures >= _MAX_CONSECUTIVE_FAILURES:
raise RuntimeError(
f"on_frame hook failed {_MAX_CONSECUTIVE_FAILURES} times in a row; aborting episode"
) from e
Acceptance
- Threshold documented + configurable (env var or PolicyRunner ctor arg)
- Regression test: hook that always raises → episode fails fast instead of
silently completing
- Regression test: hook that raises once then recovers → counter resets
Surfaced by a second-opinion review of PR #85.
Problem
In
PolicyRunner(or whereveron_frameis dispatched), exceptions from thehook are caught and logged at WARN level, then iteration continues:
Failure mode: if the recording hook has a real bug (e.g.
obs["missing_key"]or a dead mount point on the video writer), it throws on every step. A
500-step episode produces 500 log lines, then completes "successfully" with
zero frames written. The dataset is silently empty. The agent learns nothing.
Fix
Count consecutive
on_framefailures. After N in a row (5? 10?), raise andfail the episode with the underlying exception chained.
Acceptance
silently completing
Surfaced by a second-opinion review of PR #85.