
Conversation

@eitanporat
Contributor

Summary

  • Simple environment where agent must pick action 0 on step 1 to win
  • Episode terminates at step 128, reward given only at termination
  • 2 discrete actions, 50% random baseline
  • PufferLib with bptt_horizon=64 cannot solve the credit-assignment problem for 128-step episodes

Results (success rate):

  • Agent that always picks action 0: 100%
  • Random agent: 50%
  • PufferLib with bptt_horizon=64: 50%
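For concreteness, here is a minimal sketch of the environment described in the summary, assuming a Gymnasium-style API; the class name, observation shape, and episode-length parameter are illustrative rather than the PR's actual implementation:

```python
import gymnasium as gym
import numpy as np

class PickZeroOnStepOne(gym.Env):
    """Constant observation; reward 1.0 at step 128 iff action 0 was chosen on step 1."""

    def __init__(self, episode_length=128):
        self.observation_space = gym.spaces.Box(0.0, 1.0, (1,), dtype=np.float32)
        self.action_space = gym.spaces.Discrete(2)
        self.episode_length = episode_length

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.t = 0
        self.won = False
        return np.zeros(1, dtype=np.float32), {}

    def step(self, action):
        if self.t == 0:
            self.won = (action == 0)  # only the very first action matters
        self.t += 1
        terminated = self.t >= self.episode_length
        reward = float(self.won) if terminated else 0.0  # reward only at termination
        return np.zeros(1, dtype=np.float32), reward, terminated, False, {}
```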

The core issue is that truncated BPTT cuts off gradients at segment boundaries, preventing credit from flowing back to early actions. A potential fix would be to perform two forward passes per rollout: the first to collect experiences, and the second (after seeing more of the trajectory) to compute improved bootstrap value estimates at segment boundaries. This would allow the value function to incorporate information beyond the BPTT horizon without requiring full backpropagation through the entire episode.
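As a rough sketch of that two-pass idea (not PufferLib's actual API: the recurrent policy interface `policy(obs, state) -> (logits, value, state)`, tensor shapes, and function names below are all assumptions), the second pass carries the hidden state across segment boundaries without gradients and supplies the bootstrap value at each boundary:

```python
import torch

def segment_returns_with_full_pass_bootstrap(policy, obs, rewards, dones,
                                              horizon=64, gamma=0.99):
    """obs: (T, obs_dim); rewards, dones: (T,) tensors. Returns targets of shape (T,)."""
    T = obs.shape[0]

    # Second forward pass: run the whole trajectory with hidden state carried
    # across segment boundaries, no gradients, so the value estimates "see"
    # context beyond each 64-step segment (including the first action).
    with torch.no_grad():
        values = torch.zeros(T)
        state = None
        for t in range(T):
            _, values[t], state = policy(obs[t], state)

    # Compute bootstrapped returns per segment. At each segment boundary the
    # bootstrap comes from the full-pass value estimate instead of a value
    # computed from a freshly reset hidden state.
    targets = torch.zeros(T)
    for start in range(0, T, horizon):
        end = min(start + horizon, T)
        bootstrap = 0.0 if (end == T or dones[end - 1]) else values[end]
        ret = bootstrap
        for t in reversed(range(start, end)):
            ret = rewards[t] + gamma * ret * (1.0 - dones[t].float())
            targets[t] = ret
    return targets
```

This keeps gradient computation within each 64-step segment, but the bootstrap target at each boundary now reflects a hidden state that has seen the entire trajectory up to that point.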

I leave this as an open problem for other contributors.
