Conversation
I think this example got here by inertia from the previous PR.
```python
else:
    scores = all_scores[0].clone().detach()

# Best-of-N Sampling.
scores_mask = scores != -1
```
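For context, a minimal pure-Python sketch of the masking idea in the diff above: `-1` is treated as a padding value, and Best-of-N keeps only the best-scoring sample per group. All names here are illustrative, not trlx's actual API, and the real code operates on tensors rather than lists.

```python
def best_of_n(scores, n):
    """Pick the index of the best score in each group of n samples,
    ignoring entries padded with -1 (hypothetical helper)."""
    best = []
    for start in range(0, len(scores), n):
        group = scores[start:start + n]
        # Mask out padding values (-1) before taking the max,
        # mirroring `scores_mask = scores != -1` in the diff.
        valid = [(s, i) for i, s in enumerate(group) if s != -1]
        best_score, best_idx = max(valid)
        best.append(start + best_idx)
    return best

# Example: two prompts, N=3 samples each; -1 is padding.
print(best_of_n([0.2, 0.9, -1, 0.5, 0.1, 0.7], n=3))  # [1, 5]
```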
I think we need to merge in the changes from your last PR.
```python
self.push_to_store(ppo_rl_elements)

@staticmethod
def get_topk_indices(input_tensor, window_size: int, k: int, device):
```
Nit: maybe a docstring should be added specifying that this isn't the same as a regular top-k, but rather a top-k over each `window_size`-sized window.
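To illustrate the nit, here is a hedged pure-Python sketch of the windowed top-k semantics being described. The real `get_topk_indices` works on tensors (likely via `torch.topk`); this sketch only shows how a per-window top-k differs from a single global top-k, and the function name and signature here are assumptions for illustration.

```python
def windowed_topk_indices(values, window_size, k):
    """Return the indices of the k largest values within each
    consecutive window of `window_size`, rather than one global
    top-k over the whole sequence (illustrative sketch)."""
    indices = []
    for start in range(0, len(values), window_size):
        window = values[start:start + window_size]
        # Rank positions within the window by value, descending,
        # and keep the top k from this window only.
        order = sorted(range(len(window)), key=lambda i: window[i], reverse=True)
        indices.extend(start + i for i in order[:k])
    return indices

# A global top-2 of this list would be [3, 1] (values 9 and 8),
# but a windowed top-2 keeps the best 2 from *each* window of 4.
print(windowed_topk_indices([5, 8, 1, 9, 2, 7, 6, 3], window_size=4, k=2))
# [3, 1, 5, 6]
```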
Good point, the benefit of BoN training seems to be problem-dependent. I've seen the most benefit when training on problems where the model has a low pass@1 score. |
|
@maxreciprocate If you're happy with this do you want to merge today? |
|
@Dahoas There are some run differences when using the default config without BoN sampling, most notably for the randomwalks case: |
Let me look into why. |
|
@Dahoas Not sure if that's the issue however, see: https://wandb.ai/sorry/trlx/reports/Difference-due-to-the-change-in-base_trainer-decode--Vmlldzo1MzE2OTg4 (+ some non-determinism) |