[fix] Apply loss mask to KL in REINFORCE++ returns calculation #559
guapisolo wants to merge 7 commits into radixark:main
Conversation
Co-Authored-By: miles-code-angel <miles.pr.bot@gmail.com>
Summary of Changes
Hello @guapisolo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request primarily addresses a bug in the REINFORCE++ returns calculation by ensuring the loss mask is correctly applied to the KL divergence. Beyond this fix, it adds flexibility by allowing custom generation functions to be defined at both the dataset and sample level, catering to more complex evaluation and training setups. It also improves the robustness of distributed weight updates for quantized models through better synchronization and expands the utility of quantization ignore rules.
Code Review
This pull request introduces a key fix in the REINFORCE++ returns calculation by applying the loss mask to the KL divergence, ensuring that the KL penalty is only applied to on-policy tokens. It also includes several other valuable improvements: a bug fix in filter_long_prompt to return the correct type, a new feature to support per-sample/per-dataset custom generation functions, and better argument validation.
I've found one minor issue in miles/utils/ppo_utils.py where a line of code became redundant after the changes. My suggestion is to remove it for cleaner code. Overall, these are great changes that improve correctness and add useful flexibility.
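For reviewers less familiar with the computation, here is a minimal sketch of how the masked KL feeds into REINFORCE++ token-level rewards and returns. This is not the repository's actual implementation: only the names full_kl_response, full_mask, and kl_coef come from the diff; the function name, the placement of the scalar reward at the last unmasked token, and the discounting loop are illustrative assumptions.

```python
import torch

def reinforce_pp_returns(full_kl_response, full_mask, rewards, kl_coef, gamma=1.0):
    # full_kl_response: [batch, seq_len] per-token KL(pi_theta || pi_ref)
    # full_mask:        [batch, seq_len] 1.0 where the token counts toward the loss
    # rewards:          [batch] scalar sequence-level reward
    # Apply the loss mask to the KL before it enters the token-level reward,
    # so masked-out positions contribute nothing to the returns.
    masked_kl = full_kl_response * full_mask
    token_level_rewards = -kl_coef * masked_kl

    # Add the scalar reward at the last unmasked position of each sequence.
    last_idx = full_mask.cumsum(dim=-1).argmax(dim=-1)
    batch_idx = torch.arange(rewards.size(0), device=rewards.device)
    token_level_rewards[batch_idx, last_idx] += rewards

    # REINFORCE++-style returns: reversed (discounted) cumulative sum of rewards.
    returns = torch.zeros_like(token_level_rewards)
    running = torch.zeros_like(rewards)
    for t in reversed(range(token_level_rewards.size(1))):
        running = token_level_rewards[:, t] + gamma * running
        returns[:, t] = running
    return returns * full_mask
```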
masked_kl = full_kl_response * full_mask
token_level_rewards = -kl_coef * masked_kl
The assignment to token_level_rewards on line 256 is now redundant because it's immediately overwritten on line 260. To make the code cleaner and more efficient, you can remove the initial assignment. Since line 256 is a context line and cannot be modified in this review, I'm suggesting a change here that effectively does that by combining these two lines. The logic should be to compute masked_kl and then initialize token_level_rewards from it.
- masked_kl = full_kl_response * full_mask
- token_level_rewards = -kl_coef * masked_kl
+ token_level_rewards = -kl_coef * (full_kl_response * full_mask)
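The combined form is algebraically identical to the two-step version, so the suggestion is purely a cleanup of the redundant assignment. A quick self-contained check (tensor values are made up for illustration):

```python
import torch

full_kl_response = torch.randn(2, 5)
full_mask = torch.tensor([[1., 1., 1., 0., 0.],
                          [1., 1., 0., 0., 0.]])
kl_coef = 0.05

# Two-step form from the PR diff.
masked_kl = full_kl_response * full_mask
two_step = -kl_coef * masked_kl

# Combined single assignment from the suggestion.
combined = -kl_coef * (full_kl_response * full_mask)

assert torch.allclose(two_step, combined)
```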
No description provided.