[fix] Apply loss mask to KL in REINFORCE++ returns calculation #559

Open

guapisolo wants to merge 7 commits into radixark:main from guapisolo:fix_kl_reinforce

Conversation

@guapisolo
Collaborator

No description provided.

guapisolo and others added 7 commits February 4, 2026 00:31
Co-Authored-By: miles-code-angel <miles.pr.bot@gmail.com>
@gemini-code-assist
Contributor

Summary of Changes

Hello @guapisolo, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request primarily addresses a critical bug in the REINFORCE++ returns calculation by ensuring the loss mask is correctly applied to the KL divergence. Beyond this essential fix, it significantly enhances the system's flexibility by allowing custom generation functions to be defined at both dataset and sample levels, catering to more complex evaluation and training setups. Additionally, it improves the robustness of distributed weight updates for quantized models through better synchronization mechanisms and expands the utility of quantization ignore rules.

Highlights

  • Core Fix: REINFORCE++ KL Masking: Correctly applies the loss mask to the KL divergence term in the REINFORCE++ returns calculation, ensuring that only relevant tokens contribute to the KL penalty and improving the accuracy of the policy gradient.
  • Flexible Custom Generation Functions: Introduces support for specifying custom generation functions at both the dataset and individual sample levels, enhancing flexibility for advanced use cases such as tool calling or specific evaluation scenarios (a small sketch follows this list).
  • Improved Distributed Weight Update Synchronization: Refines the synchronization logic during distributed weight updates for quantized models by ensuring proper pausing/resuming of generation and adding distributed barriers, leading to more robust and predictable behavior.
  • Expanded Quantization Ignore Rules: Extends the ignore rules for parameter quantization to include prefix matching, providing more granular control over which parameters are excluded from the quantization process.
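To make the custom-generation highlight concrete, here is a minimal sketch of how a per-sample override could be resolved ahead of a global argument. Only the field name generate_function_path comes from this PR; the Sample fields shown, the "module:function" path convention, and the load_function / resolve_generate_fn helpers are illustrative assumptions rather than the project's actual API.

import importlib
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Sample:
    prompt: str
    generate_function_path: Optional[str] = None  # per-sample override added by this PR

def load_function(path: str) -> Callable:
    # Assumed "pkg.module:func_name" convention; resolve the module, then the attribute.
    module_name, func_name = path.split(":")
    return getattr(importlib.import_module(module_name), func_name)

def resolve_generate_fn(sample: Sample, global_path: Optional[str], default_fn: Callable) -> Callable:
    # The per-sample path takes priority over the global argument; otherwise use the default.
    path = sample.generate_function_path or global_path
    return load_function(path) if path else default_fn

In this scheme, a dataset-level custom_generate_function_path would simply be copied onto each of its samples before rollout, which matches the propagation described in the sglang_rollout.py changelog entry below.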
Changelog
  • miles/backends/megatron_utils/megatron_to_hf/processors/quantizer_compressed_tensors.py
    • Expanded quantization ignore rules to support prefix matching for parameter names (see the prefix-matching sketch after this changelog).
  • miles/backends/megatron_utils/update_weight/update_weight_from_distributed.py
    • Reordered the continue_generation call and added a distributed barrier to ensure proper synchronization during weight updates for quantized models (see the synchronization sketch after this changelog).
  • miles/backends/megatron_utils/update_weight/update_weight_from_tensor.py
    • Adjusted the order of pause_generation and flush_cache calls and added a distributed barrier for better synchronization during weight updates.
  • miles/rollout/sglang_rollout.py
    • Implemented logic to update sample.loss_mask during partial rollouts when off-policy masking is enabled.
    • Modified generate_and_rm to prioritize per-sample custom generation function paths over global arguments.
    • Truncated sample.label in logging messages to prevent excessively long output.
    • Propagated custom_generate_function_path from dataset configuration to individual samples during evaluation.
  • miles/utils/arguments.py
    • Added an assertion to prevent misconfiguration of pipeline model parallel layers when pipeline_model_parallel_size is 1.
    • Made a minor formatting adjustment in a help string.
  • miles/utils/data.py
    • Corrected filter_long_prompt to always return a list of samples, even when no filtering is applied.
  • miles/utils/eval_config.py
    • Added custom_generate_function_path field to EvalDatasetConfig for per-dataset custom generation functions.
  • miles/utils/ppo_utils.py
    • Fixed a bug by applying the loss_mask directly to the full_kl_response before calculating token_level_rewards in get_reinforce_plus_plus_returns.
  • miles/utils/types.py
    • Added generate_function_path field to the Sample class to support per-sample custom generation functions.
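As referenced in the quantizer_compressed_tensors.py entry above, the prefix-matching ignore behavior can be pictured with a small helper like the following; the function name and rule format are illustrative and may not match the actual code.

from typing import Iterable

def should_ignore(param_name: str, ignore_rules: Iterable[str]) -> bool:
    # A rule matches either the exact parameter name or a prefix of it,
    # e.g. the rule "lm_head" also excludes "lm_head.weight" from quantization.
    return any(param_name == rule or param_name.startswith(rule) for rule in ignore_rules)

# should_ignore("model.embed_tokens.weight", ["model.embed_tokens"]) -> True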
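Likewise, the synchronization ordering referenced in the update_weight entries above roughly amounts to pausing generation, loading weights, and placing a barrier before resuming. The engine object and its pause_generation, flush_cache, load_weights, and continue_generation methods are hypothetical stand-ins for the rollout-engine calls touched by this PR; only torch.distributed is a real API here.

import torch.distributed as dist

def update_weights_sketch(engine, new_weights):
    engine.pause_generation()     # stop in-flight rollouts before touching weights
    engine.flush_cache()          # drop any KV cache built against the old weights
    engine.load_weights(new_weights)
    if dist.is_initialized():
        dist.barrier()            # wait until every rank has finished loading
    engine.continue_generation()  # resume rollouts only after all ranks are in sync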
Activity
  • No specific activity (comments, reviews, progress) has been recorded for this pull request yet.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a key fix in the REINFORCE++ returns calculation by applying the loss mask to the KL divergence, ensuring that the KL penalty is only applied to on-policy tokens. It also includes several other valuable improvements: a bug fix in filter_long_prompt to return the correct type, a new feature to support per-sample/per-dataset custom generation functions, and better argument validation.

I've found one minor issue in miles/utils/ppo_utils.py where a line of code became redundant after the changes. My suggestion is to remove it for cleaner code. Overall, these are great changes that improve correctness and add useful flexibility.

Comment on lines +259 to +260
masked_kl = full_kl_response * full_mask
token_level_rewards = -kl_coef * masked_kl

Severity: medium

The assignment to token_level_rewards on line 256 is now redundant because it's immediately overwritten on line 260. To make the code cleaner and more efficient, you can remove the initial assignment. Since line 256 is a context line and cannot be modified in this review, I'm suggesting a change here that effectively does that by combining these two lines. The logic should be to compute masked_kl and then initialize token_level_rewards from it.

Suggested change
- masked_kl = full_kl_response * full_mask
- token_level_rewards = -kl_coef * masked_kl
+ token_level_rewards = -kl_coef * (full_kl_response * full_mask)
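For context on the surrounding function, here is a minimal, self-contained sketch of how the masked KL term can feed into REINFORCE++-style returns. Only full_kl_response, full_mask, kl_coef, and the masking step reflect the diff above; the reward placement, the discounting loop, gamma, and the overall signature are simplifying assumptions, not the repository's actual get_reinforce_plus_plus_returns.

import torch

def reinforce_plus_plus_returns_sketch(
    full_kl_response: torch.Tensor,  # (batch, response_len) per-token KL vs. the reference policy
    full_mask: torch.Tensor,         # (batch, response_len) loss mask, 1 = token contributes to the loss
    reward: torch.Tensor,            # (batch,) sequence-level reward
    kl_coef: float,
    gamma: float = 1.0,
) -> torch.Tensor:
    # Zero the KL on masked-out tokens so they contribute no penalty (the fix in this PR).
    masked_kl = full_kl_response * full_mask
    token_level_rewards = -kl_coef * masked_kl
    # Credit the sequence-level reward to the last unmasked token (assumes a contiguous mask).
    batch_idx = torch.arange(reward.shape[0], device=reward.device)
    last_idx = full_mask.sum(dim=-1).long().clamp(min=1) - 1
    token_level_rewards[batch_idx, last_idx] += reward
    # A discounted suffix sum turns per-token rewards into per-token returns.
    returns = torch.zeros_like(token_level_rewards)
    running = torch.zeros_like(reward)
    for t in reversed(range(token_level_rewards.shape[1])):
        running = token_level_rewards[:, t] + gamma * running
        returns[:, t] = running
    return returns * full_mask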

@fzyzcjy requested a review from maocheng23 as a code owner on February 7, 2026 05:02
