[fix] Apply loss mask to KL in REINFORCE++ returns calculation #559

Open

guapisolo wants to merge 7 commits into radixark:main from guapisolo:fix_kl_reinforce

Conversation

@guapisolo
Collaborator

No description provided.

guapisolo and others added 7 commits February 4, 2026 00:31
Co-Authored-By: miles-code-angel <miles.pr.bot@gmail.com>
@gemini-code-assist
Contributor

Summary of Changes

Hello @guapisolo, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request primarily addresses a critical bug in the REINFORCE++ returns calculation by ensuring the loss mask is correctly applied to the KL divergence. Beyond this essential fix, it significantly enhances the system's flexibility by allowing custom generation functions to be defined at both dataset and sample levels, catering to more complex evaluation and training setups. Additionally, it improves the robustness of distributed weight updates for quantized models through better synchronization mechanisms and expands the utility of quantization ignore rules.

Highlights

  • Core Fix: REINFORCE++ KL Masking: Correctly applies the loss mask to the KL divergence term in the REINFORCE++ returns calculation, ensuring that only relevant tokens contribute to the KL penalty and improving the accuracy of the policy gradient.
  • Flexible Custom Generation Functions: Introduces support for specifying custom generation functions at both the dataset and individual sample levels, enhancing flexibility for advanced use cases such as tool calling or specific evaluation scenarios (a small sketch follows this list).
  • Improved Distributed Weight Update Synchronization: Refines the synchronization logic during distributed weight updates for quantized models by ensuring proper pausing/resuming of generation and adding distributed barriers, leading to more robust and predictable behavior.
  • Expanded Quantization Ignore Rules: Extends the ignore rules for parameter quantization to include prefix matching, providing more granular control over which parameters are excluded from the quantization process.
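To make the custom-generation highlight concrete, here is a minimal sketch of how a per-sample override could be resolved ahead of a global argument. Only the field name generate_function_path comes from this PR; the Sample fields shown, the "module:function" path convention, and the load_function / resolve_generate_fn helpers are illustrative assumptions rather than the project's actual API.

import importlib
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Sample:
    prompt: str
    generate_function_path: Optional[str] = None  # per-sample override added by this PR

def load_function(path: str) -> Callable:
    # Assumed "pkg.module:func_name" convention; resolve the module, then the attribute.
    module_name, func_name = path.split(":")
    return getattr(importlib.import_module(module_name), func_name)

def resolve_generate_fn(sample: Sample, global_path: Optional[str], default_fn: Callable) -> Callable:
    # The per-sample path takes priority over the global argument; otherwise use the default.
    path = sample.generate_function_path or global_path
    return load_function(path) if path else default_fn

In this scheme, a dataset-level custom_generate_function_path would simply be copied onto each of its samples before rollout, which matches the propagation described in the sglang_rollout.py changelog entry below.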
Changelog
  • miles/backends/megatron_utils/megatron_to_hf/processors/quantizer_compressed_tensors.py
    • Expanded quantization ignore rules to support prefix matching for parameter names (see the prefix-matching sketch after this changelog).
  • miles/backends/megatron_utils/update_weight/update_weight_from_distributed.py
    • Reordered the continue_generation call and added a distributed barrier to ensure proper synchronization during weight updates for quantized models (see the synchronization sketch after this changelog).
  • miles/backends/megatron_utils/update_weight/update_weight_from_tensor.py
    • Adjusted the order of pause_generation and flush_cache calls and added a distributed barrier for better synchronization during weight updates.
  • miles/rollout/sglang_rollout.py
    • Implemented logic to update sample.loss_mask during partial rollouts when off-policy masking is enabled.
    • Modified generate_and_rm to prioritize per-sample custom generation function paths over global arguments.
    • Truncated sample.label in logging messages to prevent excessively long output.
    • Propagated custom_generate_function_path from dataset configuration to individual samples during evaluation.
  • miles/utils/arguments.py
    • Added an assertion to prevent misconfiguration of pipeline model parallel layers when pipeline_model_parallel_size is 1.
    • Made a minor formatting adjustment in a help string.
  • miles/utils/data.py
    • Corrected filter_long_prompt to always return a list of samples, even when no filtering is applied.
  • miles/utils/eval_config.py
    • Added custom_generate_function_path field to EvalDatasetConfig for per-dataset custom generation functions.
  • miles/utils/ppo_utils.py
    • Fixed a bug by applying the loss_mask directly to the full_kl_response before calculating token_level_rewards in get_reinforce_plus_plus_returns.
  • miles/utils/types.py
    • Added generate_function_path field to the Sample class to support per-sample custom generation functions.
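As referenced in the quantizer_compressed_tensors.py entry above, the prefix-matching ignore behavior can be pictured with a small helper like the following; the function name and rule format are illustrative and may not match the actual code.

from typing import Iterable

def should_ignore(param_name: str, ignore_rules: Iterable[str]) -> bool:
    # A rule matches either the exact parameter name or a prefix of it,
    # e.g. the rule "lm_head" also excludes "lm_head.weight" from quantization.
    return any(param_name == rule or param_name.startswith(rule) for rule in ignore_rules)

# should_ignore("model.embed_tokens.weight", ["model.embed_tokens"]) -> True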
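Likewise, the synchronization ordering referenced in the update_weight entries above roughly amounts to pausing generation, loading weights, and placing a barrier before resuming. The engine object and its pause_generation, flush_cache, load_weights, and continue_generation methods are hypothetical stand-ins for the rollout-engine calls touched by this PR; only torch.distributed is a real API here.

import torch.distributed as dist

def update_weights_sketch(engine, new_weights):
    engine.pause_generation()     # stop in-flight rollouts before touching weights
    engine.flush_cache()          # drop any KV cache built against the old weights
    engine.load_weights(new_weights)
    if dist.is_initialized():
        dist.barrier()            # wait until every rank has finished loading
    engine.continue_generation()  # resume rollouts only after all ranks are in sync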
Activity
  • No specific activity (comments, reviews, progress) has been recorded for this pull request yet.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a key fix in the REINFORCE++ returns calculation by applying the loss mask to the KL divergence, ensuring that the KL penalty is only applied to on-policy tokens. It also includes several other valuable improvements: a bug fix in filter_long_prompt to return the correct type, a new feature to support per-sample/per-dataset custom generation functions, and better argument validation.

I've found one minor issue in miles/utils/ppo_utils.py where a line of code became redundant after the changes. My suggestion is to remove it for cleaner code. Overall, these are great changes that improve correctness and add useful flexibility.

Comment on lines +259 to +260
masked_kl = full_kl_response * full_mask
token_level_rewards = -kl_coef * masked_kl

Severity: medium

The assignment to token_level_rewards on line 256 is now redundant because it's immediately overwritten on line 260. To make the code cleaner and more efficient, you can remove the initial assignment. Since line 256 is a context line and cannot be modified in this review, I'm suggesting a change here that effectively does that by combining these two lines. The logic should be to compute masked_kl and then initialize token_level_rewards from it.

Suggested change
- masked_kl = full_kl_response * full_mask
- token_level_rewards = -kl_coef * masked_kl
+ token_level_rewards = -kl_coef * (full_kl_response * full_mask)
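For context on the surrounding function, here is a minimal, self-contained sketch of how the masked KL term can feed into REINFORCE++-style returns. Only full_kl_response, full_mask, kl_coef, and the masking step reflect the diff above; the reward placement, the discounting loop, gamma, and the overall signature are simplifying assumptions, not the repository's actual get_reinforce_plus_plus_returns.

import torch

def reinforce_plus_plus_returns_sketch(
    full_kl_response: torch.Tensor,  # (batch, response_len) per-token KL vs. the reference policy
    full_mask: torch.Tensor,         # (batch, response_len) loss mask, 1 = token contributes to the loss
    reward: torch.Tensor,            # (batch,) sequence-level reward
    kl_coef: float,
    gamma: float = 1.0,
) -> torch.Tensor:
    # Zero the KL on masked-out tokens so they contribute no penalty (the fix in this PR).
    masked_kl = full_kl_response * full_mask
    token_level_rewards = -kl_coef * masked_kl
    # Credit the sequence-level reward to the last unmasked token (assumes a contiguous mask).
    batch_idx = torch.arange(reward.shape[0], device=reward.device)
    last_idx = full_mask.sum(dim=-1).long().clamp(min=1) - 1
    token_level_rewards[batch_idx, last_idx] += reward
    # A discounted suffix sum turns per-token rewards into per-token returns.
    returns = torch.zeros_like(token_level_rewards)
    running = torch.zeros_like(reward)
    for t in reversed(range(token_level_rewards.shape[1])):
        running = token_level_rewards[:, t] + gamma * running
        returns[:, t] = running
    return returns * full_mask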

@fzyzcjy requested a review from maocheng23 as a code owner on February 7, 2026 05:02
