[fix & feature] fix weight update from distributed and extend sample fields#548

Open
guapisolo wants to merge 9 commits into radixark:main from guapisolo:auto/20260202221742

Conversation

@guapisolo
Collaborator

Prevent experimental features from being deleted.

@gemini-code-assist
Contributor

Summary of Changes

Hello @guapisolo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request synchronizes internal features and bug fixes, primarily focusing on enhancing the robustness and flexibility of distributed training and evaluation workflows. Key improvements include more granular control over quantization ignore rules, refined synchronization mechanisms for distributed weight updates, and the introduction of per-sample and per-dataset custom generation function paths for more adaptable evaluation strategies. These changes aim to prevent the accidental removal of experimental features and ensure stable operation in complex distributed environments.

Highlights

  • Enhanced Quantization Ignore Rules: The logic for ignoring parameters during quantization has been made more flexible, now supporting prefix matching in addition to exact name matches and regular expressions.
  • Improved Distributed Weight Update Synchronization: Synchronization mechanisms for distributed weight updates have been refined across multiple files, ensuring proper pausing and resuming of generation, and correct ordering of quantization post-processing steps using distributed barriers.
  • Flexible Custom Generation Functions: The system now supports specifying custom generation function paths at both the individual sample level and the dataset configuration level, allowing for more granular control over generation logic during evaluation.
  • Refined Argument Validation: An assertion related to context parallelism for FSDP backend was removed, and a new assertion was added to ensure consistency in pipeline parallelism configuration when pipeline_model_parallel_size is 1.
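The new pipeline-parallelism check in the last highlight can be sketched roughly as follows. This is a minimal illustration rather than the code in miles/utils/arguments.py: the helper name `validate_pipeline_args` and the error messages are hypothetical, and only the three attribute names are taken from the summary.

```python
from types import SimpleNamespace


def validate_pipeline_args(args) -> None:
    """Hypothetical helper mirroring the assertion described in the summary."""
    if args.pipeline_model_parallel_size == 1:
        # With a pipeline size of 1, the per-stage layer counts must be unset.
        assert args.decoder_first_pipeline_num_layers is None, (
            "decoder_first_pipeline_num_layers must be None "
            "when pipeline_model_parallel_size == 1"
        )
        assert args.decoder_last_pipeline_num_layers is None, (
            "decoder_last_pipeline_num_layers must be None "
            "when pipeline_model_parallel_size == 1"
        )


# Usage: a consistent configuration passes silently.
validate_pipeline_args(
    SimpleNamespace(
        pipeline_model_parallel_size=1,
        decoder_first_pipeline_num_layers=None,
        decoder_last_pipeline_num_layers=None,
    )
)
```

A configuration that sets either layer count while the pipeline size is 1 would fail with an AssertionError at argument-validation time instead of later during model construction.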

Changelog

  • miles/backends/megatron_utils/megatron_to_hf/processors/quantizer_compressed_tensors.py
    • Modified the is_ignored logic in quantize_params_compressed_tensors to include prefix matching (name.startswith(r)) for ignore rules, increasing flexibility.
  • miles/backends/megatron_utils/update_weight/update_weight_from_distributed.py
    • Adjusted synchronization logic by moving dist.barrier before the rank 0 block and reordering ray.get calls for continue_generation to ensure quantization post-processing completes before generation resumes.
    • Added a blank line for minor formatting.
  • miles/backends/megatron_utils/update_weight/update_weight_from_tensor.py
    • Introduced ray.get([engine.pause_generation.remote()]) at the start of the rank 0 block to pause generation before weight updates.
    • Added dist.barrier(group=get_gloo_group()) before the post-processing block to ensure all ranks are synchronized.
    • Moved ray.get([engine.continue_generation.remote()]) to occur after post_process_weights within the rank 0 block, ensuring generation resumes only after post-processing.
  • miles/rollout/sglang_rollout.py
    • Modified generate_and_rm to prioritize sample.generate_function_path for custom generation functions, falling back to args.custom_generate_function_path.
    • Truncated sample.label to 100 characters in logger.info calls for better log readability.
    • Propagated custom_generate_function_path from dataset_cfg to individual sample objects during eval_rollout_single_dataset.
  • miles/utils/arguments.py
    • Fixed a minor formatting issue in a help string by adding a closing parenthesis.
    • Removed an assertion that args.context_parallel_size == 1 for FSDP backend.
    • Added a new assertion: if pipeline_model_parallel_size == 1, then decoder_first_pipeline_num_layers and decoder_last_pipeline_num_layers must be None.
  • miles/utils/eval_config.py
    • Added custom_generate_function_path: str | None = None to the EvalDatasetConfig dataclass, allowing per-dataset custom generation function specification.
  • miles/utils/types.py
    • Added generate_function_path: str | None = None to the Sample dataclass, enabling individual samples to carry their own custom generation function path.
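The ordering described for update_weight_from_tensor.py (pause, update, barrier, post-process, resume) can be illustrated with a small simulation. This is a sketch under stated assumptions, not the PR's code: `FakeEngine`, `update_weights`, and `run` are hypothetical, threads stand in for ranks, `threading.Barrier` stands in for `dist.barrier(group=get_gloo_group())`, and direct method calls stand in for `ray.get([...remote()])`.

```python
import threading


class FakeEngine:
    """Hypothetical stand-in for a rollout engine; it only records calls."""

    def __init__(self, log):
        self.log = log

    def pause_generation(self):
        self.log.append("pause_generation")

    def post_process_weights(self):
        self.log.append("post_process_weights")

    def continue_generation(self):
        self.log.append("continue_generation")


def update_weights(rank, engine, barrier, log):
    if rank == 0:
        # Rank 0 pauses generation before it sends any weights.
        engine.pause_generation()
    # Every rank sends its part of the weights (simulated by a log entry).
    log.append(f"send_weights:{rank}")
    # No rank proceeds until all ranks have finished sending.
    barrier.wait()
    if rank == 0:
        # Post-processing runs first; generation resumes only afterwards.
        engine.post_process_weights()
        engine.continue_generation()


def run(world_size=4):
    log = []  # list.append is atomic in CPython, so no extra lock is used
    engine = FakeEngine(log)
    barrier = threading.Barrier(world_size)
    threads = [
        threading.Thread(target=update_weights, args=(rank, engine, barrier, log))
        for rank in range(world_size)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return log


if __name__ == "__main__":
    print(run())
```

In this sketch the barrier is what guarantees that every `send_weights` entry precedes `post_process_weights`. The relative order of rank 0's pause and the other ranks' sends is not constrained here, and the sketch makes no claim about how the real code handles that.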

Activity

  • No specific activity (comments, reviews, etc.) was provided for this pull request.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces several features, such as the ability to specify custom generation functions on a per-sample or per-dataset basis, improved distributed weight updates with better synchronization, and enhanced logging. However, it introduces a critical security flaw by allowing arbitrary Python functions to be loaded and executed based on paths specified in evaluation configurations or datasets. This feature requires a redesign to use a strict allow-list of permitted functions or a more secure plugin architecture. Additionally, a potential issue exists in quantizer_compressed_tensors.py where the logic for ignoring parameters with regex rules might lead to incorrect behavior.

Comment on lines +218 to +221
custom_func_path = getattr(sample, "generate_function_path", None) or args.custom_generate_function_path

if custom_func_path is not None:
    custom_generate_func = load_function(custom_func_path)

security-critical critical

The introduction of sample.generate_function_path as a source for load_function creates a critical Arbitrary Code Execution (ACE) vulnerability. This attribute can be populated from evaluation dataset configurations (dataset_cfg.custom_generate_function_path) or potentially from the dataset itself if it is loaded via Sample.from_dict. Since load_function dynamically imports and returns any attribute from any module in the Python path, and the resulting object is subsequently called (lines 224, 226), an attacker who can influence the evaluation configuration or the dataset can execute arbitrary Python code on the worker nodes. This is particularly dangerous in environments where datasets or configurations are shared or sourced from external repositories.
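One way to keep the per-sample override while closing this hole is to resolve paths only through an explicit allow-list, as the review summary suggests. The sketch below is an assumption-laden illustration: `ALLOWED_GENERATE_FUNCTIONS`, `load_allowed_function`, and `resolve_generate_function` are hypothetical names, the dotted `module.attribute` path format is assumed rather than taken from `load_function`, and `json.dumps` is only a harmless placeholder entry.

```python
import importlib
from types import SimpleNamespace

# Hypothetical allow-list; a real deployment would list its own vetted functions.
ALLOWED_GENERATE_FUNCTIONS = frozenset({"json.dumps"})


def load_allowed_function(path: str):
    """Import `module.attr` only if the full path is explicitly allowed."""
    if path not in ALLOWED_GENERATE_FUNCTIONS:
        raise ValueError(f"generate function {path!r} is not on the allow-list")
    module_name, _, attr_name = path.rpartition(".")
    return getattr(importlib.import_module(module_name), attr_name)


def resolve_generate_function(sample, args):
    """Per-sample path takes priority, then the global argument, else None."""
    path = getattr(sample, "generate_function_path", None) or args.custom_generate_function_path
    return None if path is None else load_allowed_function(path)


# Usage: the sample-level path wins over the (unset) global one.
args = SimpleNamespace(custom_generate_function_path=None)
sample = SimpleNamespace(generate_function_path="json.dumps")
func = resolve_generate_function(sample, args)
```

The check happens before any import, so a path such as `os.system` is rejected without its module ever being loaded on behalf of the dataset.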

sample.index = sample_index
sample_index += 1
sample.metadata = dataset_cfg.inject_metadata(getattr(sample, "metadata", None))
sample.generate_function_path = getattr(dataset_cfg, "custom_generate_function_path", None)

security-critical critical

Setting sample.generate_function_path directly from dataset_cfg without validation allows the evaluation configuration to dictate which Python functions are executed during the rollout process. If the evaluation configuration is sourced from an untrusted or user-supplied file, this leads to arbitrary code execution.
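The missing validation could also be applied when the evaluation configuration is parsed, so that an untrusted path is rejected before it ever reaches a sample. The following is a minimal sketch, not the actual EvalDatasetConfig: the class name `EvalDatasetConfigSketch`, the `name` field, and the `TRUSTED_MODULE_PREFIXES` tuple are hypothetical, and only the `custom_generate_function_path` field comes from the PR.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical set of module prefixes that a project might consider trusted.
TRUSTED_MODULE_PREFIXES = ("miles.rollout.",)


@dataclass
class EvalDatasetConfigSketch:
    """Illustrative stand-in for a per-dataset evaluation config."""

    name: str
    custom_generate_function_path: str | None = None

    def __post_init__(self) -> None:
        path = self.custom_generate_function_path
        # Reject any path outside the trusted prefixes at construction time.
        if path is not None and not path.startswith(TRUSTED_MODULE_PREFIXES):
            raise ValueError(
                f"custom_generate_function_path {path!r} is outside the trusted modules"
            )


# Usage: a config without a custom path, or with a trusted one, is accepted.
default_cfg = EvalDatasetConfigSketch(name="example")
trusted_cfg = EvalDatasetConfigSketch(
    name="example",
    custom_generate_function_path="miles.rollout.custom.generate",
)
```

A prefix check of this kind is coarser than an exact allow-list: it still permits any attribute inside the trusted package, so it narrows the exposure rather than eliminating it.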

 for name, param in converted_named_params:
-    is_ignored = any((r.startswith("re:") and re.match(r[3:], name)) or r == name for r in ignore_rules)
+    is_ignored = any(
+        (r.startswith("re:") and re.match(r[3:], name)) or r == name or name.startswith(r) for r in ignore_rules

medium

The current logic for checking ignore rules has a potential issue. A rule prefixed with re: is intended to be a regular expression, but it's also being checked as a prefix with name.startswith(r). This could lead to unintended matches. For example, if an ignore rule is "re:model.layers" and a parameter is named "re:model.layers.attention", it would be incorrectly ignored due to the prefix match, even if the regex itself doesn't match.

To fix this, we should separate the logic for regex rules and plain string rules (exact match or prefix).

Suggested change
-    (r.startswith("re:") and re.match(r[3:], name)) or r == name or name.startswith(r) for r in ignore_rules
+    (re.match(r[3:], name) if r.startswith("re:") else (r == name or name.startswith(r))) for r in ignore_rules
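The separation proposed in the suggested change can be written out as a standalone helper to make the three rule types easier to see. This is a sketch of the suggested logic only: the function name `is_ignored` as a module-level helper and the example parameter names are assumptions, not code from quantizer_compressed_tensors.py.

```python
import re
from typing import Iterable


def is_ignored(name: str, ignore_rules: Iterable[str]) -> bool:
    """Return True if `name` matches any ignore rule.

    Rules starting with "re:" are treated only as regular expressions;
    all other rules match by exact name or by prefix.
    """
    for rule in ignore_rules:
        if rule.startswith("re:"):
            # Regex rule: strip the "re:" marker and match from the start.
            if re.match(rule[3:], name):
                return True
        elif rule == name or name.startswith(rule):
            # Plain rule: exact match or prefix match.
            return True
    return False


# Usage with hypothetical parameter names.
rules = ["lm_head", "model.layers.0.mlp", r"re:model\.layers\.\d+\.attn"]
ignored = [
    name
    for name in ["lm_head", "model.layers.0.mlp.gate", "model.layers.3.attn", "model.embed"]
    if is_ignored(name, rules)
]
```

Because a regex rule is never tested as a literal prefix here, a parameter whose name happens to start with the text of a "re:" rule is no longer ignored unless the pattern itself matches.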

@guapisolo guapisolo force-pushed the auto/20260202221742 branch 2 times, most recently from a6083d7 to 70124f2 Compare February 2, 2026 23:31
@guapisolo guapisolo changed the title from "[sync] sync internal feature and bugfix" to "[sync] sync update weight feature and bugfix" Feb 2, 2026
@guapisolo guapisolo marked this pull request as draft February 3, 2026 06:40
@guapisolo guapisolo force-pushed the auto/20260202221742 branch from 27b0313 to 0d168ff Compare February 3, 2026 22:45
@guapisolo guapisolo changed the title from "[sync] sync update weight feature and bugfix" to "update code" Feb 3, 2026
@guapisolo guapisolo marked this pull request as ready for review February 3, 2026 22:52
@guapisolo guapisolo changed the title from "update code" to "[fix & feature] fix weight update from distributed and extend sample fields" Feb 3, 2026
@guapisolo guapisolo force-pushed the auto/20260202221742 branch 2 times, most recently from 49ebac9 to aa96b87 Compare February 6, 2026 18:50
@guapisolo guapisolo requested a review from maocheng23 as a code owner February 6, 2026 18:50
guapisolo and others added 8 commits February 16, 2026 20:12
Co-Authored-By: miles-code-angel <miles.pr.bot@gmail.com>
Co-Authored-By: miles-code-angel <miles.pr.bot@gmail.com>
Co-Authored-By: miles-code-angel <miles.pr.bot@gmail.com>
Co-Authored-By: miles-code-angel <miles.pr.bot@gmail.com>
Tests for: prefix-matching ignore rules in quantizer, new
generate_function_path field on Sample and EvalDatasetConfig,
and per-sample custom generate function path priority logic.
The _create_with_all_fields check requires all Sample fields to be
explicitly provided. Adding the new generate_function_path field
to _merge_sample_pair to fix 6 failing tests.
These only tested dataclass field defaults with no real logic,
adding maintenance burden without value.
Consolidate 14 tests into 8 by merging single-param cases into one
parametrize, extracting _is_ignored helper, and removing tests for
pre-existing behavior unrelated to this change.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>