
Auto pick num nodes in preparing megatron ckpt #607

Open
fzyzcjy wants to merge 3 commits into feat/verify_done from feat/generalize_num_nodes

Conversation

@fzyzcjy (Collaborator) commented Feb 15, 2026

No description provided.

The converter asserts world_size <= num_layers, which fails when using
all 8 nodes (64 GPUs) for a 20-layer model. Add num_nodes parameter to
exec_command_all_ray_node and convert_checkpoint so callers can limit
the number of nodes used. For DeepSeek-V3-0324-20layer, use only 2
nodes (16 GPUs) for conversion.
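
The constraint at play is world_size = num_nodes × num_gpus_per_node <= num_layers, so on an 8-node, 8-GPU-per-node cluster a 20-layer model can use at most 2 full nodes. A minimal sketch of the capping logic, assuming a hypothetical helper name pick_num_nodes (the PR's actual helper and signature may differ):

```python
# Hypothetical sketch of the node-capping idea from the commit message above.
def pick_num_nodes(num_layers: int, total_nodes: int, num_gpus_per_node: int = 8) -> int:
    """Largest node count whose world size still satisfies world_size <= num_layers."""
    max_nodes = num_layers // num_gpus_per_node
    assert max_nodes >= 1, "model has fewer layers than GPUs on a single node"
    return min(total_nodes, max_nodes)

# DeepSeek-V3-0324-20layer on 8 nodes: 20 // 8 == 2 nodes (16 GPUs),
# so world_size == 16 <= 20 and the converter's assertion holds.
```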
@fzyzcjy fzyzcjy changed the base branch from main to feat/verify_done on February 15, 2026 14:42
@gemini-code-assist (Contributor) commented
Summary of Changes

Hello @fzyzcjy, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the infrastructure for distributed model training and checkpoint preparation by introducing a robust Ray-based utility for executing commands across multiple nodes. This change streamlines multinode operations and enables dynamic configuration of distributed tasks. Furthermore, it standardizes file paths across various training scripts, improving configurability and maintainability. The checkpoint conversion process has also been made more reliable through better synchronization mechanisms.

Highlights

  • Multinode Command Execution Utility: Introduced a new utility function, exec_command_all_ray_node, to execute shell commands across all active Ray nodes in parallel, supporting dynamic placeholder substitution for node-specific parameters (a minimal sketch follows this list).
  • Enhanced Megatron Checkpoint Conversion: The convert_checkpoint function was updated to leverage the new exec_command_all_ray_node for multinode setups, allowing for automatic determination of the number of nodes and more flexible path configurations.
  • Standardized Path Management: New ScriptArgs parameters (data_dir, model_dir, model_local_dir, megatron_path, output_dir) were added and consistently applied across various training scripts to standardize and centralize the management of file paths.
  • Refactored Training Script Logic: Training scripts like run_deepseek.py and run_glm45_355b_a32b.py were refactored into modular helper functions (_prepare_download, _prepare_bf16_ckpt, _prepare_megatron_ckpt, _prepare_cp, _execute_train) for improved organization and reusability.
  • Improved Checkpoint Tracker Synchronization: The logic for updating the checkpoint tracker in convert_hf_to_torch_dist.py was reordered to ensure the 'release' status is written only after all distributed processes have synchronized, enhancing reliability.
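
A minimal sketch of what the multinode utility could look like, assuming a {node_rank}/{num_nodes} placeholder syntax and shell execution; the PR's actual placeholder names and signature may differ:

```python
import subprocess

import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy


@ray.remote(num_cpus=0)
def _exec_command_on_node(command: str) -> None:
    # Runs on whichever node the scheduling strategy pins this task to.
    subprocess.run(command, shell=True, check=True)


def exec_command_all_ray_node(command_template: str, num_nodes: int | None = None) -> None:
    alive = [n for n in ray.nodes() if n["Alive"]]
    nodes = alive if num_nodes is None else alive[:num_nodes]
    refs = []
    for rank, node in enumerate(nodes):
        command = command_template.format(node_rank=rank, num_nodes=len(nodes))
        refs.append(
            _exec_command_on_node.options(
                scheduling_strategy=NodeAffinitySchedulingStrategy(
                    node_id=node["NodeID"], soft=False  # pin to this exact node
                )
            ).remote(command)
        )
    ray.get(refs)  # surfaces any non-zero exit status as an exception
```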
Changelog
  • miles/utils/external_utils/command_utils.py
    • Imported partial from functools and exec_command_all_ray_node from miles.utils.misc.
    • Modified convert_checkpoint to accept num_nodes and megatron_path arguments, and to use exec_command_all_ray_node for multinode execution with dynamic placeholder substitution.
    • Updated rsync_simple to use exec_command_all_ray_node.
    • Modified hf_download_dataset to accept a data_dir argument.
    • Updated fp8_cast_bf16 to use a more robust sentinel file check and repo_base_dir.
    • Added output_dir to ExecuteTrainConfig.
    • Modified execute_train to accept megatron_path and use config.output_dir for CUDA coredump files.
  • miles/utils/misc.py
    • Imported re and NodeAffinitySchedulingStrategy.
    • Added _exec_command_on_node remote Ray function.
    • Implemented exec_command_all_ray_node to execute commands on all Ray nodes with placeholder support.
  • scripts/run_deepseek.py
    • Added data_dir, model_dir, model_local_dir, megatron_path to ScriptArgs.
    • Refactored prepare_single, prepare_spmd, prepare_cp, train into internal helper functions (_prepare_download, _prepare_bf16_ckpt, _prepare_megatron_ckpt, _prepare_cp, _execute_train).
    • Updated all file paths in these functions to use the new ScriptArgs parameters.
    • Modified _prepare_megatron_ckpt to automatically determine num_nodes and num_gpus_per_node for specific models.
    • Created a new train command that orchestrates the helper functions.
  • scripts/run_glm45_355b_a32b.py
    • Added data_dir, model_dir, model_local_dir, megatron_path to ScriptArgs.
    • Refactored prepare_single, prepare_spmd, prepare_cp, train into internal helper functions (_prepare_download, _convert_hf_to_fp8, _prepare_megatron_ckpt, _prepare_cp, _execute_train).
    • Updated all file paths in these functions to use the new ScriptArgs parameters.
    • Created a new train command that orchestrates the helper functions.
  • scripts/run_mcore_fsdp.py
    • Added data_dir, model_dir, megatron_path to ScriptArgs.
    • Updated prepare and execute functions to use the new ScriptArgs parameters for file paths.
  • scripts/run_qwen3_30b_a3b.py
    • Added data_dir, model_dir, megatron_path to ScriptArgs.
    • Updated prepare and execute functions to use the new ScriptArgs parameters for file paths.
  • scripts/run_qwen3_4b.py
    • Added data_dir, model_dir, megatron_path to ScriptArgs.
    • Updated prepare and execute functions to use the new ScriptArgs parameters for file paths.
  • tools/convert_hf_to_torch_dist.py
    • Reordered the writing of the 'release' checkpoint tracker file to occur after dist.barrier() to ensure proper synchronization in distributed environments (sketched below).
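
The tracker reordering in the last item is the usual barrier-then-publish pattern. A sketch of the fixed ordering, assuming torch.distributed and the conventional Megatron tracker filename (an assumption, not verified against this repo):

```python
import os

import torch.distributed as dist


def finalize_checkpoint(save_dir: str) -> None:
    # Wait until every rank has finished writing its shard of the checkpoint...
    dist.barrier()
    # ...and only then let rank 0 publish the tracker, so a concurrent reader
    # can never see a 'release' marker pointing at half-written shards.
    if dist.get_rank() == 0:
        tracker = os.path.join(save_dir, "latest_checkpointed_iteration.txt")
        with open(tracker, "w") as f:
            f.write("release")
```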

@gemini-code-assist bot left a comment
Code Review

This pull request introduces a feature to automatically select the number of nodes for preparing Megatron checkpoints. The changes add a num_nodes parameter to exec_command_all_ray_node and convert_checkpoint functions, and run_deepseek.py is updated to use this for specific models. The implementation looks good. I've added one comment regarding an edge case where num_nodes could be zero, which would cause a crash.
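
For the zero-node edge case flagged above, a hedged guard sketch (names are illustrative, not the PR's actual code):

```python
def pick_num_nodes_checked(num_layers: int, total_nodes: int, num_gpus_per_node: int = 8) -> int:
    max_nodes = num_layers // num_gpus_per_node
    if max_nodes == 0:
        # e.g. a 4-layer debug model with 8 GPUs per node: 4 // 8 == 0
        raise ValueError(
            f"num_layers={num_layers} < num_gpus_per_node={num_gpus_per_node}; "
            "cannot satisfy world_size <= num_layers with even one full node"
        )
    return min(total_nodes, max_nodes)
```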

