
Auto pick num nodes in preparing megatron ckpt #607

Open
fzyzcjy wants to merge 3 commits into feat/verify_done from feat/generalize_num_nodes

Conversation

@fzyzcjy (Collaborator) commented Feb 15, 2026

No description provided.

The converter asserts world_size <= num_layers, which fails when using
all 8 nodes (64 GPUs) for a 20-layer model. Add num_nodes parameter to
exec_command_all_ray_node and convert_checkpoint so callers can limit
the number of nodes used. For DeepSeek-V3-0324-20layer, use only 2
nodes (16 GPUs) for conversion.
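
The constraint at play is world_size = num_nodes × num_gpus_per_node <= num_layers, so on an 8-node, 8-GPU-per-node cluster a 20-layer model can use at most 2 full nodes. A minimal sketch of the capping logic, assuming a hypothetical helper name pick_num_nodes (the PR's actual helper and signature may differ):

```python
# Hypothetical sketch of the node-capping idea from the commit message above.
def pick_num_nodes(num_layers: int, total_nodes: int, num_gpus_per_node: int = 8) -> int:
    """Largest node count whose world size still satisfies world_size <= num_layers."""
    max_nodes = num_layers // num_gpus_per_node
    assert max_nodes >= 1, "model has fewer layers than GPUs on a single node"
    return min(total_nodes, max_nodes)

# DeepSeek-V3-0324-20layer on 8 nodes: 20 // 8 == 2 nodes (16 GPUs),
# so world_size == 16 <= 20 and the converter's assertion holds.
```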
@fzyzcjy fzyzcjy changed the base branch from main to feat/verify_done on February 15, 2026 14:42
@gemini-code-assist (Contributor) commented
Summary of Changes

Hello @fzyzcjy, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the infrastructure for distributed model training and checkpoint preparation by introducing a robust Ray-based utility for executing commands across multiple nodes. This change streamlines multinode operations and enables dynamic configuration of distributed tasks. Furthermore, it standardizes file paths across various training scripts, improving configurability and maintainability. The checkpoint conversion process has also been made more reliable through better synchronization mechanisms.

Highlights

  • Multinode Command Execution Utility: Introduced a new utility function, exec_command_all_ray_node, to execute shell commands across all active Ray nodes in parallel, supporting dynamic placeholder substitution for node-specific parameters (a minimal sketch follows this list).
  • Enhanced Megatron Checkpoint Conversion: The convert_checkpoint function was updated to leverage the new exec_command_all_ray_node for multinode setups, allowing for automatic determination of the number of nodes and more flexible path configurations.
  • Standardized Path Management: New ScriptArgs parameters (data_dir, model_dir, model_local_dir, megatron_path, output_dir) were added and consistently applied across various training scripts to standardize and centralize the management of file paths.
  • Refactored Training Script Logic: Training scripts like run_deepseek.py and run_glm45_355b_a32b.py were refactored into modular helper functions (_prepare_download, _prepare_bf16_ckpt, _prepare_megatron_ckpt, _prepare_cp, _execute_train) for improved organization and reusability.
  • Improved Checkpoint Tracker Synchronization: The logic for updating the checkpoint tracker in convert_hf_to_torch_dist.py was reordered to ensure the 'release' status is written only after all distributed processes have synchronized, enhancing reliability.
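
A minimal sketch of what the multinode utility could look like, assuming a {node_rank}/{num_nodes} placeholder syntax and shell execution; the PR's actual placeholder names and signature may differ:

```python
import subprocess

import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy


@ray.remote(num_cpus=0)
def _exec_command_on_node(command: str) -> None:
    # Runs on whichever node the scheduling strategy pins this task to.
    subprocess.run(command, shell=True, check=True)


def exec_command_all_ray_node(command_template: str, num_nodes: int | None = None) -> None:
    alive = [n for n in ray.nodes() if n["Alive"]]
    nodes = alive if num_nodes is None else alive[:num_nodes]
    refs = []
    for rank, node in enumerate(nodes):
        command = command_template.format(node_rank=rank, num_nodes=len(nodes))
        refs.append(
            _exec_command_on_node.options(
                scheduling_strategy=NodeAffinitySchedulingStrategy(
                    node_id=node["NodeID"], soft=False  # pin to this exact node
                )
            ).remote(command)
        )
    ray.get(refs)  # surfaces any non-zero exit status as an exception
```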
Changelog
  • miles/utils/external_utils/command_utils.py
    • Imported partial from functools and exec_command_all_ray_node from miles.utils.misc.
    • Modified convert_checkpoint to accept num_nodes and megatron_path arguments, and to use exec_command_all_ray_node for multinode execution with dynamic placeholder substitution.
    • Updated rsync_simple to use exec_command_all_ray_node.
    • Modified hf_download_dataset to accept a data_dir argument.
    • Updated fp8_cast_bf16 to use a more robust sentinel file check and repo_base_dir.
    • Added output_dir to ExecuteTrainConfig.
    • Modified execute_train to accept megatron_path and use config.output_dir for CUDA coredump files.
  • miles/utils/misc.py
    • Imported re and NodeAffinitySchedulingStrategy.
    • Added _exec_command_on_node remote Ray function.
    • Implemented exec_command_all_ray_node to execute commands on all Ray nodes with placeholder support.
  • scripts/run_deepseek.py
    • Added data_dir, model_dir, model_local_dir, megatron_path to ScriptArgs.
    • Refactored prepare_single, prepare_spmd, prepare_cp, train into internal helper functions (_prepare_download, _prepare_bf16_ckpt, _prepare_megatron_ckpt, _prepare_cp, _execute_train).
    • Updated all file paths in these functions to use the new ScriptArgs parameters.
    • Modified _prepare_megatron_ckpt to automatically determine num_nodes and num_gpus_per_node for specific models.
    • Created a new train command that orchestrates the helper functions.
  • scripts/run_glm45_355b_a32b.py
    • Added data_dir, model_dir, model_local_dir, megatron_path to ScriptArgs.
    • Refactored prepare_single, prepare_spmd, prepare_cp, train into internal helper functions (_prepare_download, _convert_hf_to_fp8, _prepare_megatron_ckpt, _prepare_cp, _execute_train).
    • Updated all file paths in these functions to use the new ScriptArgs parameters.
    • Created a new train command that orchestrates the helper functions.
  • scripts/run_mcore_fsdp.py
    • Added data_dir, model_dir, megatron_path to ScriptArgs.
    • Updated prepare and execute functions to use the new ScriptArgs parameters for file paths.
  • scripts/run_qwen3_30b_a3b.py
    • Added data_dir, model_dir, megatron_path to ScriptArgs.
    • Updated prepare and execute functions to use the new ScriptArgs parameters for file paths.
  • scripts/run_qwen3_4b.py
    • Added data_dir, model_dir, megatron_path to ScriptArgs.
    • Updated prepare and execute functions to use the new ScriptArgs parameters for file paths.
  • tools/convert_hf_to_torch_dist.py
    • Reordered the writing of the 'release' checkpoint tracker file to occur after dist.barrier() to ensure proper synchronization in distributed environments (sketched below).
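
The tracker reordering in the last item is the usual barrier-then-publish pattern. A sketch of the fixed ordering, assuming torch.distributed and the conventional Megatron tracker filename (an assumption, not verified against this repo):

```python
import os

import torch.distributed as dist


def finalize_checkpoint(save_dir: str) -> None:
    # Wait until every rank has finished writing its shard of the checkpoint...
    dist.barrier()
    # ...and only then let rank 0 publish the tracker, so a concurrent reader
    # can never see a 'release' marker pointing at half-written shards.
    if dist.get_rank() == 0:
        tracker = os.path.join(save_dir, "latest_checkpointed_iteration.txt")
        with open(tracker, "w") as f:
            f.write("release")
```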

@gemini-code-assist bot left a comment
Code Review

This pull request introduces a feature to automatically select the number of nodes for preparing Megatron checkpoints. The changes add a num_nodes parameter to exec_command_all_ray_node and convert_checkpoint functions, and run_deepseek.py is updated to use this for specific models. The implementation looks good. I've added one comment regarding an edge case where num_nodes could be zero, which would cause a crash.
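
For the zero-node edge case flagged above, a hedged guard sketch (names are illustrative, not the PR's actual code):

```python
def pick_num_nodes_checked(num_layers: int, total_nodes: int, num_gpus_per_node: int = 8) -> int:
    max_nodes = num_layers // num_gpus_per_node
    if max_nodes == 0:
        # e.g. a 4-layer debug model with 8 GPUs per node: 4 // 8 == 0
        raise ValueError(
            f"num_layers={num_layers} < num_gpus_per_node={num_gpus_per_node}; "
            "cannot satisfy world_size <= num_layers with even one full node"
        )
    return min(total_nodes, max_nodes)
```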

