Make large model training scripts one-click #601
fzyzcjy wants to merge 49 commits into feat/generalize_path
Conversation
Uses ray.remote function with NodeAffinitySchedulingStrategy to dispatch commands to every alive node in parallel. Clears CUDA_VISIBLE_DEVICES so subprocesses can access all GPUs without reserving Ray GPU resources.
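A minimal sketch of that dispatch pattern, assuming the helper is named exec_command_all_ray_node as in the later commits; the exact signature and error handling in the PR may differ:

```python
import subprocess

import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy


@ray.remote(num_gpus=0)  # reserve no Ray GPU resources
def _run_on_node(command: str) -> int:
    # `unset CUDA_VISIBLE_DEVICES` is scoped to the spawned shell, so the
    # subprocess sees every GPU while the Ray worker process stays untouched.
    return subprocess.run(
        ["bash", "-c", f"unset CUDA_VISIBLE_DEVICES && {command}"]
    ).returncode


def exec_command_all_ray_node(command: str) -> list[int]:
    alive_nodes = [n for n in ray.nodes() if n["Alive"]]
    futures = [
        _run_on_node.options(
            scheduling_strategy=NodeAffinitySchedulingStrategy(
                node_id=node["NodeID"], soft=False
            )
        ).remote(command)
        for node in alive_nodes
    ]
    return ray.get(futures)  # one task per alive node, executed in parallel
```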
- Replace exec_command with exec_command_all_ray_node
- Use the shell variable $SLURM_NODEID instead of Python-side interpolation so each node resolves its own rank
- Move the skip check into the shell command so each node checks locally
Replaces {{node_rank}}, {{nnodes}}, {{master_addr}}, and {{node_ip}} per node before dispatching the command.
Replace SLURM env var reading with {{master_addr}}, {{nnodes}}, and {{node_rank}} placeholders resolved by exec_command_all_ray_node.
Ensures the node running the driver is always rank 0, with remaining nodes sorted by IP for deterministic ordering.
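One way to express that ordering, as a hedged sketch (ray.util.get_node_ip_address is a real Ray helper; the function name and the rest are illustrative):

```python
import ray


def ordered_node_ips() -> list[str]:
    driver_ip = ray.util.get_node_ip_address()
    alive_ips = {n["NodeManagerAddress"] for n in ray.nodes() if n["Alive"]}
    others = sorted(ip for ip in alive_ips if ip != driver_ip)
    return [driver_ip] + others  # index in this list becomes the node rank
```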
…ddress" This reverts commit 36e2ce1.
Summary of Changes

Hello @fzyzcjy, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the capability for distributed command execution within a Ray cluster. By introducing a new utility, it streamlines the process of running shell commands across multiple nodes in parallel, offering greater flexibility through dynamic placeholders and improved GPU resource management. This change simplifies multi-node operations and reduces reliance on specific cluster management systems like SLURM for environment variable resolution.
Code Review
This pull request introduces exec_command_all_ray_node, a valuable utility for executing commands across all nodes in a Ray cluster, and migrates convert_checkpoint and rsync_simple to use it, promoting a more general, Ray-native approach. However, this migration unfortunately preserves and expands the impact of pre-existing command injection vulnerabilities. Critical vulnerabilities exist where parameters are directly interpolated into shell command strings without sanitization, allowing for arbitrary code execution across the cluster if inputs are user-controlled. It is strongly recommended to use shlex.quote() for all variables injected into shell commands. Furthermore, two high-severity issues were identified in exec_command_all_ray_node that could lead to incorrect behavior or side effects in a multi-tasking environment.
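For illustration, the reviewer's shlex.quote() suggestion looks like this; the paths and flags below are made up, not the PR's actual conversion command:

```python
import shlex

model_dir = "/data/models/DeepSeek-V3-0324-5layer"
output_dir = "/tmp/out; echo pwned"  # hostile or odd input stays inert once quoted
command = (
    "python tools/convert_hf_to_torch_dist.py "
    f"--input {shlex.quote(model_dir)} --output {shlex.quote(output_dir)}"
)
```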
Merge prepare_single, prepare_spmd, prepare_cp into private helpers called sequentially from the train command, removing separate CLI subcommands.
Same pattern as run_deepseek.py: merge prepare_single, prepare_spmd, prepare_cp into private helpers called from train. Add empty typer callback to show command names.
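A hedged sketch of that layout, assuming typer is used as the commit message implies (helper bodies elided):

```python
import typer

app = typer.Typer()


@app.callback()
def _callback() -> None:
    """Empty callback: with it present, typer keeps explicit subcommand names
    (e.g. `python run_deepseek.py train`) even though only one command remains."""


def _prepare_single() -> None: ...
def _prepare_spmd() -> None: ...
def _prepare_cp() -> None: ...


@app.command()
def train() -> None:
    # The former prepare_* subcommands now run sequentially before training.
    _prepare_single()
    _prepare_spmd()
    _prepare_cp()
    # ... launch the actual training run ...


if __name__ == "__main__":
    app()
```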
Use 'unset CUDA_VISIBLE_DEVICES' in the bash command to avoid persisting env changes in the Ray worker process, which could affect other tasks on the same worker.
Chained replace() calls can cause incorrect substitutions if a replacement value itself contains a placeholder string. Single-pass re.sub with a dictionary is safer.
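A sketch of the single-pass substitution the reviewer is suggesting; the function name and the torchrun command are illustrative, while the placeholder names come from the commits above:

```python
import re


def render_placeholders(command: str, values: dict[str, str]) -> str:
    pattern = re.compile(r"\{\{(" + "|".join(map(re.escape, values)) + r")\}\}")
    # One pass over the string: a replacement value that itself contains a
    # placeholder (e.g. "{{nnodes}}") is never expanded a second time.
    return pattern.sub(lambda m: values[m.group(1)], command)


rendered = render_placeholders(
    "torchrun --nnodes {{nnodes}} --node-rank {{node_rank}} --master-addr {{master_addr}}",
    {"nnodes": "2", "node_rank": "0", "master_addr": "10.0.0.1", "node_ip": "10.0.0.2"},
)
```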
/gemini review
Code Review
This pull request refactors large model training scripts to be 'one-click' by consolidating steps into a single train command and introduces exec_command_all_ray_node for simplified multi-node operations. Critically, it introduces several command injection vulnerabilities by constructing shell commands with unsanitized input from CLI arguments and environment variables, especially in utility functions utilizing exec_command_all_ray_node. It is highly recommended to use shlex.quote() to sanitize all variables before inclusion in shell command strings to prevent arbitrary command execution. Furthermore, a critical issue exists where a script will fail with default arguments, and a minor performance improvement is suggested.
Should be args.mode, matching the pattern in run_deepseek.py.
This reverts commit 053c8c7.
…lusters
When Ray is started externally (e.g. by msc --ray), the Python process needs to connect before using ray.nodes(). Disconnect after completion so the driver is released before execute_train runs.
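A minimal sketch of that connect/disconnect flow, using standard Ray calls:

```python
import ray

ray.init(address="auto")  # attach to the externally started cluster
alive_nodes = [n for n in ray.nodes() if n["Alive"]]
# ... dispatch per-node setup commands here ...
ray.shutdown()  # release the driver before execute_train starts
```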
- Generalize _prepare_megatron_ckpt to use the layer count from the model name instead of hard-coded model names. The 5-layer model with num_nodes>1 now uses EP=4 (it was previously falling into the full-model PP=8 branch).
- Use an absolute path for tools/convert_hf_to_torch_dist.py in convert_checkpoint so it resolves correctly from the Ray worker cwd.
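For example, the layer count could be pulled from a name like DeepSeek-V3-0324-5layer with a small regex; this is an illustration, not necessarily the PR's parsing:

```python
import re


def num_layers_from_name(model_name: str) -> int | None:
    match = re.search(r"(\d+)layer", model_name, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None


assert num_layers_from_name("DeepSeek-V3-0324-5layer") == 5
```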
Work around flashinfer/flashinfer-jit-cache version mismatch (0.6.3 vs 0.6.1+cu129) in the container environment.
The converter asserts world_size <= num_layers. For 5-layer model on 2 nodes (world_size=8 > 5), fall back to single-node conversion. Also make convert_checkpoint use exec_command (head only) when multinode=False instead of running on all nodes.
Auto PP detection causes expert_tensor_model_pipeline_parallel to exceed world_size when EP=4 on single-node conversion.
The converter auto-increases PP when PP=1 and world_size>1, which conflicts with EP>1 (EP*PP exceeds world_size). For small models (num_layers extracted from name), convert on 1 GPU with PP=1/EP=1. Training reshards automatically via --ref-load.
The converter's mbridge handles FP8 dequant internally. The BF16-converted model has a different safetensors structure that causes dequant_fp8_safetensor_io to miss model.embed_tokens.weight.
With single-GPU conversion, each node must convert locally since storage is node-local. Move skip-if-exists check into shell command so each node evaluates independently.
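Roughly, the skip-if-exists test ends up inside the dispatched shell string so each node consults its own disk; the path and the elided converter arguments below are placeholders:

```python
ckpt_dir = "/local/ckpt/megatron"  # node-local storage, illustrative path
command = (
    f'if [ -d "{ckpt_dir}" ]; then '
    f'echo "checkpoint already converted, skipping"; '
    f"else python tools/convert_hf_to_torch_dist.py ... ; fi"
)
# exec_command_all_ray_node(command)  # each node evaluates the `-d` test locally
```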
Read MILES_SCRIPT_MODEL_DIR, MILES_SCRIPT_DATA_DIR, etc. as defaults so downloads and checkpoints go to cluster-shared storage visible to all nodes. Revert convert_checkpoint to head-only for non-multinode since shared storage makes per-node conversion unnecessary.
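In practice this can be simple environment lookups with fallbacks; the variable names are the ones above, while the default paths are illustrative:

```python
import os

model_dir = os.environ.get("MILES_SCRIPT_MODEL_DIR", "/workspace/models")
data_dir = os.environ.get("MILES_SCRIPT_DATA_DIR", "/workspace/data")
```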
Storage is container-local (no shared volumes), so each node must independently download the model, convert to bf16, and convert to megatron format. Datasets only needed on head node for ray job submit.
Since megatron checkpoint conversion now uses the original model (mbridge handles FP8 dequant internally), the intermediate bf16 checkpoint is unused.
- Fix the fp8_cast_bf16.py path to be absolute (repo_base_dir)
- Restore bf16 as the input for megatron checkpoint conversion
- Use the bf16 model for the training --hf-checkpoint
- Rsync the bf16 model (not the original) in _prepare_cp
Fix convert_hf_to_torch_dist.py and fp8_cast_bf16.py invocations to use repo_base_dir instead of relative paths, which fail when the working directory differs (e.g. Ray worker nodes).
ray job submit resolves relative paths from its own working directory (/workspace), not from the miles repo root. Convert train_script to an absolute path using repo_base_dir when needed.
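A sketch of that resolution step, assuming repo_base_dir is already available in the script:

```python
from pathlib import Path


def resolve_train_script(train_script: str, repo_base_dir: str) -> str:
    path = Path(train_script)
    # `ray job submit` resolves relative paths from its own cwd (/workspace),
    # so anchor relative scripts at the repo root instead.
    return str(path if path.is_absolute() else Path(repo_base_dir) / path)
```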
When num_nodes > 1 and not using external Ray, use the msc Ray cluster to dispatch background restart commands that recreate the Ray cluster with GPU resources on all nodes.
The msc tool sets MILES_SCRIPT_EXTERNAL_RAY=1 but creates a Ray cluster with --num-gpus 0 (for command dispatch only). Training needs GPU resources, so always restart Ray for multi-node regardless of the external_ray flag.
Use single-GPU single-node conversion for DeepSeek-V3-0324-5layer to avoid world_size > num_layers assertion, and dispatch via exec_command (not exec_command_all_ray_node) when multinode=False to prevent race conditions on shared storage.
Ray GPU allocation will be handled by msc instead of restarting Ray inside the training script.
The converter auto-derives PP=2 when world_size=4, matching the original single-node behavior. min(4, num_gpus_per_node) also handles GB200 (4 GPUs/node) correctly.
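Worked out, the choice reads as follows (an illustration of the note above, not the script's literal code):

```python
for num_gpus_per_node in (8, 4):            # H100-class node vs GB200
    world_size = min(4, num_gpus_per_node)  # 4 in both cases
    # with world_size == 4 the converter auto-derives PP=2,
    # matching the original single-node behavior
```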