
Make large model training scripts one-click #601

Open

fzyzcjy wants to merge 49 commits into feat/generalize_path from worktree/ac8330

Conversation

@fzyzcjy (Collaborator) commented on Feb 14, 2026

No description provided.

Uses a ray.remote function with NodeAffinitySchedulingStrategy to dispatch commands to every alive node in parallel. Clears CUDA_VISIBLE_DEVICES so subprocesses can access all GPUs without reserving Ray GPU resources.
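A minimal sketch of this dispatch pattern, assuming illustrative helper names rather than the exact code added in miles/utils/misc.py:

```python
# Illustrative sketch only; names and details are assumptions, not the PR's
# exact implementation.
import os
import subprocess

import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy


@ray.remote(num_cpus=0)  # no GPU reservation; the subprocess grabs GPUs itself
def _run_on_node(command: str) -> int:
    env = os.environ.copy()
    # Drop the variable Ray sets for its own scheduling so the subprocess can
    # see every GPU on the node. (A later commit in this PR moves this into the
    # bash command via `unset CUDA_VISIBLE_DEVICES` instead.)
    env.pop("CUDA_VISIBLE_DEVICES", None)
    return subprocess.run(command, shell=True, env=env, check=True).returncode


def exec_on_all_nodes(command: str) -> list[int]:
    alive = [n for n in ray.nodes() if n["Alive"]]
    futures = [
        _run_on_node.options(
            scheduling_strategy=NodeAffinitySchedulingStrategy(
                node_id=node["NodeID"], soft=False
            )
        ).remote(command)
        for node in alive
    ]
    return ray.get(futures)
```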
- Replace exec_command with exec_command_all_ray_node
- Use shell variable $SLURM_NODEID instead of Python-side interpolation
  so each node resolves its own rank
- Move skip check into shell command so each node checks locally

Replaces {{node_rank}}, {{nnodes}}, {{master_addr}}, {{node_ip}} per-node before dispatching the command.

Replace SLURM env var reading with {{master_addr}}, {{nnodes}}, {{node_rank}} placeholders resolved by exec_command_all_ray_node.

Ensures the node running the driver is always rank 0, with remaining nodes sorted by IP for deterministic ordering.
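One way to get that deterministic ordering, as a hedged sketch rather than the PR's exact code:

```python
# Hedged sketch of the ordering only: the driver's node is pinned to rank 0
# and the remaining alive nodes are sorted by IP.
import ray


def ordered_node_ips() -> list[str]:
    driver_ip = ray.util.get_node_ip_address()
    alive_ips = {n["NodeManagerAddress"] for n in ray.nodes() if n["Alive"]}
    return [driver_ip, *sorted(alive_ips - {driver_ip})]


# node_rank for a given node is then ordered_node_ips().index(node_ip), and
# {{master_addr}} can resolve to the rank-0 (driver) IP.
```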
@gemini-code-assist
Contributor

Summary of Changes

Hello @fzyzcjy, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the capability for distributed command execution within a Ray cluster. By introducing a new utility, it streamlines the process of running shell commands across multiple nodes in parallel, offering greater flexibility through dynamic placeholders and improved GPU resource management. This change simplifies multi-node operations and reduces reliance on specific cluster management systems like SLURM for environment variable resolution.

Highlights

  • New Multi-Node Command Execution Utility: Introduced exec_command_all_ray_node in misc.py to dispatch shell commands to all alive Ray nodes in parallel.
  • Placeholder Support: The new utility supports per-node placeholders ({{node_rank}}, {{nnodes}}, {{master_addr}}, {{node_ip}}) which are replaced before command execution, ensuring the driver node is always rank 0.
  • CUDA_VISIBLE_DEVICES Handling: The CUDA_VISIBLE_DEVICES environment variable is cleared for subprocesses to allow access to all GPUs without reserving Ray GPU resources.
  • Migration of Existing Functions: Migrated convert_checkpoint and rsync_simple in command_utils.py to utilize the new exec_command_all_ray_node function, replacing previous SLURM environment variable parsing with the new placeholder system.
Changelog
  • miles/utils/external_utils/command_utils.py
    • Imported exec_command_all_ray_node for multi-node command execution.
    • Updated convert_checkpoint to use exec_command_all_ray_node and replaced SLURM environment variable parsing with dynamic placeholders for master_addr, nnodes, and node_rank.
    • Modified rsync_simple to use exec_command_all_ray_node for parallel execution across all Ray nodes.
  • miles/utils/misc.py
    • Added imports for os and NodeAffinitySchedulingStrategy.
    • Implemented _exec_command_on_node as a Ray remote function that clears CUDA_VISIBLE_DEVICES before executing a command.
    • Introduced exec_command_all_ray_node to execute a given shell command on all alive Ray nodes in parallel, supporting per-node placeholder replacement for node_rank, nnodes, master_addr, and node_ip.

@fzyzcjy changed the title from "Add exec_command_all_ray_node for multi-node command execution" to "Make large model training scripts one-click" on Feb 14, 2026
@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request introduces exec_command_all_ray_node, a valuable utility for executing commands across all nodes in a Ray cluster, and migrates convert_checkpoint and rsync_simple to use it, promoting a more general, Ray-native approach. However, this migration unfortunately preserves and expands the impact of pre-existing command injection vulnerabilities. Critical vulnerabilities exist where parameters are directly interpolated into shell command strings without sanitization, allowing for arbitrary code execution across the cluster if inputs are user-controlled. It is strongly recommended to use shlex.quote() for all variables injected into shell commands. Furthermore, two high-severity issues were identified in exec_command_all_ray_node that could lead to incorrect behavior or side effects in a multi-tasking environment.
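For illustration, the kind of quoting the review asks for might look like the following; the script path appears elsewhere in this PR, but the flags and helper name are placeholders, not the PR's actual arguments:

```python
# Hypothetical example of shlex.quote() hardening for shell command strings.
import shlex


def build_convert_command(hf_checkpoint: str, output_dir: str) -> str:
    # Quoting prevents a value like 'x; rm -rf /' from being interpreted as
    # extra shell commands when the string is later run with shell=True.
    return (
        "python tools/convert_hf_to_torch_dist.py "
        f"--hf-checkpoint {shlex.quote(hf_checkpoint)} "
        f"--output-dir {shlex.quote(output_dir)}"
    )
```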

Merge prepare_single, prepare_spmd, prepare_cp into private helpers called sequentially from the train command, removing separate CLI subcommands.

Same pattern as run_deepseek.py: merge prepare_single, prepare_spmd, prepare_cp into private helpers called from train. Add empty typer callback to show command names.

Use 'unset CUDA_VISIBLE_DEVICES' in the bash command to avoid persisting env changes in the Ray worker process, which could affect other tasks on the same worker.
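A sketch of that approach (hedged; the wrapper name is illustrative):

```python
# Illustrative only: prepend `unset CUDA_VISIBLE_DEVICES` to the shell command
# rather than mutating os.environ in the long-lived Ray worker process.
import subprocess


def run_with_all_gpus(command: str) -> None:
    # The unset lives and dies with this subprocess's shell, so other Ray
    # tasks scheduled on the same worker keep their original environment.
    subprocess.run(f"unset CUDA_VISIBLE_DEVICES && ({command})", shell=True, check=True)
```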
Chained replace() calls can cause incorrect substitutions if a
replacement value itself contains a placeholder string. Single-pass
re.sub with a dictionary is safer.
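For example, a single-pass substitution over the placeholder names used in this PR could look like this (illustrative, not the exact code in misc.py):

```python
# Hedged sketch: one regex pass over all placeholders, so a replacement value
# that itself contains "{{...}}" is never expanded by a later substitution.
import re


def render_placeholders(command: str, values: dict[str, str]) -> str:
    pattern = re.compile(r"\{\{(" + "|".join(map(re.escape, values)) + r")\}\}")
    return pattern.sub(lambda m: values[m.group(1)], command)


# render_placeholders(
#     "echo node {{node_rank}} of {{nnodes}}, master {{master_addr}}",
#     {"node_rank": "0", "nnodes": "2", "master_addr": "10.0.0.1", "node_ip": "10.0.0.2"},
# )
```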
@fzyzcjy (Collaborator, Author) commented on Feb 14, 2026

/gemini review

@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request refactors large model training scripts to be 'one-click' by consolidating steps into a single train command and introduces exec_command_all_ray_node for simplified multi-node operations. Critically, it introduces several command injection vulnerabilities by constructing shell commands with unsanitized input from CLI arguments and environment variables, especially in utility functions utilizing exec_command_all_ray_node. It is highly recommended to use shlex.quote() to sanitize all variables before inclusion in shell command strings to prevent arbitrary command execution. Furthermore, a critical issue exists where a script will fail with default arguments, and a minor performance improvement is suggested.

Should be args.mode, matching the pattern in run_deepseek.py.

…lusters

When Ray is started externally (e.g. by msc --ray), the Python process
needs to connect before using ray.nodes(). Disconnect after completion
so the driver is released before execute_train runs.
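The connect/disconnect shape described above, as a hedged sketch (assuming a standard ray.init(address="auto") attach to the externally started cluster):

```python
# Illustrative sketch; assumes the cluster started by `msc --ray` is reachable
# via the default address resolution.
import ray

if not ray.is_initialized():
    ray.init(address="auto")  # attach to the externally started Ray cluster
try:
    alive_nodes = [n for n in ray.nodes() if n["Alive"]]
    # ... dispatch the preparation commands to every alive node ...
finally:
    ray.shutdown()  # release this driver before execute_train starts its own job
```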
- Generalize _prepare_megatron_ckpt to use layer count from model name instead of hard-coded model names. 5-layer with num_nodes>1 now uses EP=4 (was falling to full-model PP=8 branch).
- Use absolute path for tools/convert_hf_to_torch_dist.py in convert_checkpoint so it resolves correctly from Ray worker cwd.

Work around flashinfer/flashinfer-jit-cache version mismatch (0.6.3 vs 0.6.1+cu129) in the container environment.

The converter asserts world_size <= num_layers. For 5-layer model on 2 nodes (world_size=8 > 5), fall back to single-node conversion. Also make convert_checkpoint use exec_command (head only) when multinode=False instead of running on all nodes.

Auto PP detection causes expert_tensor_model_pipeline_parallel to exceed world_size when EP=4 on single-node conversion.

The converter auto-increases PP when PP=1 and world_size>1, which conflicts with EP>1 (EP*PP exceeds world_size). For small models (num_layers extracted from name), convert on 1 GPU with PP=1/EP=1. Training reshards automatically via --ref-load.

The converter's mbridge handles FP8 dequant internally. The BF16-converted model has a different safetensors structure that causes dequant_fp8_safetensor_io to miss model.embed_tokens.weight.

With single-GPU conversion, each node must convert locally since storage is node-local. Move skip-if-exists check into shell command so each node evaluates independently.
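One possible shape of that per-node check, moved into the dispatched shell command (illustrative; the output path and converter invocation are placeholders):

```python
# Hypothetical sketch: wrap the converter call so each node tests its own
# node-local storage before doing any work.
def with_skip_if_exists(output_dir: str, convert_cmd: str) -> str:
    return (
        f'if [ -e "{output_dir}" ]; then '
        f'echo "skip: {output_dir} already exists on this node"; '
        f"else {convert_cmd}; fi"
    )
```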
Read MILES_SCRIPT_MODEL_DIR, MILES_SCRIPT_DATA_DIR, etc. as defaults so downloads and checkpoints go to cluster-shared storage visible to all nodes. Revert convert_checkpoint to head-only for non-multinode since shared storage makes per-node conversion unnecessary.

Storage is container-local (no shared volumes), so each node must independently download the model, convert to bf16, and convert to megatron format. Datasets only needed on head node for ray job submit.

Since megatron checkpoint conversion now uses the original model (mbridge handles FP8 dequant internally), the intermediate bf16 checkpoint is unused.

- Fix fp8_cast_bf16.py path to absolute (repo_base_dir)
- Restore bf16 as input for megatron checkpoint conversion
- Use bf16 model for training --hf-checkpoint
- Rsync bf16 model (not original) in _prepare_cp

Fix convert_hf_to_torch_dist.py and fp8_cast_bf16.py invocations to use repo_base_dir instead of relative paths, which fail when the working directory differs (e.g. Ray worker nodes).

ray job submit resolves relative paths from its own working directory (/workspace), not from the miles repo root. Convert train_script to an absolute path using repo_base_dir when needed.
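A hedged sketch of that path fix (treating repo_base_dir as a plain string; the helper name is illustrative):

```python
# Illustrative only: anchor relative script paths at the miles repo root,
# since `ray job submit` resolves them against its own working directory.
import os


def resolve_train_script(train_script: str, repo_base_dir: str) -> str:
    if os.path.isabs(train_script):
        return train_script
    return os.path.join(repo_base_dir, train_script)
```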
When num_nodes > 1 and not using external Ray, use the msc Ray cluster to dispatch background restart commands that recreate the Ray cluster with GPU resources on all nodes.

The msc tool sets MILES_SCRIPT_EXTERNAL_RAY=1 but creates a Ray cluster with --num-gpus 0 (for command dispatch only). Training needs GPU resources, so always restart Ray for multi-node regardless of the external_ray flag.

Use single-GPU single-node conversion for DeepSeek-V3-0324-5layer to avoid world_size > num_layers assertion, and dispatch via exec_command (not exec_command_all_ray_node) when multinode=False to prevent race conditions on shared storage.

Ray GPU allocation will be handled by msc instead of restarting Ray inside the training script.

The converter auto-derives PP=2 when world_size=4, matching the original single-node behavior. min(4, num_gpus_per_node) also handles GB200 (4 GPUs/node) correctly.