Make large model training scripts one-click #601
fzyzcjy wants to merge 49 commits into feat/generalize_path
Conversation
Uses ray.remote function with NodeAffinitySchedulingStrategy to dispatch commands to every alive node in parallel. Clears CUDA_VISIBLE_DEVICES so subprocesses can access all GPUs without reserving Ray GPU resources.
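A minimal sketch of that dispatch pattern, assuming the helper is named exec_command_all_ray_node as in the later commits; the exact signature and error handling in the PR may differ:

```python
import subprocess

import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy


@ray.remote(num_gpus=0)  # reserve no Ray GPU resources
def _run_on_node(command: str) -> int:
    # `unset CUDA_VISIBLE_DEVICES` is scoped to the spawned shell, so the
    # subprocess sees every GPU while the Ray worker process stays untouched.
    return subprocess.run(
        ["bash", "-c", f"unset CUDA_VISIBLE_DEVICES && {command}"]
    ).returncode


def exec_command_all_ray_node(command: str) -> list[int]:
    alive_nodes = [n for n in ray.nodes() if n["Alive"]]
    futures = [
        _run_on_node.options(
            scheduling_strategy=NodeAffinitySchedulingStrategy(
                node_id=node["NodeID"], soft=False
            )
        ).remote(command)
        for node in alive_nodes
    ]
    return ray.get(futures)  # one task per alive node, executed in parallel
```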
- Replace exec_command with exec_command_all_ray_node
- Use the shell variable $SLURM_NODEID instead of Python-side interpolation so each node resolves its own rank
- Move the skip check into the shell command so each node checks locally
Replaces {{node_rank}}, {{nnodes}}, {{master_addr}}, and {{node_ip}} per node before dispatching the command.
Replace SLURM env var reading with {{master_addr}}, {{nnodes}}, and {{node_rank}} placeholders resolved by exec_command_all_ray_node.
Ensures the node running the driver is always rank 0, with remaining nodes sorted by IP for deterministic ordering.
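One way to express that ordering, as a hedged sketch (ray.util.get_node_ip_address is a real Ray helper; the function name and the rest are illustrative):

```python
import ray


def ordered_node_ips() -> list[str]:
    driver_ip = ray.util.get_node_ip_address()
    alive_ips = {n["NodeManagerAddress"] for n in ray.nodes() if n["Alive"]}
    others = sorted(ip for ip in alive_ips if ip != driver_ip)
    return [driver_ip] + others  # index in this list becomes the node rank
```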
…ddress" This reverts commit 36e2ce1.
Summary of Changes

Hello @fzyzcjy, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the capability for distributed command execution within a Ray cluster. By introducing a new utility, it streamlines the process of running shell commands across multiple nodes in parallel, offering greater flexibility through dynamic placeholders and improved GPU resource management. This change simplifies multi-node operations and reduces reliance on specific cluster management systems like SLURM for environment variable resolution.
Code Review
This pull request introduces exec_command_all_ray_node, a valuable utility for executing commands across all nodes in a Ray cluster, and migrates convert_checkpoint and rsync_simple to use it, promoting a more general, Ray-native approach. However, this migration unfortunately preserves and expands the impact of pre-existing command injection vulnerabilities. Critical vulnerabilities exist where parameters are directly interpolated into shell command strings without sanitization, allowing for arbitrary code execution across the cluster if inputs are user-controlled. It is strongly recommended to use shlex.quote() for all variables injected into shell commands. Furthermore, two high-severity issues were identified in exec_command_all_ray_node that could lead to incorrect behavior or side effects in a multi-tasking environment.
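For illustration, the reviewer's shlex.quote() suggestion looks like this; the paths and flags below are made up, not the PR's actual conversion command:

```python
import shlex

model_dir = "/data/models/DeepSeek-V3-0324-5layer"
output_dir = "/tmp/out; echo pwned"  # hostile or odd input stays inert once quoted
command = (
    "python tools/convert_hf_to_torch_dist.py "
    f"--input {shlex.quote(model_dir)} --output {shlex.quote(output_dir)}"
)
```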
Merge prepare_single, prepare_spmd, prepare_cp into private helpers called sequentially from the train command, removing separate CLI subcommands.
Same pattern as run_deepseek.py: merge prepare_single, prepare_spmd, prepare_cp into private helpers called from train. Add empty typer callback to show command names.
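A hedged sketch of that layout, assuming typer is used as the commit message implies (helper bodies elided):

```python
import typer

app = typer.Typer()


@app.callback()
def _callback() -> None:
    """Empty callback: with it present, typer keeps explicit subcommand names
    (e.g. `python run_deepseek.py train`) even though only one command remains."""


def _prepare_single() -> None: ...
def _prepare_spmd() -> None: ...
def _prepare_cp() -> None: ...


@app.command()
def train() -> None:
    # The former prepare_* subcommands now run sequentially before training.
    _prepare_single()
    _prepare_spmd()
    _prepare_cp()
    # ... launch the actual training run ...


if __name__ == "__main__":
    app()
```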
Use 'unset CUDA_VISIBLE_DEVICES' in the bash command to avoid persisting env changes in the Ray worker process, which could affect other tasks on the same worker.
Chained replace() calls can cause incorrect substitutions if a replacement value itself contains a placeholder string. Single-pass re.sub with a dictionary is safer.
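A sketch of the single-pass substitution the reviewer is suggesting; the function name and the torchrun command are illustrative, while the placeholder names come from the commits above:

```python
import re


def render_placeholders(command: str, values: dict[str, str]) -> str:
    pattern = re.compile(r"\{\{(" + "|".join(map(re.escape, values)) + r")\}\}")
    # One pass over the string: a replacement value that itself contains a
    # placeholder (e.g. "{{nnodes}}") is never expanded a second time.
    return pattern.sub(lambda m: values[m.group(1)], command)


rendered = render_placeholders(
    "torchrun --nnodes {{nnodes}} --node-rank {{node_rank}} --master-addr {{master_addr}}",
    {"nnodes": "2", "node_rank": "0", "master_addr": "10.0.0.1", "node_ip": "10.0.0.2"},
)
```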
/gemini review
Code Review
This pull request refactors large model training scripts to be 'one-click' by consolidating steps into a single train command and introduces exec_command_all_ray_node for simplified multi-node operations. Critically, it introduces several command injection vulnerabilities by constructing shell commands with unsanitized input from CLI arguments and environment variables, especially in utility functions utilizing exec_command_all_ray_node. It is highly recommended to use shlex.quote() to sanitize all variables before inclusion in shell command strings to prevent arbitrary command execution. Furthermore, a critical issue exists where a script will fail with default arguments, and a minor performance improvement is suggested.
Should be args.mode, matching the pattern in run_deepseek.py.
This reverts commit 053c8c7.
…lusters
When Ray is started externally (e.g. by msc --ray), the Python process needs to connect before using ray.nodes(). Disconnect after completion so the driver is released before execute_train runs.
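A minimal sketch of that connect/disconnect flow, using standard Ray calls:

```python
import ray

ray.init(address="auto")  # attach to the externally started cluster
alive_nodes = [n for n in ray.nodes() if n["Alive"]]
# ... dispatch per-node setup commands here ...
ray.shutdown()  # release the driver before execute_train starts
```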
- Generalize _prepare_megatron_ckpt to use the layer count from the model name instead of hard-coded model names. The 5-layer model with num_nodes>1 now uses EP=4 (it was previously falling into the full-model PP=8 branch).
- Use an absolute path for tools/convert_hf_to_torch_dist.py in convert_checkpoint so it resolves correctly from the Ray worker cwd.
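For example, the layer count could be pulled from a name like DeepSeek-V3-0324-5layer with a small regex; this is an illustration, not necessarily the PR's parsing:

```python
import re


def num_layers_from_name(model_name: str) -> int | None:
    match = re.search(r"(\d+)layer", model_name, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None


assert num_layers_from_name("DeepSeek-V3-0324-5layer") == 5
```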
Work around flashinfer/flashinfer-jit-cache version mismatch (0.6.3 vs 0.6.1+cu129) in the container environment.
The converter asserts world_size <= num_layers. For 5-layer model on 2 nodes (world_size=8 > 5), fall back to single-node conversion. Also make convert_checkpoint use exec_command (head only) when multinode=False instead of running on all nodes.
Auto PP detection causes expert_tensor_model_pipeline_parallel to exceed world_size when EP=4 on single-node conversion.
The converter auto-increases PP when PP=1 and world_size>1, which conflicts with EP>1 (EP*PP exceeds world_size). For small models (num_layers extracted from name), convert on 1 GPU with PP=1/EP=1. Training reshards automatically via --ref-load.
The converter's mbridge handles FP8 dequant internally. The BF16-converted model has a different safetensors structure that causes dequant_fp8_safetensor_io to miss model.embed_tokens.weight.
With single-GPU conversion, each node must convert locally since storage is node-local. Move skip-if-exists check into shell command so each node evaluates independently.
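Roughly, the skip-if-exists test ends up inside the dispatched shell string so each node consults its own disk; the path and the elided converter arguments below are placeholders:

```python
ckpt_dir = "/local/ckpt/megatron"  # node-local storage, illustrative path
command = (
    f'if [ -d "{ckpt_dir}" ]; then '
    f'echo "checkpoint already converted, skipping"; '
    f"else python tools/convert_hf_to_torch_dist.py ... ; fi"
)
# exec_command_all_ray_node(command)  # each node evaluates the `-d` test locally
```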
Read MILES_SCRIPT_MODEL_DIR, MILES_SCRIPT_DATA_DIR, etc. as defaults so downloads and checkpoints go to cluster-shared storage visible to all nodes. Revert convert_checkpoint to head-only for non-multinode since shared storage makes per-node conversion unnecessary.
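In practice this can be simple environment lookups with fallbacks; the variable names are the ones above, while the default paths are illustrative:

```python
import os

model_dir = os.environ.get("MILES_SCRIPT_MODEL_DIR", "/workspace/models")
data_dir = os.environ.get("MILES_SCRIPT_DATA_DIR", "/workspace/data")
```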
Storage is container-local (no shared volumes), so each node must independently download the model, convert to bf16, and convert to megatron format. Datasets only needed on head node for ray job submit.
Since megatron checkpoint conversion now uses the original model (mbridge handles FP8 dequant internally), the intermediate bf16 checkpoint is unused.
- Fix the fp8_cast_bf16.py path to be absolute (repo_base_dir)
- Restore bf16 as the input for megatron checkpoint conversion
- Use the bf16 model for the training --hf-checkpoint
- Rsync the bf16 model (not the original) in _prepare_cp
Fix convert_hf_to_torch_dist.py and fp8_cast_bf16.py invocations to use repo_base_dir instead of relative paths, which fail when the working directory differs (e.g. Ray worker nodes).
ray job submit resolves relative paths from its own working directory (/workspace), not from the miles repo root. Convert train_script to an absolute path using repo_base_dir when needed.
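A sketch of that resolution step, assuming repo_base_dir is already available in the script:

```python
from pathlib import Path


def resolve_train_script(train_script: str, repo_base_dir: str) -> str:
    path = Path(train_script)
    # `ray job submit` resolves relative paths from its own cwd (/workspace),
    # so anchor relative scripts at the repo root instead.
    return str(path if path.is_absolute() else Path(repo_base_dir) / path)
```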
When num_nodes > 1 and not using external Ray, use the msc Ray cluster to dispatch background restart commands that recreate the Ray cluster with GPU resources on all nodes.
The msc tool sets MILES_SCRIPT_EXTERNAL_RAY=1 but creates a Ray cluster with --num-gpus 0 (for command dispatch only). Training needs GPU resources, so always restart Ray for multi-node regardless of the external_ray flag.
Use single-GPU single-node conversion for DeepSeek-V3-0324-5layer to avoid world_size > num_layers assertion, and dispatch via exec_command (not exec_command_all_ray_node) when multinode=False to prevent race conditions on shared storage.
Ray GPU allocation will be handled by msc instead of restarting Ray inside the training script.
The converter auto-derives PP=2 when world_size=4, matching the original single-node behavior. min(4, num_gpus_per_node) also handles GB200 (4 GPUs/node) correctly.
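Worked out, the choice reads as follows (an illustration of the note above, not the script's literal code):

```python
for num_gpus_per_node in (8, 4):            # H100-class node vs GB200
    world_size = min(4, num_gpus_per_node)  # 4 in both cases
    # with world_size == 4 the converter auto-derives PP=2,
    # matching the original single-node behavior
```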