Add support for Lepton and DGX Cloud executors #162
Open
rapaul-nv wants to merge 8 commits into NVIDIA-NeMo:main
Conversation
Extends the Nemotron executor layer to run on NVIDIA DGX Cloud Lepton and
DGX Cloud (run:ai) alongside the existing Slurm path, including multi-node
Ray-based RL (GRPO) via nemo-run's unified RayCluster + RayJob classes.
Changes
-------
* nemo-run pinned to the PR #480 build of rapaul-nv/Run (adds
DGXCloudRayJob / LeptonRayCluster backends + client_credentials auth).
* New nemo_runspec.execution:
- execute_cloud : inline Script submission for non-Ray cloud jobs
(data prep, torchrun-based SFT/pretrain).
- execute_cloud_ray: RayCluster + RayJob path for launch="ray" recipes;
used by nano3/super3 RL commands when on Lepton or
DGX Cloud with nodes > 1.
- _create_lepton_executor / _create_dgxcloud_executor factories, aligned
with nemo-run 0.10 (client_id/client_secret, kube_apiserver_url,
ray_version).
- get_executor_type helper shared across all CLI command dispatchers.
* CLI commands (nano3 + super3 sft/pretrain/rl + data/prep) route cloud
jobs through the new execute_cloud / execute_cloud_ray paths.
* env.toml profiles: [lepton], [lepton_gcp], [lepton_sft*], [lepton_rl],
[dgxcloud] plus a shared [wandb] section.
* Pipeline config accepts "dgxcloud"/"lepton" literals.
* Tests: tests/nemo_runspec/test_execution.py covers executor creation
(local/docker/slurm/lepton/dgxcloud), OmegaConf <-> plain-dict conversion,
cloud workspace derivation, and git-mount command generation.
Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
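The execute_cloud / execute_cloud_ray split described above can be sketched as a small routing rule. This is a hedged illustration: `pick_cloud_path` and its return strings are invented for the example; only the routing condition (Ray recipes on Lepton/DGX Cloud with `nodes > 1` take the RayCluster + RayJob path, everything else the inline Script path) comes from the commit message.

```python
# Hedged sketch of the dispatch described above; the real signatures in
# nemo_runspec.execution may differ. Only the routing rule is taken from
# the PR text: launch="ray" recipes on a cloud executor with nodes > 1 go
# through the RayCluster+RayJob path, everything else through Script jobs.

CLOUD_EXECUTORS = {"lepton", "dgxcloud"}

def pick_cloud_path(executor_type: str, launch: str, nodes: int) -> str:
    """Return which execution helper a command dispatcher would call."""
    if executor_type not in CLOUD_EXECUTORS:
        return "execute_slurm_or_local"    # existing non-cloud path
    if launch == "ray" and nodes > 1:
        return "execute_cloud_ray"         # RayCluster + RayJob
    return "execute_cloud"                 # inline run.Script submission

# e.g. a multi-node GRPO RL recipe on Lepton:
assert pick_cloud_path("lepton", "ray", nodes=4) == "execute_cloud_ray"
# data prep is a non-Ray, single-node job:
assert pick_cloud_path("dgxcloud", "torchrun", nodes=1) == "execute_cloud"
```

Whether a single-node `launch="ray"` job also takes `execute_cloud` is an assumption of this sketch, not stated in the PR.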
Airgap-friendly source transport for Lepton and DGX Cloud executors:
local src/ is tarballed and delivered over the same Job API the launcher
already uses, replacing the old `pip install nemotron @ git+…` step.
- Lepton: chunk tarball across env vars (envp bypasses the 128 KiB argv
  MAX_ARG_STRLEN). NODE_RANK=0 extracts on shared NFS; other pods wait on
  a chunk-count-suffixed marker file to avoid the multi-pod
  rm-rf-then-tar race.
- DGX Cloud (run:ai): drop tarball into job_dir as one `.tgz` file so
  move_data chunks only that file — torchrun_job.sh stays small. Env-var
  chunks would otherwise blow up the per-file deploy count.
- Fallback (Slurm / others): nemo-run native packager extraction.
Infrastructure:
- New nemo_runspec.data_mover module with SourcePackager, Plan, plan_for.
- run.py: patch_cloud_data_mover_skip_configs() excludes configs/ from
  nemo-run's inline-base64 tarball on both Lepton and DGXCloud, keeping
  the data-mover command under the kernel's MAX_ARG_STRLEN.
- execute_cloud / execute_cloud_ray dispatch through a single plan_for()
  call; removed ~250 lines of branching packager/env-var/flag code.
- env.toml knobs: `repo_root` (repo override), `dgxcloud_max_args_chars`
  (tune run:ai per-workload Args budget, default 9500).
Tests: tests/nemo_runspec/test_data_mover.py covers include scoping,
pycache filtering, all three transport branches, and the NODE_RANK gate.
Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
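The Lepton env-var transport above can be sketched in a few lines. The chunk size, the `SRC_CHUNK_*` naming, and both helpers are illustrative assumptions, not the PR's actual code; only the idea (a base64-encoded tarball split across environment variables to stay under the kernel's 128 KiB `MAX_ARG_STRLEN` argv limit) is taken from the commit message.

```python
# Minimal sketch of the env-var chunk transport described above. Variable
# names (SRC_CHUNK_*) and the chunk size are illustrative assumptions.
import base64

CHUNK = 64 * 1024  # stay well under the kernel's 128 KiB MAX_ARG_STRLEN

def to_env_chunks(tarball: bytes) -> dict[str, str]:
    """Split a base64-encoded tarball across numbered env vars."""
    b64 = base64.b64encode(tarball).decode("ascii")
    chunks = [b64[i:i + CHUNK] for i in range(0, len(b64), CHUNK)]
    env = {f"SRC_CHUNK_{i}": c for i, c in enumerate(chunks)}
    env["SRC_CHUNK_COUNT"] = str(len(chunks))
    return env

def from_env_chunks(env: dict[str, str]) -> bytes:
    """Pod-side reassembly: concatenate chunks in order, then decode."""
    n = int(env["SRC_CHUNK_COUNT"])
    b64 = "".join(env[f"SRC_CHUNK_{i}"] for i in range(n))
    return base64.b64decode(b64)
```

A round trip through `to_env_chunks` / `from_env_chunks` recovers the original bytes; in the real flow the pod would pipe the decoded bytes into `tar -xz` rather than keep them in memory.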
Slurm, Lepton, and DGX Cloud now share the same factory pattern. The
previous ``create_executor`` had ~80 lines of inline Slurm setup while
Lepton and DGX Cloud already used dedicated helpers. Five CLI commands
also maintained near-identical copies of ``_build_slurm_executor``.
Changes
-------
* Extract ``create_slurm_executor`` as a public helper (with a
``launcher`` kwarg so Ray-based flows can pass ``None``). Also extract
``_create_local_executor`` and ``_create_docker_executor``.
* Add ``_resolve_container_image`` and ``_resolve_nodes_gpus`` shared
helpers that replace the three duplicated resolution blocks.
* ``create_executor`` is now a small dispatch function.
* Accept an ``identity`` field in env.toml so the SSH tunnel can be
given an explicit private-key path when the agent isn't available.
Bug fix for Slurm auto-mounts
-----------------------------
On Slurm, ``${auto_mount:git+...}`` entries in a recipe config only
register the requested repo into the ``get_git_mounts()`` registry at
the moment OmegaConf resolves them. The previous code cloned the repos
*before* the mounts list was accessed, so the registry was empty and
the resulting sbatch script was missing the bind mounts. The container
then silently fell back to the pre-baked Megatron-LM / Megatron-Bridge
versions, causing training to run against the wrong code.
``_create_slurm_executor`` now reads ``env.mounts`` before cloning,
which triggers resolution and populates the registry.
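The ordering bug above can be reproduced with a stdlib stand-in for OmegaConf's lazy interpolation: resolvers run only when a value is accessed, so reading the list is what populates the side-effect registry. All names here (`GIT_MOUNTS`, `EnvConfig`, the two order functions) are hypothetical, invented for the illustration.

```python
# Stdlib stand-in for the lazy-resolution ordering bug described above.
# OmegaConf resolvers fire only when a value is accessed; a property plays
# that role here. GIT_MOUNTS stands in for the get_git_mounts() registry.

GIT_MOUNTS: list[str] = []

class EnvConfig:
    """Mounts resolve lazily: touching .mounts registers the git repos."""
    def __init__(self, raw_mounts: list[str]):
        self._raw = raw_mounts

    @property
    def mounts(self) -> list[str]:
        for m in self._raw:                  # "resolution" happens here
            if m.startswith("git+") and m not in GIT_MOUNTS:
                GIT_MOUNTS.append(m)
        return list(self._raw)

def clone_then_read(env: EnvConfig) -> int:
    """Buggy order: snapshot the registry before mounts are resolved."""
    snapshot = len(GIT_MOUNTS)   # cloning sees an empty registry...
    env.mounts                   # ...because resolution happens only now
    return snapshot

def read_then_clone(env: EnvConfig) -> int:
    """Fixed order: read env.mounts first, then consult the registry."""
    env.mounts
    return len(GIT_MOUNTS)
```

With one `git+` mount configured, the buggy order reports zero registered repos at clone time while the fixed order reports one, mirroring the missing-bind-mount symptom described above.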
CLI commands
------------
The following commands no longer carry their own Slurm executor setup
and delegate to ``create_slurm_executor`` (with ``launcher=None`` for
Ray flows):
* ``nano3/data/prep/sft.py``
* ``nano3/data/prep/rl.py``
* ``nano3/data/prep/pretrain.py``
* ``nano3/rl.py``
* ``super3/rl/_base.py`` (persistent_cache and sif_dir extras applied
on top of the returned executor)
Misc
----
* Fix stale ``.gitignore`` whitelist: the CLI data directory moved from
``src/nemotron/cli/nano3/data/`` to
``src/nemotron/cli/commands/nano3/data/`` but the override entry was
never updated, so the files were silently ignored.
Verification
------------
* 131/131 unit tests pass.
* Lint clean on all touched files.
* End-to-end Slurm SFT data prep succeeds on CW DFW CS-001 (2 nodes,
10k samples, 47M tokens).
Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
- Updated the `nemo-run` dependency to point to the main branch of the NVIDIA-NeMo repository.
- Enhanced the `SourcePackager` class to streamline tarball creation for job directories.
- Refactored the `plan_for` function to unify environment variable chunking for Lepton and DGX Cloud executors, improving data transport efficiency.
- Introduced new patches to handle legacy keyword arguments and optimize environment variable handling in DGX Cloud.
- Added a manifest resolver to facilitate configuration management for data preparation outputs.
This commit ensures better compatibility with upstream changes and enhances the overall performance of the data mover functionality across different execution environments.
- Added support for executing data preparation commands (pretrain, rl, sft) on DGX Cloud and Lepton using the new `execute_cloud` function.
- Implemented logic to handle single-node CPU work for cloud environments, ensuring compatibility with the existing data preparation workflow.
- Introduced functions to create train/val splits and update the manifest with per-split paths in the RL data preparation process.
This update improves the flexibility and efficiency of data preparation across different execution environments.
marcromeyn
requested changes
Apr 27, 2026
- docs: document required kube_apiserver_url; promote client_id/client_secret as primary auth fields with app_id/app_secret as legacy aliases
- run.py: use argv list form for tar to avoid shell-metachar issues in nemo_run_dir paths
- execution.py: honor explicit gpus_per_node=0 when picking nproc
- execution.py: rename shadowed local launch -> launch_cmd
- execution.py: narrow cluster.start except to "already exists" idempotency, re-raise auth/quota/image-pull failures
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
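The `gpus_per_node=0` review item above is the classic falsy-zero bug: `x or default` treats an explicit 0 the same as "unset". A minimal sketch (the real `execution.py` logic is more involved; function names and the default of 8 are invented here):

```python
# The gpus_per_node=0 fix above is the classic falsy-zero bug:
# `x or default` cannot distinguish an explicit 0 from "unset".

def pick_nproc_buggy(gpus_per_node, default_nproc=8):
    return gpus_per_node or default_nproc   # 0 falls through to the default

def pick_nproc_fixed(gpus_per_node, default_nproc=8):
    # Only substitute the default when the value is genuinely unset.
    return default_nproc if gpus_per_node is None else gpus_per_node

assert pick_nproc_buggy(0) == 8   # CPU-only job wrongly gets a GPU nproc
assert pick_nproc_fixed(0) == 0   # explicit 0 honored
```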
Contributor
Why is the …
Author
@marcromeyn Updated the PR description as it was old and not well summarized!
Summary
Brings nano3 and super3 recipes onto NVIDIA DGX Cloud Lepton and DGX Cloud (run:ai) alongside the existing Slurm path — including data prep, SFT/pretrain (torchrun), and multi-node Ray GRPO RL. Replaces `pip install … @ git+…` on cloud pods with a unified local-source transport, refactors the Slurm executor into a shared factory (fixing a silent auto-mount bug), wires super3 data-prep through the cloud path, and bumps `nemo-run` to upstream `main`.

What this PR delivers
1. Lepton + DGX Cloud executor support

A new cloud execution path through `nemo_runspec.execution`:

* `execute_cloud` — inline `run.Script` submission for non-Ray cloud jobs (data prep, torchrun SFT/pretrain).
* `execute_cloud_ray` — `RayCluster` + `RayJob` for `launch="ray"` recipes; used by nano3/super3 RL on Lepton/DGX Cloud at `nodes > 1`.
* `_create_lepton_executor` / `_create_dgxcloud_executor` factories aligned with nemo-run (`client_id`/`client_secret`, `kube_apiserver_url`, `ray_version`).
* `get_executor_type` helper for CLI dispatchers; pipeline configs accept `"lepton"`/`"dgxcloud"` literals.
* CLI commands (`sft`, `pretrain`, `rl`, `data/prep/*`) all dispatch through this path.

2. Local-source transport for cloud pods
A new `nemo_runspec.data_mover` module replaces `pip install nemotron @ git+…` with a single tarball-shipping flow. One function, `plan_for(executor_type, ...)`, returns a `Plan` describing how to deliver `src/` to the pod:

* Destination on the pod: `/nemo_run/code/src`, reassembled via `python3 -c '…' | tar -xz`.
* Lepton: tarball chunked across env vars (argv is capped by the kernel's `MAX_ARG_STRLEN`; env vars bypass it).

Both cloud paths share the same code path; only the chunk size differs. The pod-side reassembly script is identical.
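The `plan_for()` shape described above can be sketched as a tiny dataclass dispatch. The `Plan` fields and transport names here are illustrative, not the module's real definitions; only the three branches (Lepton env-var chunks, DGX Cloud job-dir `.tgz`, nemo-run packager fallback) and the `/nemo_run/code/src` destination come from the PR text.

```python
# Hedged sketch of plan_for(); Plan fields and transport labels are
# illustrative. Only the three branches and the destination are from the PR.
from dataclasses import dataclass, field

@dataclass
class Plan:
    transport: str                 # "env_chunks" | "job_dir_tgz" | "packager"
    dest: str = "/nemo_run/code/src"
    env: dict = field(default_factory=dict)    # Lepton: chunked tarball
    files: dict = field(default_factory=dict)  # DGX Cloud: single .tgz

def plan_for(executor_type: str) -> Plan:
    """Pick the source-transport branch per executor, as the PR describes."""
    if executor_type == "lepton":
        return Plan("env_chunks")    # env vars bypass MAX_ARG_STRLEN
    if executor_type == "dgxcloud":
        return Plan("job_dir_tgz")   # one .tgz in job_dir keeps Args small
    return Plan("packager")          # Slurm/others: nemo-run native packager
```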
Multi-pod NFS race fix (Lepton/DGX Cloud): when N pods share the same destination on NFS, each would otherwise `rm -rf && tar -xz` concurrently and clobber each other. `NODE_RANK=0` extracts and drops a marker; other ranks wait on it. The marker name is suffixed with the chunk count so stale markers from prior runs can't mislead waiters.

Auto-discovery of what to ship (`_auto_includes`): walks `<repo>/src/*`, ships every top-level package, and for packages with a `recipes/` subdir ships only the active recipe family inferred from the script path (e.g. `nano3`) — keeps the tarball small.

Three nemo-run patches keep this working:

* `patch_cloud_data_mover_skip_configs()` — excludes `configs/` from nemo-run's inline-base64 tarball (otherwise the data-mover command exceeds kernel argv limits).
* `patch_dgxcloud_strip_source_chunks_from_exports()` — keeps chunks in run:ai's structured `environmentVariables` field instead of letting them get re-baked into `torchrun_job.sh` (otherwise `move_data` spends ~12 min chunking the script).
* `patch_dgxcloud_accept_legacy_kwargs()` — silences the `app_id` keyword warning fiddle prints on every status poll.
execute_cloud/execute_cloud_raydispatch through a singleplan_for()call, removing ~250 lines of branching packager / env-var code that previously lived inline inexecution.py.New
env.tomlknobs:repo_root(override repo location),dgxcloud_max_args_chars(run:AI per-workloadArgsbudget, default 9500).3. Slurm executor refactor + auto-mount fix
Slurm, Lepton, and DGX Cloud previously had three different code paths for executor construction; they now share one factory pattern in
execution.py:create_slurm_executor(with alauncherkwarg so Ray flows can passNone), plus_create_local_executor/_create_docker_executor._resolve_container_imageand_resolve_nodes_gpushelpers replace three duplicated resolution blocks.create_executorbecomes a small dispatcher.nano3/data/prep/{sft,rl,pretrain}.py,nano3/rl.py,super3/rl/_base.py) drop their ~80-line inline Slurm setup and delegate.Auto-mount bug fix (Slurm):
${auto_mount:git+…}entries register themselves intoget_git_mounts()only at OmegaConf resolution time. The previous code cloned repos before themountslist was accessed, so the registry was empty whenclone_git_repos_via_tunnelran — the resulting sbatch script was missing its bind-mounts and the container silently fell back to pre-baked Megatron-LM / Megatron-Bridge, training against the wrong code.create_slurm_executornow readsenv.mountsbefore the tunnel's clone step, forcing OmegaConf to walk the list and populate the registry.New
env.tomlfield:identity— explicit SSH private-key path for the tunnel when the SSH agent isn't available.4. super3 data-prep on cloud executors
`data prep {pretrain, rl, sft}` now route through `execute_cloud`, matching the nano3 path; single-node CPU work is supported on cloud executors.

5. Dependency + polish

* `nemo-run` pinned to NVIDIA-NeMo `main`.
* `SourcePackager` streamlined; `plan_for` unifies env-var chunking across Lepton and DGX Cloud.
* Manifest resolver for data-prep outputs (`nemo_runspec/config/resolvers.py`).
* `.gitignore` whitelist fix: the CLI data dir moved from `src/nemotron/cli/nano3/data/` to `src/nemotron/cli/commands/nano3/data/`; override entry updated (files were silently ignored before).

Breaking / behavioral changes
* Cloud pods no longer `pip install` from git. Consumers depending on the old git-install path (custom dockerfiles, CI cache) need to adjust.
* Slurm jobs now run against the requested `${auto_mount:git+…}` versions. Expect real behavior changes; re-baseline numbers for affected jobs.
* `cluster.start` errors are no longer swallowed — auth, quota, and image-pull failures propagate. CI or scripts that relied on silent retries may need updating.
* `gpus_per_node=0` is now honored when picking `nproc` (CPU-only cloud workloads no longer silently get a GPU `nproc` value).
* `nemo-run` pin updated — verify downstream consumers pin compatibly.
* New `env.toml` fields (`identity`, `repo_root`, `dgxcloud_max_args_chars`) are additive and documented in `docs/nemo_runspec/nemo-run.md`.

Test plan
* `pytest tests/nemo_runspec/test_execution.py tests/nemo_runspec/test_data_mover.py` passes.
* `NODE_RANK=0` marker gates the other ranks' extraction.
* Chunks stay in `environmentVariables` (not `torchrun_job.sh`); `Args` stays under budget.
* RL at `nodes > 1` goes through `execute_cloud_ray`.
* `${auto_mount:git+…}` (auto-mount fix regression guard).
* `train/` + `val/` splits and updated manifest.
* `gpus_per_node=0` picks the right `nproc` (CPU-only regression guard).