Add support for Lepton and DGX Cloud executors #162

Open

rapaul-nv wants to merge 8 commits into NVIDIA-NeMo:main from rkalaniNV:rapaul/customization-recipe-main

Conversation


rapaul-nv commented Apr 23, 2026

Summary

Brings nano3 and super3 recipes onto NVIDIA DGX Cloud Lepton and DGX Cloud (run:ai) alongside the existing Slurm path — including data prep, SFT/pretrain (torchrun), and multi-node Ray GRPO RL. Replaces pip install … @ git+… on cloud pods with a unified local-source transport, refactors the Slurm executor into a shared factory (fixing a silent auto-mount bug), wires super3 data-prep through the cloud path, and bumps nemo-run to upstream main.

What this PR delivers

1. Lepton + DGX Cloud executor support

A new cloud execution path through nemo_runspec.execution:

  • execute_cloud — inline run.Script submission for non-Ray cloud jobs (data prep, torchrun SFT/pretrain).
  • execute_cloud_ray — RayCluster + RayJob for launch="ray" recipes; used by nano3/super3 RL on Lepton/DGX Cloud at nodes > 1.
  • Lepton + DGX Cloud factory functions aligned with nemo-run 0.10 (client_id/client_secret, kube_apiserver_url, ray_version).
  • Shared get_executor_type helper for CLI dispatchers; pipeline configs accept "lepton" / "dgxcloud" literals.
  • nano3 and super3 CLI commands (sft, pretrain, rl, data/prep/*) all dispatch through this; a minimal dispatch sketch follows below.
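
A rough sketch of what that dispatch could look like. `get_executor_type`, `execute_cloud`, and `execute_cloud_ray` are the names this PR introduces; the config shape and the stubbed routing below are illustrative assumptions, not the actual CLI code.

```python
from dataclasses import dataclass


@dataclass
class RunConfig:
    executor: str              # "slurm" | "lepton" | "dgxcloud" | ...
    launch: str = "torchrun"   # recipes that need Ray set launch="ray"
    nodes: int = 1


def get_executor_type(cfg: RunConfig) -> str:
    # Stand-in for the shared helper; the real one reads the pipeline config.
    return cfg.executor


def dispatch(cfg: RunConfig) -> str:
    # Cloud executors go through the new cloud path; Ray recipes at nodes > 1
    # take the RayCluster + RayJob route, everything else is an inline run.Script.
    if get_executor_type(cfg) in ("lepton", "dgxcloud"):
        if cfg.launch == "ray" and cfg.nodes > 1:
            return "execute_cloud_ray"
        return "execute_cloud"
    return "native executor (Slurm / local / docker)"


print(dispatch(RunConfig(executor="lepton", launch="ray", nodes=4)))  # execute_cloud_ray
print(dispatch(RunConfig(executor="dgxcloud")))                       # execute_cloud
```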

2. Local-source transport for cloud pods

A new nemo_runspec.data_mover module replaces pip install nemotron @ git+… with a single tarball-shipping flow. One function, plan_for(executor_type, ...), returns a Plan describing how to deliver src/ to the pod:

| Executor | Transport | Why |
| --- | --- | --- |
| Slurm (and other native paths) | nemo-run's native packager extracts the tarball into `/nemo_run/code/src` | already works; no chunking needed |
| Lepton | base64 the tarball, chunk into env vars (96 KiB each), pod reassembles with a one-liner `python3 -c '…' \| tar -xz` | argv is capped at 128 KiB (`MAX_ARG_STRLEN`); env vars bypass it |
| DGX Cloud (run:ai) | same env-var chunking, 9 KiB per chunk | run:ai hard-caps each env-var value at 10,000 chars |

Both cloud paths share the same code path; only the chunk size differs. The pod-side reassembly script is identical.
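
For concreteness, here is a minimal sketch of that chunking scheme. Only the chunk sizes and the base64-into-env-vars idea come from the PR; the env-var names (`SRC_CHUNK_<i>`, `SRC_NUM_CHUNKS`), the extraction directory, and the exact reassembly one-liner are assumptions for illustration.

```python
import base64

# Chunk sizes from the table above; everything else in this sketch is illustrative.
LEPTON_CHUNK = 96 * 1024    # argv is capped by MAX_ARG_STRLEN (~128 KiB); env vars are not
DGXCLOUD_CHUNK = 9 * 1024   # run:ai hard-caps each env-var value at 10,000 chars


def chunk_tarball(tar_bytes: bytes, chunk_size: int) -> dict[str, str]:
    """Split a base64-encoded tarball across environment variables."""
    encoded = base64.b64encode(tar_bytes).decode("ascii")
    env = {
        f"SRC_CHUNK_{i}": encoded[off : off + chunk_size]
        for i, off in enumerate(range(0, len(encoded), chunk_size))
    }
    env["SRC_NUM_CHUNKS"] = str(len(env))
    return env


# Pod-side reassembly: join the chunks, base64-decode, stream into tar.
REASSEMBLE_CMD = (
    "python3 -c 'import os,sys,base64; n=int(os.environ[\"SRC_NUM_CHUNKS\"]); "
    "sys.stdout.buffer.write(base64.b64decode(\"\".join("
    "os.environ[f\"SRC_CHUNK_{i}\"] for i in range(n))))' | tar -xz -C /nemo_run/code"
)
```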

Multi-pod NFS race fix (Lepton/DGX Cloud): when N pods share the same destination on NFS, each would otherwise rm -rf && tar -xz concurrently and clobber each other. NODE_RANK=0 extracts and drops a marker; other ranks wait on it. The marker name is suffixed with the chunk count so stale markers from prior runs can't mislead waiters.
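
A sketch of that rank-0 gate follows. The real logic lives in nemo_runspec.data_mover; the function name, paths, and polling interval below are invented for the example, while the chunk-count-suffixed marker and the NODE_RANK=0 gating mirror the description above.

```python
import os
import pathlib
import subprocess
import time


def extract_or_wait(dest: str, tarball: str, num_chunks: int, timeout_s: int = 600) -> None:
    dest_path = pathlib.Path(dest)
    # Suffixing the marker with the chunk count means a marker left behind by a
    # previous run (with a different payload) cannot satisfy this run's waiters.
    marker = dest_path.parent / f".src_extracted.{num_chunks}"

    if int(os.environ.get("NODE_RANK", "0")) == 0:
        # Only rank 0 clobbers and re-extracts the shared NFS destination.
        subprocess.run(["rm", "-rf", str(dest_path)], check=True)
        dest_path.mkdir(parents=True, exist_ok=True)
        subprocess.run(["tar", "-xzf", tarball, "-C", str(dest_path)], check=True)
        marker.touch()
        return

    # Non-zero ranks wait for rank 0's marker instead of racing the extraction.
    deadline = time.time() + timeout_s
    while not marker.exists():
        if time.time() > deadline:
            raise TimeoutError(f"rank 0 never dropped {marker}")
        time.sleep(2)
```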

Auto-discovery of what to ship (_auto_includes): walks <repo>/src/*, ships every top-level package, and for packages with a recipes/ subdir ships only the active recipe family inferred from the script path (e.g. nano3) — keeps the tarball small.
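
Roughly what that discovery might look like; the name `_auto_includes` and the recipe-family inference come from the description, while the signature, the hard-coded family list, and the return shape are assumptions.

```python
from pathlib import Path


def _auto_includes(repo_root: Path, script_path: Path) -> list[Path]:
    # Infer the active recipe family (e.g. "nano3") from the submitted script's path.
    family = next((p for p in script_path.parts if p in ("nano3", "super3")), None)

    includes: list[Path] = []
    for pkg in sorted((repo_root / "src").iterdir()):
        if not pkg.is_dir():
            continue
        recipes = pkg / "recipes"
        if recipes.is_dir() and family is not None:
            # Ship the package minus recipes/, plus only the active family's recipes.
            includes.extend(p for p in pkg.iterdir() if p.name != "recipes")
            includes.append(recipes / family)
        else:
            includes.append(pkg)
    return includes
```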

Three nemo-run patches keep this working:

  • patch_cloud_data_mover_skip_configs() — exclude configs/ from nemo-run's inline-base64 tarball (otherwise the data-mover command exceeds kernel argv limits).
  • patch_dgxcloud_strip_source_chunks_from_exports() — keep chunks in run:AI's structured environmentVariables field instead of letting them get re-baked into torchrun_job.sh (otherwise move_data spends ~12 min chunking the script).
  • patch_dgxcloud_accept_legacy_kwargs() — silence the app_id keyword warning that fiddle prints on every status poll.

Net code impact: execute_cloud / execute_cloud_ray dispatch through a single plan_for() call, removing ~250 lines of branching packager / env-var code that previously lived inline in execution.py.

New env.toml knobs: repo_root (override repo location), dgxcloud_max_args_chars (run:AI per-workload Args budget, default 9500).
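
How those knobs might be consumed is sketched below; only the field names and the 9500 default come from the PR, while the env.toml layout and this loader are assumptions.

```python
import tomllib
from pathlib import Path


def load_cloud_knobs(env_toml: Path) -> tuple[Path | None, int]:
    # Illustrative reader for the two new knobs; the surrounding env.toml
    # structure (top-level keys, no profile nesting) is an assumption.
    cfg = tomllib.loads(env_toml.read_text())
    repo_root = Path(cfg["repo_root"]).expanduser() if "repo_root" in cfg else None
    # run:ai budgets the per-workload Args field; 9500 is the documented default.
    max_args_chars = int(cfg.get("dgxcloud_max_args_chars", 9500))
    return repo_root, max_args_chars
```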

3. Slurm executor refactor + auto-mount fix

Slurm, Lepton, and DGX Cloud previously had three different code paths for executor construction; they now share one factory pattern in execution.py:

  • New create_slurm_executor (with a launcher kwarg so Ray flows can pass None), plus _create_local_executor / _create_docker_executor.
  • _resolve_container_image and _resolve_nodes_gpus helpers replace three duplicated resolution blocks. create_executor becomes a small dispatcher.
  • Five CLI commands (nano3/data/prep/{sft,rl,pretrain}.py, nano3/rl.py, super3/rl/_base.py) drop their ~80-line inline Slurm setup and delegate.

Auto-mount bug fix (Slurm): ${auto_mount:git+…} entries register themselves into get_git_mounts() only at OmegaConf resolution time. The previous code cloned repos before the mounts list was accessed, so the registry was empty when clone_git_repos_via_tunnel ran — the resulting sbatch script was missing its bind-mounts and the container silently fell back to pre-baked Megatron-LM / Megatron-Bridge, training against the wrong code. create_slurm_executor now reads env.mounts before the tunnel's clone step, forcing OmegaConf to walk the list and populate the registry.
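
Why the access order matters can be shown with a self-contained OmegaConf toy. The `auto_mount` resolver below is a stand-in with the same side-effecting shape, not the project's implementation, and the registry list and mount path are invented for the demo.

```python
from omegaconf import OmegaConf

_GIT_MOUNTS: list[str] = []


def _auto_mount(url: str) -> str:
    _GIT_MOUNTS.append(url)                  # side effect: register the mount
    return f"/mnt/{url.rsplit('/', 1)[-1]}"  # illustrative container path


OmegaConf.register_new_resolver("auto_mount", _auto_mount, replace=True)

env = OmegaConf.create(
    {"mounts": ["${auto_mount:'git+https://example.com/Megatron-LM'}"]}
)

print(_GIT_MOUNTS)                     # [] -- resolvers are lazy, nothing registered yet
mounts = [str(m) for m in env.mounts]  # walking the list resolves each entry...
print(_GIT_MOUNTS)                     # ...and now the registry is populated; clone only after this
```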

New env.toml field: identity — explicit SSH private-key path for the tunnel when the SSH agent isn't available.

4. super3 data-prep on cloud executors

  • super3 data prep {pretrain, rl, sft} now route through execute_cloud, matching the nano3 path; single-node CPU work is supported on cloud executors.
  • super3 stage2_rl data prep emits train/val splits and updates the manifest with per-split paths, so downstream RL stages pick up the correct artifacts without manual wiring.

5. Dependency + polish

  • nemo-run pinned to NVIDIA-NeMo main.
  • SourcePackager streamlined; plan_for unifies env-var chunking across Lepton and DGX Cloud.
  • New manifest resolver for data-prep output config (nemo_runspec/config/resolvers.py).
  • .gitignore whitelist fix: the CLI data dir moved from src/nemotron/cli/nano3/data/ to src/nemotron/cli/commands/nano3/data/; override entry updated (files were silently ignored before).

Breaking / behavioral changes

  • Source transport on Lepton / DGX Cloud no longer pip installs from git. Consumers depending on the old git-install path (custom dockerfiles, CI cache) need to adjust.
  • Auto-mount fix on Slurm changes what actually ships to training jobs — runs that accidentally succeeded against pre-baked Megatron-LM / Megatron-Bridge will now use the configured ${auto_mount:git+…} versions. Expect real behavior changes; re-baseline numbers for affected jobs.
  • cluster.start errors no longer swallowed — auth, quota, and image-pull failures propagate. CI or scripts that relied on silent retries may need updating.
  • gpus_per_node=0 is now honored when picking nproc (CPU-only cloud workloads no longer silently get a GPU nproc value); see the sketch after this list.
  • nemo-run pin updated — verify downstream consumers pin compatibly.
  • New env.toml fields (identity, repo_root, dgxcloud_max_args_chars) are additive and documented in docs/nemo_runspec/nemo-run.md.
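
A toy illustration of the nproc item above. The names and the CPU-only fallback of one process per node are assumptions; the point is the explicit `is None` check instead of a falsy one.

```python
def pick_nproc(gpus_per_node: int | None, default_gpus: int = 8) -> int:
    # Buggy variant: `gpus_per_node or default_gpus` treats an explicit 0 as "unset".
    if gpus_per_node is None:
        return default_gpus
    return max(gpus_per_node, 1)  # assumed: CPU-only workloads run one process per node


assert pick_nproc(None) == 8
assert pick_nproc(0) == 1   # explicit 0 now honored instead of silently becoming 8
assert pick_nproc(4) == 4
```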

Test plan

  • pytest tests/nemo_runspec/test_execution.py tests/nemo_runspec/test_data_mover.py passes
  • Full suite passes (131+ tests), lint clean
  • Slurm SFT data prep on CW DFW CS-001 succeeds (2 nodes, tiny)
  • Lepton data prep succeeds (1 node, A100)
  • Lepton multi-node SFT uses local-source transport (no git install); pod-side reassembly succeeds and NODE_RANK=0 marker gates the others
  • DGX Cloud (run:AI) job submission completes; chunks land in environmentVariables (not torchrun_job.sh); Args stays under budget
  • nano3/super3 RL on Lepton with nodes > 1 goes through execute_cloud_ray
  • Slurm sbatch script shows mount lines for ${auto_mount:git+…} (auto-mount fix regression guard)
  • super3 data prep (pretrain/rl/sft) on Lepton and DGX Cloud
  • super3 stage2_rl data prep emits train/ + val/ splits and updated manifest
  • Bad DGX Cloud creds surface a real auth error (narrowed except regression guard)
  • gpus_per_node=0 picks the right nproc (CPU-only regression guard)

Extends Nemotron executor layer to run on NVIDIA DGX Cloud Lepton and
DGX Cloud (run:ai) alongside the existing Slurm path, including multi-node
Ray-based RL (GRPO) via nemo-run's unified RayCluster + RayJob classes.

Changes
-------
* nemo-run pinned to the PR #480 build of rapaul-nv/Run (adds
  DGXCloudRayJob / LeptonRayCluster backends + client_credentials auth).
* New nemo_runspec.execution:
  - execute_cloud    : inline Script submission for non-Ray cloud jobs
                       (data prep, torchrun-based SFT/pretrain).
  - execute_cloud_ray: RayCluster + RayJob path for launch="ray" recipes;
                       used by nano3/super3 RL commands when on Lepton or
                       DGX Cloud with nodes > 1.
  - _create_lepton_executor / _create_dgxcloud_executor factories, aligned
    with nemo-run 0.10 (client_id/client_secret, kube_apiserver_url,
    ray_version).
  - get_executor_type helper shared across all CLI command dispatchers.
* CLI commands (nano3 + super3 sft/pretrain/rl + data/prep) route cloud
  jobs through the new execute_cloud / execute_cloud_ray paths.
* env.toml profiles: [lepton], [lepton_gcp], [lepton_sft*], [lepton_rl],
  [dgxcloud] plus a shared [wandb] section.
* Pipeline config accepts "dgxcloud"/"lepton" literals.
* Tests: tests/nemo_runspec/test_execution.py covers executor creation
  (local/docker/slurm/lepton/dgxcloud), OmegaConf <-> plain-dict conversion,
  cloud workspace derivation, and git-mount command generation.

Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
Airgap-friendly source transport for Lepton and DGX Cloud executors:
local src/ is tarballed and delivered over the same Job API the launcher
already uses, replacing the old `pip install nemotron @ git+…` step.

- Lepton: chunk tarball across env vars (envp bypasses the 128 KiB argv
  MAX_ARG_STRLEN). NODE_RANK=0 extracts on shared NFS; other pods wait
  on a chunk-count-suffixed marker file to avoid the multi-pod
  rm-rf-then-tar race.
- DGX Cloud (run:AI): drop tarball into job_dir as one `.tgz` file so
  move_data chunks only that file — torchrun_job.sh stays small. Env-var
  chunks would otherwise blow up the per-file deploy count.
- Fallback (Slurm / others): nemo-run native packager extraction.

Infrastructure:
- New nemo_runspec.data_mover module with SourcePackager, Plan, plan_for.
- run.py: patch_cloud_data_mover_skip_configs() excludes configs/ from
  nemo-run's inline-base64 tarball on both Lepton and DGXCloud, keeping
  the data-mover command under the kernel's MAX_ARG_STRLEN.
- execute_cloud / execute_cloud_ray dispatch through a single plan_for()
  call; removed ~250 lines of branching packager/env-var/flag code.
- env.toml knobs: `repo_root` (repo override), `dgxcloud_max_args_chars`
  (tune run:AI per-workload Args budget, default 9500).

Tests: tests/nemo_runspec/test_data_mover.py covers include scoping,
pycache filtering, all three transport branches, and the NODE_RANK gate.

Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
Slurm, Lepton, and DGX Cloud now share the same factory pattern. The
previous ``create_executor`` had ~80 lines of inline Slurm setup while
Lepton and DGX Cloud already used dedicated helpers. Five CLI commands
also maintained near-identical copies of ``_build_slurm_executor``.

Changes
-------

* Extract ``create_slurm_executor`` as a public helper (with a
  ``launcher`` kwarg so Ray-based flows can pass ``None``). Also extract
  ``_create_local_executor`` and ``_create_docker_executor``.
* Add ``_resolve_container_image`` and ``_resolve_nodes_gpus`` shared
  helpers that replace the three duplicated resolution blocks.
* ``create_executor`` is now a small dispatch function.
* Accept an ``identity`` field in env.toml so the SSH tunnel can be
  given an explicit private-key path when the agent isn't available.

Bug fix for Slurm auto-mounts
-----------------------------

On Slurm, ``${auto_mount:git+...}`` entries in a recipe config only
register the requested repo into the ``get_git_mounts()`` registry at
the moment OmegaConf resolves them. The previous code cloned the repos
*before* the mounts list was accessed, so the registry was empty and
the resulting sbatch script was missing the bind mounts. The container
then silently fell back to the pre-baked Megatron-LM / Megatron-Bridge
versions, causing training to run against the wrong code.

``_create_slurm_executor`` now reads ``env.mounts`` before cloning,
which triggers resolution and populates the registry.

CLI commands
------------

The following commands no longer carry their own Slurm executor setup
and delegate to ``create_slurm_executor`` (with ``launcher=None`` for
Ray flows):

* ``nano3/data/prep/sft.py``
* ``nano3/data/prep/rl.py``
* ``nano3/data/prep/pretrain.py``
* ``nano3/rl.py``
* ``super3/rl/_base.py`` (persistent_cache and sif_dir extras applied
  on top of the returned executor)

Misc
----

* Fix stale ``.gitignore`` whitelist: the CLI data directory moved from
  ``src/nemotron/cli/nano3/data/`` to
  ``src/nemotron/cli/commands/nano3/data/`` but the override entry was
  never updated, so the files were silently ignored.

Verification
------------

* 131/131 unit tests pass.
* Lint clean on all touched files.
* End-to-end Slurm SFT data prep succeeds on CW DFW CS-001 (2 nodes,
  10k samples, 47M tokens).

Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
- Updated the `nemo-run` dependency to point to the main branch of the NVIDIA-NeMo repository.
- Enhanced the `SourcePackager` class to streamline tarball creation for job directories.
- Refactored the `plan_for` function to unify environment variable chunking for Lepton and DGX Cloud executors, improving data transport efficiency.
- Introduced new patches to handle legacy keyword arguments and optimize environment variable handling in DGX Cloud.
- Added a manifest resolver to facilitate configuration management for data preparation outputs.

This commit ensures better compatibility with upstream changes and enhances the overall performance of the data mover functionality across different execution environments.
- Added support for executing data preparation commands (pretrain, rl, sft) on DGX Cloud and Lepton using the new `execute_cloud` function.
- Implemented logic to handle single-node CPU work for cloud environments, ensuring compatibility with the existing data preparation workflow.
- Introduced functions to create train/val splits and update the manifest with per-split paths in the RL data preparation process.

This update improves the flexibility and efficiency of data preparation across different execution environments.
rapaul-nv and others added 2 commits April 28, 2026 00:32
- docs: document required kube_apiserver_url; promote client_id/client_secret
  as primary auth fields with app_id/app_secret as legacy aliases
- run.py: use argv list form for tar to avoid shell-metachar issues in
  nemo_run_dir paths
- execution.py: honor explicit gpus_per_node=0 when picking nproc
- execution.py: rename shadowed local launch -> launch_cmd
- execution.py: narrow cluster.start except to "already exists" idempotency,
  re-raise auth/quota/image-pull failures

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rapaul-nv rapaul-nv requested a review from marcromeyn April 28, 2026 08:49
@marcromeyn (Contributor) commented

> env.toml gains [lepton], [lepton_gcp], [lepton_sft*], [lepton_rl], [dgxcloud] profiles plus a shared [wandb] section.

Why is the lepton_sft* needed?

@rapaul-nv (Author) commented

@marcromeyn Updated the PR description as it was old and not well summarized!
