Add support for Lepton and DGX Cloud executors #162

Open

rapaul-nv wants to merge 8 commits into NVIDIA-NeMo:main from rkalaniNV:rapaul/customization-recipe-main

Conversation


rapaul-nv commented Apr 23, 2026

Summary

Brings nano3 and super3 recipes onto NVIDIA DGX Cloud Lepton and DGX Cloud (run:ai) alongside the existing Slurm path — including data prep, SFT/pretrain (torchrun), and multi-node Ray GRPO RL. Replaces pip install … @ git+… on cloud pods with a unified local-source transport, refactors the Slurm executor into a shared factory (fixing a silent auto-mount bug), wires super3 data-prep through the cloud path, and bumps nemo-run to upstream main.

What this PR delivers

1. Lepton + DGX Cloud executor support

A new cloud execution path through nemo_runspec.execution:

  • execute_cloud — inline run.Script submission for non-Ray cloud jobs (data prep, torchrun SFT/pretrain).
  • execute_cloud_ray — RayCluster + RayJob for launch="ray" recipes; used by nano3/super3 RL on Lepton/DGX Cloud at nodes > 1.
  • Lepton + DGX Cloud factory functions aligned with nemo-run 0.10 (client_id/client_secret, kube_apiserver_url, ray_version).
  • Shared get_executor_type helper for CLI dispatchers; pipeline configs accept "lepton" / "dgxcloud" literals.
  • nano3 and super3 CLI commands (sft, pretrain, rl, data/prep/*) all dispatch through this; a minimal dispatch sketch follows below.
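
A rough sketch of what that dispatch could look like. `get_executor_type`, `execute_cloud`, and `execute_cloud_ray` are the names this PR introduces; the config shape and the stubbed routing below are illustrative assumptions, not the actual CLI code.

```python
from dataclasses import dataclass


@dataclass
class RunConfig:
    executor: str              # "slurm" | "lepton" | "dgxcloud" | ...
    launch: str = "torchrun"   # recipes that need Ray set launch="ray"
    nodes: int = 1


def get_executor_type(cfg: RunConfig) -> str:
    # Stand-in for the shared helper; the real one reads the pipeline config.
    return cfg.executor


def dispatch(cfg: RunConfig) -> str:
    # Cloud executors go through the new cloud path; Ray recipes at nodes > 1
    # take the RayCluster + RayJob route, everything else is an inline run.Script.
    if get_executor_type(cfg) in ("lepton", "dgxcloud"):
        if cfg.launch == "ray" and cfg.nodes > 1:
            return "execute_cloud_ray"
        return "execute_cloud"
    return "native executor (Slurm / local / docker)"


print(dispatch(RunConfig(executor="lepton", launch="ray", nodes=4)))  # execute_cloud_ray
print(dispatch(RunConfig(executor="dgxcloud")))                       # execute_cloud
```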

2. Local-source transport for cloud pods

A new nemo_runspec.data_mover module replaces pip install nemotron @ git+… with a single tarball-shipping flow. One function, plan_for(executor_type, ...), returns a Plan describing how to deliver src/ to the pod:

| Executor | Transport | Why |
| --- | --- | --- |
| Slurm (and other native paths) | nemo-run's native packager extracts the tarball into `/nemo_run/code/src` | already works; no chunking needed |
| Lepton | base64 the tarball, chunk into env vars (96 KiB each), pod reassembles with a one-liner `python3 -c '…' \| tar -xz` | argv is capped at 128 KiB (`MAX_ARG_STRLEN`); env vars bypass it |
| DGX Cloud (run:ai) | same env-var chunking, 9 KiB per chunk | run:ai hard-caps each env-var value at 10,000 chars |

Both cloud paths share the same code path; only the chunk size differs. The pod-side reassembly script is identical.
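
For concreteness, here is a minimal sketch of that chunking scheme. Only the chunk sizes and the base64-into-env-vars idea come from the PR; the env-var names (`SRC_CHUNK_<i>`, `SRC_NUM_CHUNKS`), the extraction directory, and the exact reassembly one-liner are assumptions for illustration.

```python
import base64

# Chunk sizes from the table above; everything else in this sketch is illustrative.
LEPTON_CHUNK = 96 * 1024    # argv is capped by MAX_ARG_STRLEN (~128 KiB); env vars are not
DGXCLOUD_CHUNK = 9 * 1024   # run:ai hard-caps each env-var value at 10,000 chars


def chunk_tarball(tar_bytes: bytes, chunk_size: int) -> dict[str, str]:
    """Split a base64-encoded tarball across environment variables."""
    encoded = base64.b64encode(tar_bytes).decode("ascii")
    env = {
        f"SRC_CHUNK_{i}": encoded[off : off + chunk_size]
        for i, off in enumerate(range(0, len(encoded), chunk_size))
    }
    env["SRC_NUM_CHUNKS"] = str(len(env))
    return env


# Pod-side reassembly: join the chunks, base64-decode, stream into tar.
REASSEMBLE_CMD = (
    "python3 -c 'import os,sys,base64; n=int(os.environ[\"SRC_NUM_CHUNKS\"]); "
    "sys.stdout.buffer.write(base64.b64decode(\"\".join("
    "os.environ[f\"SRC_CHUNK_{i}\"] for i in range(n))))' | tar -xz -C /nemo_run/code"
)
```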

Multi-pod NFS race fix (Lepton/DGX Cloud): when N pods share the same destination on NFS, each would otherwise rm -rf && tar -xz concurrently and clobber each other. NODE_RANK=0 extracts and drops a marker; other ranks wait on it. The marker name is suffixed with the chunk count so stale markers from prior runs can't mislead waiters.
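
A sketch of that rank-0 gate follows. The real logic lives in nemo_runspec.data_mover; the function name, paths, and polling interval below are invented for the example, while the chunk-count-suffixed marker and the NODE_RANK=0 gating mirror the description above.

```python
import os
import pathlib
import subprocess
import time


def extract_or_wait(dest: str, tarball: str, num_chunks: int, timeout_s: int = 600) -> None:
    dest_path = pathlib.Path(dest)
    # Suffixing the marker with the chunk count means a marker left behind by a
    # previous run (with a different payload) cannot satisfy this run's waiters.
    marker = dest_path.parent / f".src_extracted.{num_chunks}"

    if int(os.environ.get("NODE_RANK", "0")) == 0:
        # Only rank 0 clobbers and re-extracts the shared NFS destination.
        subprocess.run(["rm", "-rf", str(dest_path)], check=True)
        dest_path.mkdir(parents=True, exist_ok=True)
        subprocess.run(["tar", "-xzf", tarball, "-C", str(dest_path)], check=True)
        marker.touch()
        return

    # Non-zero ranks wait for rank 0's marker instead of racing the extraction.
    deadline = time.time() + timeout_s
    while not marker.exists():
        if time.time() > deadline:
            raise TimeoutError(f"rank 0 never dropped {marker}")
        time.sleep(2)
```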

Auto-discovery of what to ship (_auto_includes): walks <repo>/src/*, ships every top-level package, and for packages with a recipes/ subdir ships only the active recipe family inferred from the script path (e.g. nano3) — keeps the tarball small.
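
Roughly what that discovery might look like; the name `_auto_includes` and the recipe-family inference come from the description, while the signature, the hard-coded family list, and the return shape are assumptions.

```python
from pathlib import Path


def _auto_includes(repo_root: Path, script_path: Path) -> list[Path]:
    # Infer the active recipe family (e.g. "nano3") from the submitted script's path.
    family = next((p for p in script_path.parts if p in ("nano3", "super3")), None)

    includes: list[Path] = []
    for pkg in sorted((repo_root / "src").iterdir()):
        if not pkg.is_dir():
            continue
        recipes = pkg / "recipes"
        if recipes.is_dir() and family is not None:
            # Ship the package minus recipes/, plus only the active family's recipes.
            includes.extend(p for p in pkg.iterdir() if p.name != "recipes")
            includes.append(recipes / family)
        else:
            includes.append(pkg)
    return includes
```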

Three nemo-run patches keep this working:

  • patch_cloud_data_mover_skip_configs() — exclude configs/ from nemo-run's inline-base64 tarball (otherwise the data-mover command exceeds kernel argv limits).
  • patch_dgxcloud_strip_source_chunks_from_exports() — keep chunks in run:AI's structured environmentVariables field instead of letting them get re-baked into torchrun_job.sh (otherwise move_data spends ~12 min chunking the script).
  • patch_dgxcloud_accept_legacy_kwargs() — silence the app_id keyword warning that fiddle prints on every status poll.

Net code impact: execute_cloud / execute_cloud_ray dispatch through a single plan_for() call, removing ~250 lines of branching packager / env-var code that previously lived inline in execution.py.

New env.toml knobs: repo_root (override repo location), dgxcloud_max_args_chars (run:AI per-workload Args budget, default 9500).
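
How those knobs might be consumed is sketched below; only the field names and the 9500 default come from the PR, while the env.toml layout and this loader are assumptions.

```python
import tomllib
from pathlib import Path


def load_cloud_knobs(env_toml: Path) -> tuple[Path | None, int]:
    # Illustrative reader for the two new knobs; the surrounding env.toml
    # structure (top-level keys, no profile nesting) is an assumption.
    cfg = tomllib.loads(env_toml.read_text())
    repo_root = Path(cfg["repo_root"]).expanduser() if "repo_root" in cfg else None
    # run:ai budgets the per-workload Args field; 9500 is the documented default.
    max_args_chars = int(cfg.get("dgxcloud_max_args_chars", 9500))
    return repo_root, max_args_chars
```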

3. Slurm executor refactor + auto-mount fix

Slurm, Lepton, and DGX Cloud previously had three different code paths for executor construction; they now share one factory pattern in execution.py:

  • New create_slurm_executor (with a launcher kwarg so Ray flows can pass None), plus _create_local_executor / _create_docker_executor.
  • _resolve_container_image and _resolve_nodes_gpus helpers replace three duplicated resolution blocks. create_executor becomes a small dispatcher.
  • Five CLI commands (nano3/data/prep/{sft,rl,pretrain}.py, nano3/rl.py, super3/rl/_base.py) drop their ~80-line inline Slurm setup and delegate.

Auto-mount bug fix (Slurm): ${auto_mount:git+…} entries register themselves into get_git_mounts() only at OmegaConf resolution time. The previous code cloned repos before the mounts list was accessed, so the registry was empty when clone_git_repos_via_tunnel ran — the resulting sbatch script was missing its bind-mounts and the container silently fell back to pre-baked Megatron-LM / Megatron-Bridge, training against the wrong code. create_slurm_executor now reads env.mounts before the tunnel's clone step, forcing OmegaConf to walk the list and populate the registry.
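
Why the access order matters can be shown with a self-contained OmegaConf toy. The `auto_mount` resolver below is a stand-in with the same side-effecting shape, not the project's implementation, and the registry list and mount path are invented for the demo.

```python
from omegaconf import OmegaConf

_GIT_MOUNTS: list[str] = []


def _auto_mount(url: str) -> str:
    _GIT_MOUNTS.append(url)                  # side effect: register the mount
    return f"/mnt/{url.rsplit('/', 1)[-1]}"  # illustrative container path


OmegaConf.register_new_resolver("auto_mount", _auto_mount, replace=True)

env = OmegaConf.create(
    {"mounts": ["${auto_mount:'git+https://example.com/Megatron-LM'}"]}
)

print(_GIT_MOUNTS)                     # [] -- resolvers are lazy, nothing registered yet
mounts = [str(m) for m in env.mounts]  # walking the list resolves each entry...
print(_GIT_MOUNTS)                     # ...and now the registry is populated; clone only after this
```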

New env.toml field: identity — explicit SSH private-key path for the tunnel when the SSH agent isn't available.

4. super3 data-prep on cloud executors

  • super3 data prep {pretrain, rl, sft} now route through execute_cloud, matching the nano3 path; single-node CPU work is supported on cloud executors.
  • super3 stage2_rl data prep emits train/val splits and updates the manifest with per-split paths, so downstream RL stages pick up the correct artifacts without manual wiring.

5. Dependency + polish

  • nemo-run pinned to NVIDIA-NeMo main.
  • SourcePackager streamlined; plan_for unifies env-var chunking across Lepton and DGX Cloud.
  • New manifest resolver for data-prep output config (nemo_runspec/config/resolvers.py).
  • .gitignore whitelist fix: the CLI data dir moved from src/nemotron/cli/nano3/data/ to src/nemotron/cli/commands/nano3/data/; override entry updated (files were silently ignored before).

Breaking / behavioral changes

  • Source transport on Lepton / DGX Cloud no longer pip installs from git. Consumers depending on the old git-install path (custom dockerfiles, CI cache) need to adjust.
  • Auto-mount fix on Slurm changes what actually ships to training jobs — runs that accidentally succeeded against pre-baked Megatron-LM / Megatron-Bridge will now use the configured ${auto_mount:git+…} versions. Expect real behavior changes; re-baseline numbers for affected jobs.
  • cluster.start errors no longer swallowed — auth, quota, and image-pull failures propagate. CI or scripts that relied on silent retries may need updating.
  • gpus_per_node=0 is now honored when picking nproc (CPU-only cloud workloads no longer silently get a GPU nproc value); see the sketch after this list.
  • nemo-run pin updated — verify downstream consumers pin compatibly.
  • New env.toml fields (identity, repo_root, dgxcloud_max_args_chars) are additive and documented in docs/nemo_runspec/nemo-run.md.
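
A toy illustration of the nproc item above. The names and the CPU-only fallback of one process per node are assumptions; the point is the explicit `is None` check instead of a falsy one.

```python
def pick_nproc(gpus_per_node: int | None, default_gpus: int = 8) -> int:
    # Buggy variant: `gpus_per_node or default_gpus` treats an explicit 0 as "unset".
    if gpus_per_node is None:
        return default_gpus
    return max(gpus_per_node, 1)  # assumed: CPU-only workloads run one process per node


assert pick_nproc(None) == 8
assert pick_nproc(0) == 1   # explicit 0 now honored instead of silently becoming 8
assert pick_nproc(4) == 4
```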

Test plan

  • pytest tests/nemo_runspec/test_execution.py tests/nemo_runspec/test_data_mover.py passes
  • Full suite passes (131+ tests), lint clean
  • Slurm SFT data prep on CW DFW CS-001 succeeds (2 nodes, tiny)
  • Lepton data prep succeeds (1 node, A100)
  • Lepton multi-node SFT uses local-source transport (no git install); pod-side reassembly succeeds and NODE_RANK=0 marker gates the others
  • DGX Cloud (run:AI) job submission completes; chunks land in environmentVariables (not torchrun_job.sh); Args stays under budget
  • nano3/super3 RL on Lepton with nodes > 1 goes through execute_cloud_ray
  • Slurm sbatch script shows mount lines for ${auto_mount:git+…} (auto-mount fix regression guard)
  • super3 data prep (pretrain/rl/sft) on Lepton and DGX Cloud
  • super3 stage2_rl data prep emits train/ + val/ splits and updated manifest
  • Bad DGX Cloud creds surface a real auth error (narrowed except regression guard)
  • gpus_per_node=0 picks the right nproc (CPU-only regression guard)

Extends Nemotron executor layer to run on NVIDIA DGX Cloud Lepton and
DGX Cloud (run:ai) alongside the existing Slurm path, including multi-node
Ray-based RL (GRPO) via nemo-run's unified RayCluster + RayJob classes.

Changes
-------
* nemo-run pinned to the PR #480 build of rapaul-nv/Run (adds
  DGXCloudRayJob / LeptonRayCluster backends + client_credentials auth).
* New nemo_runspec.execution:
  - execute_cloud    : inline Script submission for non-Ray cloud jobs
                       (data prep, torchrun-based SFT/pretrain).
  - execute_cloud_ray: RayCluster + RayJob path for launch="ray" recipes;
                       used by nano3/super3 RL commands when on Lepton or
                       DGX Cloud with nodes > 1.
  - _create_lepton_executor / _create_dgxcloud_executor factories, aligned
    with nemo-run 0.10 (client_id/client_secret, kube_apiserver_url,
    ray_version).
  - get_executor_type helper shared across all CLI command dispatchers.
* CLI commands (nano3 + super3 sft/pretrain/rl + data/prep) route cloud
  jobs through the new execute_cloud / execute_cloud_ray paths.
* env.toml profiles: [lepton], [lepton_gcp], [lepton_sft*], [lepton_rl],
  [dgxcloud] plus a shared [wandb] section.
* Pipeline config accepts "dgxcloud"/"lepton" literals.
* Tests: tests/nemo_runspec/test_execution.py covers executor creation
  (local/docker/slurm/lepton/dgxcloud), OmegaConf <-> plain-dict conversion,
  cloud workspace derivation, and git-mount command generation.

Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
Airgap-friendly source transport for Lepton and DGX Cloud executors:
local src/ is tarballed and delivered over the same Job API the launcher
already uses, replacing the old `pip install nemotron @ git+…` step.

- Lepton: chunk tarball across env vars (envp bypasses the 128 KiB argv
  MAX_ARG_STRLEN). NODE_RANK=0 extracts on shared NFS; other pods wait
  on a chunk-count-suffixed marker file to avoid the multi-pod
  rm-rf-then-tar race.
- DGX Cloud (run:AI): drop tarball into job_dir as one `.tgz` file so
  move_data chunks only that file — torchrun_job.sh stays small. Env-var
  chunks would otherwise blow up the per-file deploy count.
- Fallback (Slurm / others): nemo-run native packager extraction.

Infrastructure:
- New nemo_runspec.data_mover module with SourcePackager, Plan, plan_for.
- run.py: patch_cloud_data_mover_skip_configs() excludes configs/ from
  nemo-run's inline-base64 tarball on both Lepton and DGXCloud, keeping
  the data-mover command under the kernel's MAX_ARG_STRLEN.
- execute_cloud / execute_cloud_ray dispatch through a single plan_for()
  call; removed ~250 lines of branching packager/env-var/flag code.
- env.toml knobs: `repo_root` (repo override), `dgxcloud_max_args_chars`
  (tune run:AI per-workload Args budget, default 9500).

Tests: tests/nemo_runspec/test_data_mover.py covers include scoping,
pycache filtering, all three transport branches, and the NODE_RANK gate.

Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
Slurm, Lepton, and DGX Cloud now share the same factory pattern. The
previous ``create_executor`` had ~80 lines of inline Slurm setup while
Lepton and DGX Cloud already used dedicated helpers. Five CLI commands
also maintained near-identical copies of ``_build_slurm_executor``.

Changes
-------

* Extract ``create_slurm_executor`` as a public helper (with a
  ``launcher`` kwarg so Ray-based flows can pass ``None``). Also extract
  ``_create_local_executor`` and ``_create_docker_executor``.
* Add ``_resolve_container_image`` and ``_resolve_nodes_gpus`` shared
  helpers that replace the three duplicated resolution blocks.
* ``create_executor`` is now a small dispatch function.
* Accept an ``identity`` field in env.toml so the SSH tunnel can be
  given an explicit private-key path when the agent isn't available.

Bug fix for Slurm auto-mounts
-----------------------------

On Slurm, ``${auto_mount:git+...}`` entries in a recipe config only
register the requested repo into the ``get_git_mounts()`` registry at
the moment OmegaConf resolves them. The previous code cloned the repos
*before* the mounts list was accessed, so the registry was empty and
the resulting sbatch script was missing the bind mounts. The container
then silently fell back to the pre-baked Megatron-LM / Megatron-Bridge
versions, causing training to run against the wrong code.

``_create_slurm_executor`` now reads ``env.mounts`` before cloning,
which triggers resolution and populates the registry.

CLI commands
------------

The following commands no longer carry their own Slurm executor setup
and delegate to ``create_slurm_executor`` (with ``launcher=None`` for
Ray flows):

* ``nano3/data/prep/sft.py``
* ``nano3/data/prep/rl.py``
* ``nano3/data/prep/pretrain.py``
* ``nano3/rl.py``
* ``super3/rl/_base.py`` (persistent_cache and sif_dir extras applied
  on top of the returned executor)

Misc
----

* Fix stale ``.gitignore`` whitelist: the CLI data directory moved from
  ``src/nemotron/cli/nano3/data/`` to
  ``src/nemotron/cli/commands/nano3/data/`` but the override entry was
  never updated, so the files were silently ignored.

Verification
------------

* 131/131 unit tests pass.
* Lint clean on all touched files.
* End-to-end Slurm SFT data prep succeeds on CW DFW CS-001 (2 nodes,
  10k samples, 47M tokens).

Signed-off-by: Rakesh Paul <rapaul@nvidia.com>
- Updated the `nemo-run` dependency to point to the main branch of the NVIDIA-NeMo repository.
- Enhanced the `SourcePackager` class to streamline tarball creation for job directories.
- Refactored the `plan_for` function to unify environment variable chunking for Lepton and DGX Cloud executors, improving data transport efficiency.
- Introduced new patches to handle legacy keyword arguments and optimize environment variable handling in DGX Cloud.
- Added a manifest resolver to facilitate configuration management for data preparation outputs.

This commit ensures better compatibility with upstream changes and enhances the overall performance of the data mover functionality across different execution environments.
- Added support for executing data preparation commands (pretrain, rl, sft) on DGX Cloud and Lepton using the new `execute_cloud` function.
- Implemented logic to handle single-node CPU work for cloud environments, ensuring compatibility with the existing data preparation workflow.
- Introduced functions to create train/val splits and update the manifest with per-split paths in the RL data preparation process.

This update improves the flexibility and efficiency of data preparation across different execution environments.
rapaul-nv and others added 2 commits April 28, 2026 00:32
- docs: document required kube_apiserver_url; promote client_id/client_secret
  as primary auth fields with app_id/app_secret as legacy aliases
- run.py: use argv list form for tar to avoid shell-metachar issues in
  nemo_run_dir paths
- execution.py: honor explicit gpus_per_node=0 when picking nproc
- execution.py: rename shadowed local launch -> launch_cmd
- execution.py: narrow cluster.start except to "already exists" idempotency,
  re-raise auth/quota/image-pull failures

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rapaul-nv rapaul-nv requested a review from marcromeyn April 28, 2026 08:49
@marcromeyn (Contributor) commented

> env.toml gains [lepton], [lepton_gcp], [lepton_sft*], [lepton_rl], [dgxcloud] profiles plus a shared [wandb] section.

Why is the lepton_sft* needed?

@rapaul-nv (Author) commented

@marcromeyn Updated the PR description as it was old and not well summarized!
