
feat(embed): multi-GPU finetune via shared nemo_runspec.execute_uv_local #165

Draft

oliverholworthy wants to merge 7 commits into main from oholworthy/embed-finetune-multi-gpu

Conversation

@oliverholworthy
Contributor

Summary

  • Adds nemo_runspec.execution.execute_uv_local as a sibling to execute_local, with container-torch / CUDA-mismatch handling and
    pre_script_args / extras / extra_with hooks (see the call sketch after this list).
  • Migrates the 5 embed CLI commands (eval, export, finetune, prep, sdg) off their duplicated inline UV-subprocess bodies onto the shared helper (−75 net LoC).
  • Wires embed finetune's local launch path through torch.distributed.run --nproc_per_node=gpu, and sets PEP 723 launch = "torchrun" so the remote/Slurm
    path also runs distributed.
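
A minimal sketch of a call into the shared helper, for orientation. The
hook names (pre_script_args, extras, extra_with) come from this PR; the
signature and argument shapes here are assumptions for illustration:

  from nemo_runspec.execution import execute_uv_local

  # Hypothetical call shape for the finetune stage; only the hook names
  # are taken from this PR, the rest is illustrative.
  execute_uv_local(
      script="src/nemotron/recipes/embed/stage2_finetune/finetune.py",
      script_args=["-c", "default"],
      # Prepended before the script path, so the local launch becomes:
      #   python -m torch.distributed.run --nproc_per_node=gpu finetune.py -c default
      pre_script_args=["-m", "torch.distributed.run", "--nproc_per_node=gpu"],
      extras=["tensorrt"],         # forwarded to UV as --extra tensorrt
      extra_with=["tensorboard"],  # additional --with requirements, if any
  )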

Rationale for the helper landing in nemo_runspec

Local-execution helpers should live together (execute_local already lives there). The rerank preview branch keeps its copy in
nemotron.kit.uv_local; once this lands, that preview can migrate with an import-path change.

Test plan

  • nemotron embed finetune -c default on a single-GPU host (torchrun degrades to 1 worker).
  • nemotron embed finetune -c default on a multi-GPU host, confirm all GPUs utilized.
  • nemotron embed {eval,export,prep,sdg} still run locally (regression check for the migration).
  • nemotron embed export ... export_to_trt=true still picks up the TensorRT extra.
  • Remote/Slurm submission for embed finetune uses the torchrun launcher.

…ecution

Add execute_uv_local alongside execute_local. When torch is already
importable (e.g., inside an NVIDIA container), the helper creates a
venv with --system-site-packages and excludes torch from UV
resolution. This avoids the CUDA version mismatch where UV's
torch-backend=auto detects the kernel driver's CUDA version (via
nvidia-smi) but the container's libcuda.so is a different version.

When torch is NOT importable (bare machine), the helper falls back to
uv run --with torch with UV_TORCH_BACKEND=auto.

Move _write_temp_pyproject into nemo_runspec._pyproject so both the
new helper and nemotron.kit.run_uv (the remote/Slurm wrapper) share
one implementation.

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
Replace the duplicated inline _execute_uv_local bodies in embed
{eval,export,finetune,prep,sdg}.py with calls to the shared
nemo_runspec.execution.execute_uv_local helper. No behavior change
on the bare-machine path; the commands gain container-torch /
CUDA-mismatch handling automatically when torch is pre-installed.

export.py passes the tensorrt extra via the new extras= kwarg instead
of assembling --extra on the command line directly.
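
Illustratively, the per-stage change has this shape (the "before" body
is reconstructed from the description and the helper's signature is
assumed, so treat this as a sketch rather than the literal diff):

  import subprocess

  from nemo_runspec.execution import execute_uv_local

  args = ["export_to_trt=true"]

  # Before: each stage assembled --extra on the UV command line itself.
  subprocess.run(["uv", "run", "--extra", "tensorrt", "export.py", *args], check=True)

  # After: one shared helper call; the extra is passed structurally.
  execute_uv_local("export.py", args, extras=["tensorrt"])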

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
Use torch.distributed.run with --nproc_per_node=gpu so training
automatically uses all available GPUs (works correctly with 1 GPU too).

The local path goes through execute_uv_local's pre_script_args hook;
the remote path is selected by the PEP 723 launch = "torchrun" header.
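
For intuition, torchrun's gpu value sizes the worker group from the
visible CUDA devices, roughly equivalent to:

  import torch

  # --nproc_per_node=gpu launches one worker per visible device, so a
  # single-GPU host degrades to one worker and an 8-GPU host gets eight.
  nproc_per_node = torch.cuda.device_count()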

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
…ecution

Add execute_uv_local alongside execute_local. When torch is already
importable (e.g., inside an NVIDIA container), the helper creates a
venv with --system-site-packages and excludes torch from UV
resolution. This avoids the CUDA version mismatch where UV's
torch-backend detects the kernel driver's CUDA version but the
container's libcuda.so is a different version.
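
A hedged sketch of the container branch (helper name and plumbing are
illustrative, not the exact implementation):

  import importlib.util
  import subprocess

  def torch_preinstalled() -> bool:
      # True inside e.g. an NVIDIA container that ships torch preinstalled.
      return importlib.util.find_spec("torch") is not None

  if torch_preinstalled():
      # Reuse the system torch: the venv sees system site-packages, and
      # torch is dropped from the stage's requirement set so UV never
      # resolves a wheel built against a different CUDA version.
      subprocess.run(["uv", "venv", "--system-site-packages"], check=True)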

On bare machines, runs `uv run --project <stage>` against the stage's
pyproject + uv.lock with no extra `--with torch`. If the stage declares
mutually exclusive cuXXX optional-dependencies (the standard UV
multi-CUDA pattern from
https://docs.astral.sh/uv/guides/integration/pytorch/), auto-detects
the host's NVIDIA driver and passes `--extra cuXXX` so UV picks a
torch wheel matching the driver.

The driver detection logic and the driver→cuXXX table are ported from
astral-sh/uv (MIT/Apache-2.0):
  - crates/uv-torch/src/accelerator.rs (detection order: env override,
    /sys/module/nvidia/version, /proc/driver/nvidia/version, nvidia-smi)
  - crates/uv-torch/src/backend.rs (LINUX_CUDA_DRIVERS table)
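
A condensed Python port of that detection order (the env-var name is
hypothetical and the driver→cuXXX table is elided; see accelerator.rs
for the real logic):

  import os
  import re
  import subprocess
  from pathlib import Path

  def nvidia_driver_version() -> str | None:
      # 1. explicit override (env-var name here is hypothetical)
      if override := os.environ.get("NVIDIA_DRIVER_VERSION"):
          return override
      # 2. /sys/module/nvidia/version
      sys_file = Path("/sys/module/nvidia/version")
      if sys_file.exists():
          return sys_file.read_text().strip()
      # 3. /proc/driver/nvidia/version
      proc_file = Path("/proc/driver/nvidia/version")
      if proc_file.exists():
          match = re.search(r"Kernel Module\s+(\S+)", proc_file.read_text())
          if match:
              return match.group(1)
      # 4. nvidia-smi as a last resort
      try:
          out = subprocess.check_output(
              ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
              text=True,
          )
          return out.splitlines()[0].strip()
      except (OSError, subprocess.CalledProcessError):
          return None

  # The returned version is then matched against the ported
  # LINUX_CUDA_DRIVERS table to pick an --extra cuXXX, if one is declared.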

UV_TORCH_BACKEND=auto is intentionally NOT set — it is honored by
`uv pip`/`uv add`/`uv sync`, not by `uv run --with`/`uv run --project`,
so it would be a no-op here (per Steve Han's investigation).

Move _write_temp_pyproject into nemo_runspec._pyproject so both the
new helper and nemotron.kit.run_uv (the remote/Slurm wrapper) share
one implementation.

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
Add a [tool.uv.sources] override in each embed stage's pyproject.toml
mapping torch to the pytorch-cu129 index on Linux. cu129 wheels are
forward-compatible with CUDA 13.x drivers (newer drivers run older
toolkit binaries) and match the existing torch source declared by
nemo-automodel (commit ecd7cb4), so the resolver doesn't fall through
to the latest PyPI default (currently cu130), which fails on hosts
with a CUDA 12.9 driver.
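
The stanza in question looks roughly like this (the index name follows
nemo-automodel's pytorch-cu129 naming; the exact marker and URL are
assumptions based on UV's documented PyTorch pattern):

  [tool.uv.sources]
  torch = [
    { index = "pytorch-cu129", marker = "sys_platform == 'linux'" },
  ]

  [[tool.uv.index]]
  name = "pytorch-cu129"
  url = "https://download.pytorch.org/whl/cu129"
  explicit = true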

Multi-extra (cu129/cu130) was attempted first but UV reported
"conflicting indexes for package torch in all marker environments"
because nemo-automodel pins Linux to pytorch-cu129 unconditionally and
that mapping is honored even when nemo-automodel is a transitive
git-installed dependency. Aligning with nemo-automodel's pattern is the
simplest path that resolves cleanly. When nemo-automodel adopts
multi-extra, this can be revisited.

The CLI's CUDA-extra auto-detection (in
nemo_runspec.execution._pick_cuda_extra) remains in place but is inert
for these stages because no cu* extras are declared.

NOTE: each stage's uv.lock needs to be regenerated:

  for s in stage0_sdg stage1_data_prep stage2_finetune stage3_eval stage4_export; do
    uv lock --project src/nemotron/recipes/embed/$s
  done

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
@oliverholworthy force-pushed the oholworthy/embed-finetune-multi-gpu branch from c9f1fc8 to 45639dc on April 28, 2026 14:26