
feat(embed): multi-GPU finetune via shared nemo_runspec.execute_uv_local #165

Draft

oliverholworthy wants to merge 7 commits into main from oholworthy/embed-finetune-multi-gpu

Conversation

@oliverholworthy
Contributor

Summary

  • Adds nemo_runspec.execution.execute_uv_local as a sibling to execute_local, with container-torch / CUDA-mismatch handling and
    pre_script_args / extras / extra_with hooks (see the call sketch after this list).
  • Migrates the 5 embed CLI commands (eval, export, finetune, prep, sdg) off their duplicated inline UV-subprocess bodies onto the shared helper (−75 net LoC).
  • Wires embed finetune's local launch path through torch.distributed.run --nproc_per_node=gpu, and sets PEP 723 launch = "torchrun" so the remote/Slurm
    path also runs distributed.
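
A minimal sketch of a call into the shared helper, for orientation. The
hook names (pre_script_args, extras, extra_with) come from this PR; the
signature and argument shapes here are assumptions for illustration:

  from nemo_runspec.execution import execute_uv_local

  # Hypothetical call shape for the finetune stage; only the hook names
  # are taken from this PR, the rest is illustrative.
  execute_uv_local(
      script="src/nemotron/recipes/embed/stage2_finetune/finetune.py",
      script_args=["-c", "default"],
      # Prepended before the script path, so the local launch becomes:
      #   python -m torch.distributed.run --nproc_per_node=gpu finetune.py -c default
      pre_script_args=["-m", "torch.distributed.run", "--nproc_per_node=gpu"],
      extras=["tensorrt"],         # forwarded to UV as --extra tensorrt
      extra_with=["tensorboard"],  # additional --with requirements, if any
  )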

Rationale for the helper landing in nemo_runspec

Local-execution helpers should live together (execute_local already lives there). The rerank preview branch keeps its copy in
nemotron.kit.uv_local; once this lands, that preview can migrate with an import-path change.

Test plan

  • nemotron embed finetune -c default on a single-GPU host (torchrun degrades to 1 worker).
  • nemotron embed finetune -c default on a multi-GPU host, confirm all GPUs utilized.
  • nemotron embed {eval,export,prep,sdg} still run locally (regression check for the migration).
  • nemotron embed export ... export_to_trt=true still picks up the TensorRT extra.
  • Remote/Slurm submission for embed finetune uses the torchrun launcher.

…ecution

Add execute_uv_local alongside execute_local. When torch is already
importable (e.g., inside an NVIDIA container), the helper creates a
venv with --system-site-packages and excludes torch from UV
resolution. This avoids the CUDA version mismatch where UV's
torch-backend=auto detects the kernel driver's CUDA version (via
nvidia-smi) but the container's libcuda.so is a different version.

When torch is NOT importable (bare machine), the helper falls back to
uv run --with torch with UV_TORCH_BACKEND=auto.

Move _write_temp_pyproject into nemo_runspec._pyproject so both the
new helper and nemotron.kit.run_uv (the remote/Slurm wrapper) share
one implementation.

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
Replace the duplicated inline _execute_uv_local bodies in embed
{eval,export,finetune,prep,sdg}.py with calls to the shared
nemo_runspec.execution.execute_uv_local helper. No behavior change
on the bare-machine path; the commands gain container-torch /
CUDA-mismatch handling automatically when torch is pre-installed.

export.py passes the tensorrt extra via the new extras= kwarg instead
of assembling --extra on the command line directly.
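
Illustratively, the per-stage change has this shape (the "before" body
is reconstructed from the description and the helper's signature is
assumed, so treat this as a sketch rather than the literal diff):

  import subprocess

  from nemo_runspec.execution import execute_uv_local

  args = ["export_to_trt=true"]

  # Before: each stage assembled --extra on the UV command line itself.
  subprocess.run(["uv", "run", "--extra", "tensorrt", "export.py", *args], check=True)

  # After: one shared helper call; the extra is passed structurally.
  execute_uv_local("export.py", args, extras=["tensorrt"])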

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
Use torch.distributed.run with --nproc_per_node=gpu so training
automatically uses all available GPUs (works correctly with 1 GPU too).

The local path goes through execute_uv_local's pre_script_args hook;
the remote path is selected by the PEP 723 launch = "torchrun" header.
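
For intuition, torchrun's gpu value sizes the worker group from the
visible CUDA devices, roughly equivalent to:

  import torch

  # --nproc_per_node=gpu launches one worker per visible device, so a
  # single-GPU host degrades to one worker and an 8-GPU host gets eight.
  nproc_per_node = torch.cuda.device_count()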

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
…ecution

Add execute_uv_local alongside execute_local. When torch is already
importable (e.g., inside an NVIDIA container), the helper creates a
venv with --system-site-packages and excludes torch from UV
resolution. This avoids the CUDA version mismatch where UV's
torch-backend detects the kernel driver's CUDA version but the
container's libcuda.so is a different version.
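
A hedged sketch of the container branch (helper name and plumbing are
illustrative, not the exact implementation):

  import importlib.util
  import subprocess

  def torch_preinstalled() -> bool:
      # True inside e.g. an NVIDIA container that ships torch preinstalled.
      return importlib.util.find_spec("torch") is not None

  if torch_preinstalled():
      # Reuse the system torch: the venv sees system site-packages, and
      # torch is dropped from the stage's requirement set so UV never
      # resolves a wheel built against a different CUDA version.
      subprocess.run(["uv", "venv", "--system-site-packages"], check=True)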

On bare machines, runs `uv run --project <stage>` against the stage's
pyproject + uv.lock with no extra `--with torch`. If the stage declares
mutually exclusive cuXXX optional-dependencies (the standard UV
multi-CUDA pattern from
https://docs.astral.sh/uv/guides/integration/pytorch/), auto-detects
the host's NVIDIA driver and passes `--extra cuXXX` so UV picks a
torch wheel matching the driver.

The driver detection logic and the driver→cuXXX table are ported from
astral-sh/uv (MIT/Apache-2.0):
  - crates/uv-torch/src/accelerator.rs (detection order: env override,
    /sys/module/nvidia/version, /proc/driver/nvidia/version, nvidia-smi)
  - crates/uv-torch/src/backend.rs (LINUX_CUDA_DRIVERS table)
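
A condensed Python port of that detection order (the env-var name is
hypothetical and the driver→cuXXX table is elided; see accelerator.rs
for the real logic):

  import os
  import re
  import subprocess
  from pathlib import Path

  def nvidia_driver_version() -> str | None:
      # 1. explicit override (env-var name here is hypothetical)
      if override := os.environ.get("NVIDIA_DRIVER_VERSION"):
          return override
      # 2. /sys/module/nvidia/version
      sys_file = Path("/sys/module/nvidia/version")
      if sys_file.exists():
          return sys_file.read_text().strip()
      # 3. /proc/driver/nvidia/version
      proc_file = Path("/proc/driver/nvidia/version")
      if proc_file.exists():
          match = re.search(r"Kernel Module\s+(\S+)", proc_file.read_text())
          if match:
              return match.group(1)
      # 4. nvidia-smi as a last resort
      try:
          out = subprocess.check_output(
              ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
              text=True,
          )
          return out.splitlines()[0].strip()
      except (OSError, subprocess.CalledProcessError):
          return None

  # The returned version is then matched against the ported
  # LINUX_CUDA_DRIVERS table to pick an --extra cuXXX, if one is declared.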

UV_TORCH_BACKEND=auto is intentionally NOT set — it is honored by
`uv pip`/`uv add`/`uv sync`, not by `uv run --with`/`uv run --project`,
so it would be a no-op here (per Steve Han's investigation).

Move _write_temp_pyproject into nemo_runspec._pyproject so both the
new helper and nemotron.kit.run_uv (the remote/Slurm wrapper) share
one implementation.

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
Add a [tool.uv.sources] override in each embed stage's pyproject.toml
mapping torch to the pytorch-cu129 index on Linux. cu129 wheels are
forward-compatible with CUDA 13.x drivers (newer drivers run older
toolkit binaries) and match the existing torch source declared by
nemo-automodel (commit ecd7cb4), so the resolver doesn't fall through
to the latest PyPI default (currently cu130), which fails on hosts
with a CUDA 12.9 driver.
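
The stanza in question looks roughly like this (the index name follows
nemo-automodel's pytorch-cu129 naming; the exact marker and URL are
assumptions based on UV's documented PyTorch pattern):

  [tool.uv.sources]
  torch = [
    { index = "pytorch-cu129", marker = "sys_platform == 'linux'" },
  ]

  [[tool.uv.index]]
  name = "pytorch-cu129"
  url = "https://download.pytorch.org/whl/cu129"
  explicit = true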

Multi-extra (cu129/cu130) was attempted first but UV reported
"conflicting indexes for package torch in all marker environments"
because nemo-automodel pins Linux to pytorch-cu129 unconditionally and
that mapping is honored even when nemo-automodel is a transitive
git-installed dependency. Aligning with nemo-automodel's pattern is the
simplest path that resolves cleanly. When nemo-automodel adopts
multi-extra, this can be revisited.

The CLI's CUDA-extra auto-detection (in
nemo_runspec.execution._pick_cuda_extra) remains in place but is inert
for these stages because no cu* extras are declared.

NOTE: each stage's uv.lock needs to be regenerated:

  for s in stage0_sdg stage1_data_prep stage2_finetune stage3_eval stage4_export; do
    uv lock --project src/nemotron/recipes/embed/$s
  done

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
@oliverholworthy force-pushed the oholworthy/embed-finetune-multi-gpu branch from c9f1fc8 to 45639dc on April 28, 2026 14:26