feat(embed): multi-GPU finetune via shared nemo_runspec.execute_uv_local #165
Draft
oliverholworthy wants to merge 7 commits into main from
Conversation
…ecution

Add execute_uv_local alongside execute_local. When torch is already importable (e.g., inside an NVIDIA container), it creates a venv with --system-site-packages and excludes torch from UV resolution. This avoids the CUDA version mismatch where UV's torch-backend=auto detects the kernel driver's CUDA version (via nvidia-smi) but the container's libcuda.so is a different version. When torch is NOT importable (bare machine), it falls back to `uv run --with torch` with UV_TORCH_BACKEND=auto.

Move _write_temp_pyproject into nemo_runspec._pyproject so both the new helper and nemotron.kit.run_uv (the remote/Slurm wrapper) share one implementation.

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
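A minimal sketch of the branch this commit describes, under a simplified entry point; the real `execute_uv_local` also writes the stage's temp pyproject and installs its remaining dependencies, which this sketch omits:

```python
# Illustrative only: mirrors the container-vs-bare-machine decision
# described above, not the actual nemo_runspec.execution.execute_uv_local
# (whose signature is not shown in this PR text).
import importlib.util
import os
import subprocess


def run_stage_script(script: str, venv_dir: str = ".stage-venv") -> None:
    if importlib.util.find_spec("torch") is not None:
        # Container path: torch is already importable. Build a venv that can
        # see the system site-packages so the container's torch (built against
        # its libcuda.so) is reused, keeping torch out of UV's resolution.
        subprocess.run(["uv", "venv", venv_dir, "--system-site-packages"], check=True)
        subprocess.run([os.path.join(venv_dir, "bin", "python"), script], check=True)
    else:
        # Bare-machine path: let UV supply torch and pick a CUDA backend
        # matching the host driver.
        env = dict(os.environ, UV_TORCH_BACKEND="auto")
        subprocess.run(["uv", "run", "--with", "torch", script], env=env, check=True)
```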
Replace the duplicated inline _execute_uv_local bodies in embed
{eval,export,finetune,prep,sdg}.py with calls to the shared
nemo_runspec.execution.execute_uv_local helper. No behavior change
on the bare-machine path; gains container-torch / CUDA-mismatch
handling automatically when torch is pre-installed.
export.py passes the tensorrt extra via the new extras= kwarg instead
of assembling --extra on the command line directly.
Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
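At a call site, the migration described in this commit looks roughly like the following; the keyword names are assumptions based on the commit text (an `extras=` kwarg rather than a hand-built `--extra` argument):

```python
# Hypothetical call-site sketch of the migration; the helper's exact
# signature is not shown in this PR text.
from nemo_runspec.execution import execute_uv_local

# Before: each stage carried its own inline _execute_uv_local that built
# the `uv run` command by hand. After: one shared call per stage, e.g. in
# export.py the TensorRT extra becomes a keyword argument:
execute_uv_local("export.py", extras=["tensorrt"])
```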
Use torch.distributed.run with --nproc_per_node=gpu so training automatically uses all available GPUs (works correctly with 1 GPU too). The local path goes through execute_uv_local's pre_script_args hook; the remote path is selected by the PEP 723 launch = "torchrun" header.

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
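A sketch of the finetune invocation this implies; `pre_script_args` is the hook named in the commit, but its semantics are assumed here (args inserted between `python` and the script):

```python
# Hypothetical: turns `python finetune.py` into
#   python -m torch.distributed.run --nproc_per_node=gpu finetune.py
from nemo_runspec.execution import execute_uv_local

execute_uv_local(
    "finetune.py",
    # --nproc_per_node=gpu spawns one worker per visible GPU, so the same
    # invocation works on a 1-GPU host (one worker) and an 8-GPU host.
    pre_script_args=["-m", "torch.distributed.run", "--nproc_per_node=gpu"],
)
```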
…ecution

Add execute_uv_local alongside execute_local. When torch is already importable (e.g., inside an NVIDIA container), it creates a venv with --system-site-packages and excludes torch from UV resolution. This avoids the CUDA version mismatch where UV's torch-backend detects the kernel driver's CUDA version but the container's libcuda.so is a different version.

On bare machines, it runs `uv run --project <stage>` against the stage's pyproject + uv.lock with no extra `--with torch`. If the stage declares mutually-exclusive cuXXX optional-dependencies (the standard UV multi-CUDA pattern from https://docs.astral.sh/uv/guides/integration/pytorch/), it auto-detects the host's NVIDIA driver and passes `--extra cuXXX` so UV picks a torch wheel matching the driver.

The driver detection logic and the driver→cuXXX table are ported from astral-sh/uv (MIT/Apache-2.0):
- crates/uv-torch/src/accelerator.rs (detection order: env override, /sys/module/nvidia/version, /proc/driver/nvidia/version, nvidia-smi)
- crates/uv-torch/src/backend.rs (LINUX_CUDA_DRIVERS table)

UV_TORCH_BACKEND=auto is intentionally NOT set: it is honored by `uv pip`/`uv add`/`uv sync`, not by `uv run --with`/`uv run --project`, so it would be a no-op here (per Steve Han's investigation).

Move _write_temp_pyproject into nemo_runspec._pyproject so both the new helper and nemotron.kit.run_uv (the remote/Slurm wrapper) share one implementation.

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
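The detection order ported from uv reads roughly as below. This is an illustrative Python rendering, not the actual `_pick_cuda_extra`; the env-var name and the cu129/cu130 cut-off are placeholders for uv's full LINUX_CUDA_DRIVERS table:

```python
# Illustrative rendering of the detection order from uv's accelerator.rs.
import os
import re
import subprocess


def _detect_nvidia_driver() -> tuple[int, ...] | None:
    """Return the NVIDIA driver version, trying each source in uv's order."""
    # 1. Environment override (this variable name is hypothetical).
    if v := os.environ.get("NVIDIA_DRIVER_VERSION_OVERRIDE"):
        return tuple(int(p) for p in v.split("."))
    # 2. /sys/module/nvidia/version holds the bare version string.
    try:
        with open("/sys/module/nvidia/version") as f:
            return tuple(int(p) for p in f.read().strip().split("."))
    except (OSError, ValueError):
        pass
    # 3. /proc/driver/nvidia/version embeds the version in free text.
    try:
        with open("/proc/driver/nvidia/version") as f:
            if m := re.search(r"\b(\d{3}(?:\.\d+)+)\b", f.read()):
                return tuple(int(p) for p in m.group(1).split("."))
    except OSError:
        pass
    # 4. Fall back to nvidia-smi.
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        if out:
            return tuple(int(p) for p in out.splitlines()[0].split("."))
    except (OSError, ValueError, subprocess.CalledProcessError):
        pass
    return None


def pick_cuda_extra(declared: set[str]) -> str | None:
    """Map the detected driver to a declared cuXXX extra (placeholder logic)."""
    driver = _detect_nvidia_driver()
    if driver is None:
        return None
    # Placeholder threshold, not uv's real table: assume R580+ drivers
    # can run CUDA 13 wheels, older drivers get cu129.
    wanted = "cu130" if driver >= (580,) else "cu129"
    return wanted if wanted in declared else None
```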
Add a [tool.uv.sources] override in each embed stage's pyproject.toml
mapping torch to the pytorch-cu129 index on Linux. cu129 wheels are
forward-compatible with cu13.x drivers (newer drivers run older toolkit
binaries) and match the existing torch source declared by nemo-automodel
(commit ecd7cb4), so the resolver doesn't fall through to the latest
PyPI default (currently cu130) which fails on hosts with a CUDA 12.9
driver.
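The override follows UV's documented PyTorch-index pattern; a sketch of the per-stage pyproject.toml block, where the index name and marker mirror the commit text but the exact block in this PR may differ:

```toml
# Sketch only: pin torch on Linux to the cu129 index so the resolver
# never falls through to the latest PyPI default (currently cu130).
[tool.uv.sources]
torch = [
  { index = "pytorch-cu129", marker = "sys_platform == 'linux'" },
]

[[tool.uv.index]]
name = "pytorch-cu129"
url = "https://download.pytorch.org/whl/cu129"
explicit = true
```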
Multi-extra (cu129/cu130) was attempted first but UV reported
"conflicting indexes for package torch in all marker environments"
because nemo-automodel pins Linux to pytorch-cu129 unconditionally and
that mapping is honored even when nemo-automodel is a transitive
git-installed dependency. Aligning with nemo-automodel's pattern is the
simplest path that resolves cleanly. When nemo-automodel adopts
multi-extra, this can be revisited.
The CLI's CUDA-extra auto-detection (in
nemo_runspec.execution._pick_cuda_extra) remains in place but is inert
for these stages because no cu* extras are declared.
NOTE: each stage's uv.lock needs to be regenerated:
for s in stage0_sdg stage1_data_prep stage2_finetune stage3_eval stage4_export; do
uv lock --project src/nemotron/recipes/embed/$s
done
Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
c9f1fc8 to 45639dc
Summary
- Adds `nemo_runspec.execution.execute_uv_local` as a sibling to `execute_local`, with container-torch / CUDA-mismatch handling and `pre_script_args`/`extras`/`extra_with` hooks.
- Moves the embed stages (`eval`, `export`, `finetune`, `prep`, `sdg`) off their duplicated inline UV-subprocess bodies onto the shared helper (−75 net LoC).
- Switches finetune to `torch.distributed.run --nproc_per_node=gpu`, and sets the PEP 723 `launch = "torchrun"` header so the remote/Slurm path also runs distributed.
Rationale for the helper landing in `nemo_runspec`
Local-execution helpers should live together (`execute_local` already does). The rerank preview branch keeps its copy in `nemotron.kit.uv_local`; once this lands, that preview can migrate with an import-path change.
Test plan
- `nemotron embed finetune -c default` on a single-GPU host (torchrun degrades to 1 worker).
- `nemotron embed finetune -c default` on a multi-GPU host; confirm all GPUs are utilized.
- `nemotron embed {eval,export,prep,sdg}` still run locally (regression check for the migration).
- `nemotron embed export ... export_to_trt=true` still picks up the TensorRT extra.