fix(embed): pin stage3 eval torch to CUDA 12.9 build by shan-nvidia · Pull Request #164 · NVIDIA-NeMo/Nemotron

shan-nvidia · 2026-04-24T16:01:33Z

Summary

Without an explicit pin, uv resolves torch==2.11.0 from PyPI for the stage3_eval recipe, and that build requires CUDA 13 (cuda-toolkit==13.0.2). On systems whose NVIDIA driver only supports CUDA 12.x, torch.cuda.is_available() returns False and BEIR silently falls back to CPU, which makes eval effectively unusable on multi-GPU hosts. The only visible symptoms are the "NVIDIA driver ... is too old (found version 12090)" warning and a 0/4 progress bar that crawls.
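
The `12090` in that warning is the CUDA version the driver supports, encoded as `1000 * major + 10 * minor`, i.e. CUDA 12.9. A minimal Python sketch of the mismatch follows. It assumes a simplified rule (a wheel built for CUDA major version N needs a driver that supports at least major version N); the real check happens inside the CUDA runtime, and both helper names are hypothetical.

```python
def decode_cuda_version(encoded: int) -> tuple[int, int]:
    """Decode CUDA's integer version encoding: 1000 * major + 10 * minor."""
    major, rest = divmod(encoded, 1000)
    return major, rest // 10


def driver_can_run(driver_encoded: int, build_cuda: str) -> bool:
    """Simplified assumption: compare CUDA major versions only."""
    driver_major, _ = decode_cuda_version(driver_encoded)
    build_major = int(build_cuda.split(".")[0])
    return driver_major >= build_major


print(decode_cuda_version(12090))     # (12, 9)
print(driver_can_run(12090, "13.0"))  # False: torch 2.11.0 from PyPI
print(driver_can_run(12090, "12.9"))  # True: torch 2.10.0+cu129
```

Under this rule a driver reporting 12090 cannot serve a CUDA 13 build, which matches the CPU fallback described above.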

Stages 1 and 2 avoid this because their nemo-automodel dependency transitively pins torch==2.10.0+cu129 via [tool.uv.sources] + the pytorch-cu129 index. Stage 3 doesn't depend on nemo-automodel, so this PR adds the same configuration directly:

Add torch<=2.10.0 to dependencies
Mirror nemo-automodel's [tool.uv.sources] torch selector and the pytorch-cpu / pytorch-cu129 / pypi index blocks

Also includes the regenerated uv.lock so Linux resolves to torch 2.10.0+cu129.

Test plan

rm -rf src/nemotron/recipes/embed/stage3_eval/.venv
nemotron embed eval -c default <finetuned/eval/output paths> runs without the NVIDIA driver ... is too old warning
nvidia-smi during eval shows GPU memory and utilization across visible devices
Eval completes in minutes rather than hours on 8× A100

Without an explicit pin, `uv` resolves `torch==2.11.0` from PyPI for the stage3 eval recipe, which requires CUDA 13 (`cuda-toolkit==13.0.2`). On systems with NVIDIA drivers that only support CUDA 12.x, `torch.cuda.is_available()` returns False and BEIR silently falls back to CPU, making eval effectively unusable on multi-GPU hosts.

This mirrors the pattern nemo-automodel uses for other stages: pin `torch<=2.10.0` and point `[tool.uv.sources]` at the `pytorch-cu129` index so Linux gets `torch==2.10.0+cu129`. Stages 1 and 2 already get this via their nemo-automodel dependency; stage3 doesn't use nemo-automodel, so the configuration is added explicitly here.

Signed-off-by: Steve Han <sthan@nvidia.com>
Made-with: Cursor

shan-nvidia requested a review from oliverholworthy April 24, 2026 16:01

shan-nvidia wants to merge 1 commit into main from fix/embed-eval-torch-cu129