fix(embed): pin stage3 eval torch to CUDA 12.9 build #164

Open
shan-nvidia wants to merge 1 commit into main from fix/embed-eval-torch-cu129

shan-nvidia (Contributor) commented Apr 24, 2026

Summary

Without an explicit pin, `uv` resolves `torch==2.11.0` from PyPI for the `stage3_eval` recipe, and that build requires CUDA 13 (`cuda-toolkit==13.0.2`). On systems whose NVIDIA driver only supports CUDA 12.x, `torch.cuda.is_available()` returns `False` and BEIR silently falls back to CPU, which makes eval effectively unusable on multi-GPU hosts. The only visible symptoms are the `NVIDIA driver ... is too old (found version 12090)` warning and a 0/4 progress bar that crawls.

Stages 1 and 2 avoid this because their `nemo-automodel` dependency transitively pins `torch==2.10.0+cu129` via `[tool.uv.sources]` and the `pytorch-cu129` index. Stage 3 doesn't depend on `nemo-automodel`, so this PR adds the same configuration directly:

  • Add `torch<=2.10.0` to `dependencies`
  • Mirror `nemo-automodel`'s `[tool.uv.sources]` torch selector and the `pytorch-cpu` / `pytorch-cu129` / `pypi` index blocks

The PR also includes the regenerated `uv.lock`, so Linux resolves to `torch==2.10.0+cu129`.

Test plan

  • `rm -rf src/nemotron/recipes/embed/stage3_eval/.venv`
  • `nemotron embed eval -c default <finetuned/eval/output paths>` runs without the `NVIDIA driver ... is too old` warning
  • `nvidia-smi` during eval shows GPU memory use and utilization on all visible devices
  • Eval completes in minutes rather than hours on 8× A100
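
As an additional sanity check, the `found version 12090` value in the warning can be decoded by hand. The sketch below assumes CUDA's usual integer version encoding (1000 × major + 10 × minor) and compares it with the CUDA version a torch build targets. Both function names are hypothetical. In a real environment the build string would come from `torch.version.cuda`, which is not exercised here so the example runs without torch installed.

```python
def decode_cuda_version(encoded: int) -> tuple[int, int]:
    """Decode CUDA's integer version format (1000 * major + 10 * minor)."""
    return encoded // 1000, (encoded % 1000) // 10


def driver_supports_build(driver_encoded: int, build_cuda: str) -> bool:
    """Conservative check: the driver's CUDA version must be >= the build's.

    This ignores CUDA minor-version compatibility, so it may report False
    for some combinations that would work in practice.
    """
    build_major, build_minor = (int(part) for part in build_cuda.split(".")[:2])
    return decode_cuda_version(driver_encoded) >= (build_major, build_minor)


# 12090 is the value reported in the warning quoted in the Summary.
print(decode_cuda_version(12090))            # (12, 9)
print(driver_supports_build(12090, "12.9"))  # True: the cu129 build fits
print(driver_supports_build(12090, "13.0"))  # False: a CUDA 13 build does not
```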

Without an explicit pin, `uv` resolves `torch==2.11.0` from PyPI for the
stage3 eval recipe, which requires CUDA 13 (`cuda-toolkit==13.0.2`). On
systems with NVIDIA drivers that only support CUDA 12.x, `torch.cuda.is_available()`
returns False and BEIR silently falls back to CPU, making eval effectively
unusable on multi-GPU hosts.

This mirrors the pattern nemo-automodel uses for other stages: pin
`torch<=2.10.0` and point `[tool.uv.sources]` at the `pytorch-cu129` index
so Linux gets `torch==2.10.0+cu129`. Stages 1 and 2 already get this via
their nemo-automodel dependency; stage3 doesn't use nemo-automodel, so the
configuration is added explicitly here.

Signed-off-by: Steve Han <sthan@nvidia.com>
Made-with: Cursor