
Conversation

@elezar
Member

@elezar elezar commented Nov 28, 2025

Instead of requiring that all GPUs on a Tegra platform with NVML be iGPUs, we resolve it as a Tegra platform if at least one device is an iGPU.

Note that this changes the semantics of platform resolution and may cause issues on Tegra-based systems with dGPUs installed IF the iGPU is also enumerable by NVML.
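The semantic change can be sketched roughly as follows. The function and device names are illustrative rather than the upstream API, and matching on a `(nvgpu)` marker to identify iGPUs is an assumption based on the review discussion, not the exact upstream check:

```go
package main

import (
	"fmt"
	"strings"
)

// isIntegratedGPUName is an illustrative stand-in for the real check;
// the "(nvgpu)" marker is an assumption based on the review discussion.
func isIntegratedGPUName(name string) bool {
	return strings.Contains(name, "(nvgpu)")
}

// Old semantics: every enumerated device must be an iGPU.
func allDevicesAreIGPUs(names []string) bool {
	for _, name := range names {
		if !isIntegratedGPUName(name) {
			return false
		}
	}
	return len(names) > 0
}

// New semantics: a single iGPU is enough to resolve the platform as Tegra.
func anyDeviceIsAnIGPU(names []string) bool {
	for _, name := range names {
		if isIntegratedGPUName(name) {
			return true
		}
	}
	return false
}

func main() {
	// A Tegra system with a dGPU installed: only the relaxed "any" check
	// resolves it as Tegra.
	devices := []string{"Orin (nvgpu)", "NVIDIA RTX A6000"}
	fmt.Println(allDevicesAreIGPUs(devices), anyDeviceIsAnIGPU(devices))
}
```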

@elezar elezar self-assigned this Nov 28, 2025
@elezar
Member Author

elezar commented Nov 28, 2025

This is required for the change to NVIDIA/nvidia-container-toolkit#1461

@klueska
Contributor

klueska commented Dec 1, 2025

The change itself seems straightforward. There is a change in semantics though -- now it checks if any device is an iGPU rather than ensuring that all devices are iGPUs. So long as that is the intent, then LGTM.

klueska
klueska previously approved these changes Dec 1, 2025
@elezar
Member Author

elezar commented Dec 1, 2025

> The change itself seems straightforward. There is a change in semantics though -- now it checks if any device is an iGPU rather than ensuring that all devices are iGPUs. So long as that is the intent, then LGTM.

Yes. This is the intent. There is a risk that we may end up detecting the wrong platform on older Tegra-based systems where dGPUs are present, but I think this is relatively low.

As a follow-up thought / question: What are your thoughts on having some kind of "override" for this detection logic that is implemented at this level so as to be useful to all upstream consumers? Something like an /etc/nvidia-container-toolkit/platform file that includes the platform?

@elezar
Member Author

elezar commented Dec 1, 2025

Before I merge this, one thought that I had is to check whether the NVML lib that we resolve looks like a dGPU NVML lib and to use this as a signal that we're dealing with a dGPU (or a Tegra platform). The basic logic would be to check whether the actual library name has the DRIVER VERSION as a suffix, or is just .1.

What are your thoughts on this?
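A rough sketch of that heuristic, assuming the resolved library name is available as a plain string; the example sonames and the version pattern are illustrative, not taken from a real system:

```go
package main

import (
	"fmt"
	"regexp"
)

// driverVersionSuffix matches sonames that end in a multi-component
// driver-version suffix (e.g. ".so.550.54.15"), as opposed to the bare
// ".so.1" compatibility name. The exact pattern is an assumption for
// illustration; real driver packaging varies by platform.
var driverVersionSuffix = regexp.MustCompile(`\.so\.\d+\.\d+(\.\d+)?$`)

// looksLikeDGPUNVMLLib applies the heuristic described above: treat a
// driver-versioned soname as a signal that a dGPU NVML library is
// installed. This is a brittle heuristic, not a stable API.
func looksLikeDGPUNVMLLib(resolvedName string) bool {
	return driverVersionSuffix.MatchString(resolvedName)
}

func main() {
	fmt.Println(looksLikeDGPUNVMLLib("libnvidia-ml.so.550.54.15")) // driver-versioned suffix
	fmt.Println(looksLikeDGPUNVMLLib("libnvidia-ml.so.1"))         // just .1
}
```

As the discussion below notes, this kind of name-based detection is brittle, which is part of why an explicit override was preferred.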

@klueska
Contributor

klueska commented Dec 1, 2025

> The change itself seems straightforward. There is a change in semantics though -- now it checks if any device is an iGPU rather than ensuring that all devices are iGPUs. So long as that is the intent, then LGTM.

> Yes. This is the intent. There is a risk that we may end up detecting the wrong platform on older Tegra-based systems where dGPUs are present, but I think this is relatively low.

> As a follow-up thought / question: What are your thoughts on having some kind of "override" for this detection logic that is implemented at this level so as to be useful to all upstream consumers? Something like an /etc/nvidia-container-toolkit/platform file that includes the platform?

This seems reasonable to me. If someone knows what they are doing, they can make sure their platform drops a file in here to inform the toolkit which path to take. What values would you envision this honoring to start with?

@klueska
Contributor

klueska commented Dec 1, 2025

> Before I merge this, one thought that I had is to check whether the NVML lib that we resolve looks like a dGPU NVML lib and to use this as a signal that we're dealing with a dGPU (or a Tegra platform). The basic logic would be to check whether the actual library name has the DRIVER VERSION as a suffix, or is just .1.

> What are your thoughts on this?

I don't know enough about the difference between how the driver shows up on one system vs. the other to have a strong opinion. On the surface, this "feels" very brittle though.

@elezar
Member Author

elezar commented Dec 1, 2025

> Before I merge this, one thought that I had is to check whether the NVML lib that we resolve looks like a dGPU NVML lib and to use this as a signal that we're dealing with a dGPU (or a Tegra platform). The basic logic would be to check whether the actual library name has the DRIVER VERSION as a suffix, or is just .1.
> What are your thoughts on this?

> I don't know enough about the difference between how the driver shows up on one system vs. the other to have a strong opinion. On the surface, this "feels" very brittle though.

Yes, it is brittle. It's also based on heuristics and not on a stable API. I think having an "override" functionality as discussed in the other comment would give us more flexibility without making the logic here more complex.

@elezar
Member Author

elezar commented Dec 1, 2025

> The change itself seems straightforward. There is a change in semantics though -- now it checks if any device is an iGPU rather than ensuring that all devices are iGPUs. So long as that is the intent, then LGTM.

> Yes. This is the intent. There is a risk that we may end up detecting the wrong platform on older Tegra-based systems where dGPUs are present, but I think this is relatively low.
> As a follow-up thought / question: What are your thoughts on having some kind of "override" for this detection logic that is implemented at this level so as to be useful to all upstream consumers? Something like an /etc/nvidia-container-toolkit/platform file that includes the platform?

> This seems reasonable to me. If someone knows what they are doing, they can make sure their platform drops a file in here to inform the toolkit which path to take. What values would you envision this honoring to start with?

My initial thought would be that we have a file:

```
# The following allows the platform detection used by the NVIDIA Container Toolkit and
# other applications to be overridden. This means that in cases where the heuristics
# employed are insufficient, a platform owner / user has the option to force the detected platform.
# Valid values include:
# * 'nvml': NVML-based systems with only discrete GPUs
# * 'tegra': A Tegra-based system with one or more integrated GPUs
# * 'wsl': A WSL2-based system
tegra
```

We would ignore lines starting with # or empty lines.
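A minimal sketch of such a parser, assuming the file format and the three values proposed above; none of this exists upstream yet, and all names are hypothetical:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parsePlatformOverride reads the proposed override file contents and
// returns the forced platform, skipping comment and blank lines. The
// accepted values mirror the ones suggested above ("nvml", "tegra",
// "wsl"); this is a sketch of the idea, not a merged implementation.
func parsePlatformOverride(contents string) (string, error) {
	scanner := bufio.NewScanner(strings.NewReader(contents))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		// Ignore lines starting with '#' and empty lines.
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		switch line {
		case "nvml", "tegra", "wsl":
			return line, nil
		default:
			return "", fmt.Errorf("invalid platform override: %q", line)
		}
	}
	return "", nil // no override present; fall back to detection
}

func main() {
	platform, err := parsePlatformOverride("# force Tegra resolution\n\ntegra\n")
	fmt.Println(platform, err)
}
```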

I can start a follow-up PR where we can discuss this in more detail.

@klueska
Contributor

klueska commented Dec 1, 2025

> I can start a follow-up PR where we can discuss this in more detail.

Works for me.

```diff
 }
 
-// HasOnlyIntegratedGPUs checks whether all GPUs are iGPUs that use NVML.
+// HasAnIntegratedGPU checks whether any GPU is an iGPU that uses NVML.
```

What does "use NVML" mean here?

I ask because, as per the name, I'd expect "checks whether any GPU is an integrated GPU".

Member Author

The comment as written is a little ambiguous. I have reworked it to (hopefully) be clearer.

The point is not that the GPUs "use NVML", but that NVML is present on the system and reports the integrated GPUs (e.g. in the nvidia-smi -L output). This wasn't always the case and changed with the release of Orin-based systems.

```diff
-if !isIntegratedGPUName(name) {
-	return false, fmt.Sprintf("device %q does not use nvgpu module", name)
+if !IsIntegratedGPUName(name) {
+	continue
```

If you want to 🤷 :

```go
if IsIntegratedGPUName(name) {
    return ...
}
```

Member Author

You're right. I was still in the "I need to check all devices" mindset. The quick return makes the code simpler to reason about.
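With the quick return, the helper reduces to something like the following. The (bool, string) shape mirrors the diff above, and IsIntegratedGPUName matching a "(nvgpu)" marker is an assumption based on the removed error message, not the exact upstream implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// IsIntegratedGPUName is assumed to look for the nvgpu marker that the
// removed error message refers to; the real check may differ.
func IsIntegratedGPUName(name string) bool {
	return strings.Contains(name, "(nvgpu)")
}

// hasAnIntegratedGPU returns as soon as a single iGPU is found, which is
// all the relaxed "any device" semantics requires.
func hasAnIntegratedGPU(names []string) (bool, string) {
	for _, name := range names {
		if IsIntegratedGPUName(name) {
			return true, fmt.Sprintf("device %q uses the nvgpu module", name)
		}
	}
	return false, "no integrated GPU found"
}

func main() {
	ok, reason := hasAnIntegratedGPU([]string{"NVIDIA A100", "Orin (nvgpu)"})
	fmt.Println(ok, reason)
}
```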

Instead of requiring that all GPUs on a Tegra platform with
NVML be iGPUs, we resolve it as a tegra platform if at least one
device is an iGPU.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
Collaborator

@ArangoGutierrez ArangoGutierrez left a comment

LGTM

@elezar elezar merged commit d0f42ba into NVIDIA:main Dec 2, 2025
4 checks passed
@elezar elezar deleted the relax-tegra-resolution branch December 2, 2025 13:54
@elezar
Member Author

elezar commented Dec 2, 2025

> Before I merge this, one thought that I had is to check whether the NVML lib that we resolve looks like a dGPU NVML lib and to use this as a signal that we're dealing with a dGPU (or a Tegra platform). The basic logic would be to check whether the actual library name has the DRIVER VERSION as a suffix, or is just .1.
> What are your thoughts on this?

> I don't know enough about the difference between how the driver shows up on one system vs. the other to have a strong opinion. On the surface, this "feels" very brittle though.

Created #79 to continue the discussion on this.

Went ahead and merged this PR.
