
Conversation

@elezar
Member

@elezar elezar commented Nov 28, 2025

Instead of requiring that all GPUs on a Tegra platform with NVML be iGPUs, we resolve it as a Tegra platform if at least one device is an iGPU.

Note that this changes the semantics of platform resolution and may cause issues on Tegra-based systems with dGPUs installed IF the iGPU is also enumerable by NVML.
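The semantic change can be sketched roughly as follows. The function and device names are illustrative rather than the upstream API, and matching on a `(nvgpu)` marker to identify iGPUs is an assumption based on the review discussion, not the exact upstream check:

```go
package main

import (
	"fmt"
	"strings"
)

// isIntegratedGPUName is an illustrative stand-in for the real check;
// the "(nvgpu)" marker is an assumption based on the review discussion.
func isIntegratedGPUName(name string) bool {
	return strings.Contains(name, "(nvgpu)")
}

// Old semantics: every enumerated device must be an iGPU.
func allDevicesAreIGPUs(names []string) bool {
	for _, name := range names {
		if !isIntegratedGPUName(name) {
			return false
		}
	}
	return len(names) > 0
}

// New semantics: a single iGPU is enough to resolve the platform as Tegra.
func anyDeviceIsAnIGPU(names []string) bool {
	for _, name := range names {
		if isIntegratedGPUName(name) {
			return true
		}
	}
	return false
}

func main() {
	// A Tegra system with a dGPU installed: only the relaxed "any" check
	// resolves it as Tegra.
	devices := []string{"Orin (nvgpu)", "NVIDIA RTX A6000"}
	fmt.Println(allDevicesAreIGPUs(devices), anyDeviceIsAnIGPU(devices))
}
```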

@elezar elezar self-assigned this Nov 28, 2025
@elezar
Member Author

elezar commented Nov 28, 2025

This is required for the change to NVIDIA/nvidia-container-toolkit#1461

@klueska
Contributor

klueska commented Dec 1, 2025

The change itself seems straightforward. There is a change in semantics though -- now it checks if any device is an iGPU rather than ensuring that all devices are iGPUs. So long as that is the intent, then LGTM.

klueska
klueska previously approved these changes Dec 1, 2025
@elezar
Member Author

elezar commented Dec 1, 2025

> The change itself seems straightforward. There is a change in semantics though -- now it checks if any device is an iGPU rather than ensuring that all devices are iGPUs. So long as that is the intent, then LGTM.

Yes. This is the intent. There is a risk that we may end up detecting the wrong platform on older Tegra-based systems where dGPUs are present, but I think this is relatively low.

As a follow-up thought / question: What are your thoughts on having some kind of "override" for this detection logic that is implemented at this level so as to be useful to all upstream consumers? Something like an /etc/nvidia-container-toolkit/platform file that includes the platform?

@elezar
Member Author

elezar commented Dec 1, 2025

Before I merge this, one thought that I had is to check whether the NVML lib that we resolve looks like a dGPU NVML lib and to use this as a signal that we're dealing with a dGPU (or a Tegra platform). The basic logic would be to check whether the actual library name has the DRIVER VERSION as a suffix, or is just .1.

What are your thoughts on this?
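A rough sketch of that heuristic, assuming the resolved library name is available as a plain string; the example sonames and the version pattern are illustrative, not taken from a real system:

```go
package main

import (
	"fmt"
	"regexp"
)

// driverVersionSuffix matches sonames that end in a multi-component
// driver-version suffix (e.g. ".so.550.54.15"), as opposed to the bare
// ".so.1" compatibility name. The exact pattern is an assumption for
// illustration; real driver packaging varies by platform.
var driverVersionSuffix = regexp.MustCompile(`\.so\.\d+\.\d+(\.\d+)?$`)

// looksLikeDGPUNVMLLib applies the heuristic described above: treat a
// driver-versioned soname as a signal that a dGPU NVML library is
// installed. This is a brittle heuristic, not a stable API.
func looksLikeDGPUNVMLLib(resolvedName string) bool {
	return driverVersionSuffix.MatchString(resolvedName)
}

func main() {
	fmt.Println(looksLikeDGPUNVMLLib("libnvidia-ml.so.550.54.15")) // driver-versioned suffix
	fmt.Println(looksLikeDGPUNVMLLib("libnvidia-ml.so.1"))         // just .1
}
```

As the discussion below notes, this kind of name-based detection is brittle, which is part of why an explicit override was preferred.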

@klueska
Contributor

klueska commented Dec 1, 2025

> The change itself seems straightforward. There is a change in semantics though -- now it checks if any device is an iGPU rather than ensuring that all devices are iGPUs. So long as that is the intent, then LGTM.

> Yes. This is the intent. There is a risk that we may end up detecting the wrong platform on older Tegra-based systems where dGPUs are present, but I think this is relatively low.

> As a follow-up thought / question: What are your thoughts on having some kind of "override" for this detection logic that is implemented at this level so as to be useful to all upstream consumers? Something like an /etc/nvidia-container-toolkit/platform file that includes the platform?

This seems reasonable to me. If someone knows what they are doing, they can make sure their platform drops a file in here to inform the toolkit which path to take. What values would you envision this honoring to start with?

@klueska
Contributor

klueska commented Dec 1, 2025

> Before I merge this, one thought that I had is to check whether the NVML lib that we resolve looks like a dGPU NVML lib and to use this as a signal that we're dealing with a dGPU (or a Tegra platform). The basic logic would be to check whether the actual library name has the DRIVER VERSION as a suffix, or is just .1.

> What are your thoughts on this?

I don't know enough about the difference between how the driver shows up on one system vs. the other to have a strong opinion. On the surface, this "feels" very brittle though.

@elezar
Member Author

elezar commented Dec 1, 2025

> Before I merge this, one thought that I had is to check whether the NVML lib that we resolve looks like a dGPU NVML lib and to use this as a signal that we're dealing with a dGPU (or a Tegra platform). The basic logic would be to check whether the actual library name has the DRIVER VERSION as a suffix, or is just .1.
> What are your thoughts on this?

> I don't know enough about the difference between how the driver shows up on one system vs. the other to have a strong opinion. On the surface, this "feels" very brittle though.

Yes, it is brittle. It's also based on heuristics and not on a stable API. I think having an "override" functionality as discussed in the other comment would give us more flexibility without making the logic here more complex.

@elezar
Member Author

elezar commented Dec 1, 2025

> The change itself seems straightforward. There is a change in semantics though -- now it checks if any device is an iGPU rather than ensuring that all devices are iGPUs. So long as that is the intent, then LGTM.

> Yes. This is the intent. There is a risk that we may end up detecting the wrong platform on older Tegra-based systems where dGPUs are present, but I think this is relatively low.
> As a follow-up thought / question: What are your thoughts on having some kind of "override" for this detection logic that is implemented at this level so as to be useful to all upstream consumers? Something like an /etc/nvidia-container-toolkit/platform file that includes the platform?

> This seems reasonable to me. If someone knows what they are doing, they can make sure their platform drops a file in here to inform the toolkit which path to take. What values would you envision this honoring to start with?

My initial thought would be that we have a file:

```
# The following allows the platform detection used by the NVIDIA Container Toolkit and
# other applications to be overridden. This means that in cases where the heuristics
# employed are insufficient, a platform owner / user has the option to force the detected platform.
# Valid values include:
# * 'nvml': NVML-based systems with only discrete GPUs
# * 'tegra': A Tegra-based system with one or more integrated GPUs
# * 'wsl': A WSL2-based system
tegra
```

We would ignore lines starting with # or empty lines.
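A minimal sketch of such a parser, assuming the file format and the three values proposed above; none of this exists upstream yet, and all names are hypothetical:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parsePlatformOverride reads the proposed override file contents and
// returns the forced platform, skipping comment and blank lines. The
// accepted values mirror the ones suggested above ("nvml", "tegra",
// "wsl"); this is a sketch of the idea, not a merged implementation.
func parsePlatformOverride(contents string) (string, error) {
	scanner := bufio.NewScanner(strings.NewReader(contents))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		// Ignore lines starting with '#' and empty lines.
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		switch line {
		case "nvml", "tegra", "wsl":
			return line, nil
		default:
			return "", fmt.Errorf("invalid platform override: %q", line)
		}
	}
	return "", nil // no override present; fall back to detection
}

func main() {
	platform, err := parsePlatformOverride("# force Tegra resolution\n\ntegra\n")
	fmt.Println(platform, err)
}
```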

I can start a follow-up PR where we can discuss this in more detail.

@klueska
Contributor

klueska commented Dec 1, 2025

> I can start a follow-up PR where we can discuss this in more detail.

Works for me.

```diff
 }
 
-// HasOnlyIntegratedGPUs checks whether all GPUs are iGPUs that use NVML.
+// HasAnIntegratedGPU checks whether any GPU is an iGPU that uses NVML.
```

What does "use NVML" mean here?

I ask because, as per the name, I'd expect "checks whether any GPU is an integrated GPU".

Member Author

The comment as written is a little ambiguous. I have reworked it to (hopefully) be clearer.

The point is not that the GPUs "use NVML", but that NVML is present on the system and reports the integrated GPUs (e.g. in the nvidia-smi -L output). This wasn't always the case and changed with the release of Orin-based systems.

```diff
-if !isIntegratedGPUName(name) {
-	return false, fmt.Sprintf("device %q does not use nvgpu module", name)
+if !IsIntegratedGPUName(name) {
+	continue
```

If you want to 🤷 :

```go
if IsIntegratedGPUName(name) {
    return ...
}
```

Member Author

You're right. I was still in the "I need to check all devices" mindset. The quick return makes the code simpler to reason about.
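With the quick return, the helper reduces to something like the following. The (bool, string) shape mirrors the diff above, and IsIntegratedGPUName matching a "(nvgpu)" marker is an assumption based on the removed error message, not the exact upstream implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// IsIntegratedGPUName is assumed to look for the nvgpu marker that the
// removed error message refers to; the real check may differ.
func IsIntegratedGPUName(name string) bool {
	return strings.Contains(name, "(nvgpu)")
}

// hasAnIntegratedGPU returns as soon as a single iGPU is found, which is
// all the relaxed "any device" semantics requires.
func hasAnIntegratedGPU(names []string) (bool, string) {
	for _, name := range names {
		if IsIntegratedGPUName(name) {
			return true, fmt.Sprintf("device %q uses the nvgpu module", name)
		}
	}
	return false, "no integrated GPU found"
}

func main() {
	ok, reason := hasAnIntegratedGPU([]string{"NVIDIA A100", "Orin (nvgpu)"})
	fmt.Println(ok, reason)
}
```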

Instead of requiring that all GPUs on a Tegra platform with
NVML be iGPUs, we resolve it as a tegra platform if at least one
device is an iGPU.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
Collaborator

@ArangoGutierrez ArangoGutierrez left a comment

LGTM

@elezar elezar merged commit d0f42ba into NVIDIA:main Dec 2, 2025
4 checks passed
@elezar elezar deleted the relax-tegra-resolution branch December 2, 2025 13:54
@elezar
Member Author

elezar commented Dec 2, 2025

> Before I merge this, one thought that I had is to check whether the NVML lib that we resolve looks like a dGPU NVML lib and to use this as a signal that we're dealing with a dGPU (or a Tegra platform). The basic logic would be to check whether the actual library name has the DRIVER VERSION as a suffix, or is just .1.
> What are your thoughts on this?

> I don't know enough about the difference between how the driver shows up on one system vs. the other to have a strong opinion. On the surface, this "feels" very brittle though.

Created #79 to continue the discussion on this.

Went ahead and merged this PR.
