@elezar
Member

@elezar elezar commented Dec 1, 2025

This change adds support for reading the detected
platform (if set to auto) from a platform override
file. This allows system administrators to explicitly
select a detected platform for tooling such as the
nvidia-container-toolkit, the k8s-device-plugin, and
k8s-dra-driver-gpu.

For example, by creating the following file:

$ cat /etc/nvidia-container-toolkit/platform-override
tegra

A tegra platform will ALWAYS be detected.

@elezar elezar marked this pull request as draft December 1, 2025 15:22
@elezar
Member Author

elezar commented Dec 1, 2025

This also includes the commits from #78

@elezar elezar force-pushed the add-platform-override branch 2 times, most recently from 6ff48ea to d7bc09a Compare December 1, 2025 15:28
@rajatchopra
Contributor

Why do we need an override? What's the use case? We have enough tooling to determine the platform automatically, don't we?

This change adds support for reading the detected
platform (if set to `auto`) from a platform override
file. This allows system administrators to explicitly
select a detected platform for tooling such as the
nvidia-container-toolkit, the k8s-device-plugin, and
k8s-dra-driver-gpu.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
@elezar elezar force-pushed the add-platform-override branch from d7bc09a to db3c350 Compare December 2, 2025 14:03
@elezar
Copy link
Member Author

elezar commented Dec 2, 2025

Why do we need an override? What's the use case? We have enough tooling to determine the platform automatically, don't we?

We don't actually have enough information in certain cases. The primary example is Tegra-based systems, where we rely on heuristics that may become out of date. In those cases we want system administrators to be able to override the detected value, so that a new release is not required just to address this.

  PlatformResolver: &platformResolver{
      logger: o.logger,
-     platform: o.platform,
+     platform: o.normalizePlatform(),
Collaborator

:)

Collaborator

@ArangoGutierrez ArangoGutierrez left a comment

This Draft - LGTM

@elezar elezar marked this pull request as ready for review December 3, 2025 08:20
@elezar elezar requested review from jgehrcke and klueska December 3, 2025 08:21
//
// This function can be overridden for testing purposes.
var getPlaformOverride = func() (string, string) {
platformOverrideFile, err := os.Open("/etc/nvidia-container-toolkit/platform-override")
Contributor

Should this be pulled out into a constant?
Independently, it feels a bit weird to bake in a path referencing the nvidia-container-toolkit inside nvlib.

Contributor

@klueska klueska Dec 3, 2025

Can we instead just have New() take the override as a string and leave the logic for including this file / how to parse it to the toolkit (possibly including helpers in nvlib for how to do the parsing, but not actually doing any parsing by default)?

Contributor

Or, instead of a raw string, it could take the path to the override file.

Collaborator

I agree, the path to the file should come from the entity importing go-nvlib.

Member Author

The point of having this defined directly here (or at least the default) is that we want consistency across all our components that use this platform detection logic. These are currently:

  • The NVIDIA Container Toolkit (for the nvidia-container-runtime and nvidia-ctk cdi generate)
  • The GPU Device Plugin
  • GPU Feature Discovery

In the longer term, I could also see the DRA driver for GPUs pulling this in if we wanted to support Tegra-based systems.
