Member

@elezar elezar commented Dec 10, 2025

When GFD enumerates devices, an error on a single device causes GFD to fail outright, so the remaining devices are never processed. This manifests as errors similar to:

E1105 17:45:39.017442       1 main.go:110] error creating labeler: error getting devices: error getting device handle for index '0': Unknown Error

This change pulls in the go-nvlib changes from NVIDIA/go-nvlib#80 (vendored locally), which allow errors encountered while enumerating devices to be ignored, and ensures that the device lib is constructed with the required option. A simple unit test demonstrates how these errors are handled so that labels are still generated.
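
For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of the behavior described above: an enumerator that either aborts on the first failing device or records the failure and continues. It does not use go-nvlib. The names `Device`, `enumerator`, `WithIgnoreErrors`, and `fakeHandles` are hypothetical and chosen only for illustration; the actual option added in NVIDIA/go-nvlib#80 may be named and shaped differently.

```go
package main

import (
	"errors"
	"fmt"
)

// Device is a hypothetical stand-in for a GPU device handle.
type Device struct {
	Index int
	Name  string
}

// handleFunc is a hypothetical lookup returning the device at index i.
type handleFunc func(i int) (Device, error)

// enumerator walks device indices. ignoreErrors is a hypothetical setting
// that mirrors the idea of "ignore errors when enumerating devices".
type enumerator struct {
	count        int
	getHandle    handleFunc
	ignoreErrors bool
}

// Option configures an enumerator (functional options pattern).
type Option func(*enumerator)

// WithIgnoreErrors is a hypothetical option name, not the go-nvlib API.
func WithIgnoreErrors(ignore bool) Option {
	return func(e *enumerator) { e.ignoreErrors = ignore }
}

func newEnumerator(count int, get handleFunc, opts ...Option) *enumerator {
	e := &enumerator{count: count, getHandle: get}
	for _, opt := range opts {
		opt(e)
	}
	return e
}

// Devices returns the devices that could be opened. Without ignoreErrors,
// the first failure aborts enumeration. With it, each failure is recorded
// in skipped and enumeration continues with the next index.
func (e *enumerator) Devices() (devices []Device, skipped []error, err error) {
	for i := 0; i < e.count; i++ {
		d, herr := e.getHandle(i)
		if herr != nil {
			wrapped := fmt.Errorf("error getting device handle for index '%d': %w", i, herr)
			if !e.ignoreErrors {
				return nil, nil, wrapped
			}
			skipped = append(skipped, wrapped)
			continue
		}
		devices = append(devices, d)
	}
	return devices, skipped, nil
}

// fakeHandles simulates two devices where index 0 is faulty.
func fakeHandles(i int) (Device, error) {
	if i == 0 {
		return Device{}, errors.New("Unknown Error")
	}
	return Device{Index: i, Name: "gpu"}, nil
}

func main() {
	// Strict mode: the faulty device at index 0 aborts enumeration.
	_, _, err := newEnumerator(2, fakeHandles).Devices()
	fmt.Println("strict:", err)

	// Lenient mode: the faulty device is skipped and the healthy one is kept.
	devs, skipped, err := newEnumerator(2, fakeHandles, WithIgnoreErrors(true)).Devices()
	fmt.Println("lenient:", len(devs), "device(s),", len(skipped), "skipped, err =", err)
}
```

Returning the skipped errors alongside the healthy devices, rather than discarding them, lets a caller log which indices failed while still generating labels for the rest. Whether GFD surfaces them this way is not stated in this PR.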

Signed-off-by: Evan Lezar <elezar@nvidia.com>
@elezar elezar force-pushed the skip-errors-in-gfd branch from 04b8a45 to 74b7cb7 Compare December 10, 2025 14:16
 func newGFDRunner(cfg *Config, nvmllib nvml.Interface, vgpul vgpu.Interface, config *spec.Config) (*gfd, error) {
-	devicelib := device.New(nvmllib)
+	devicelib := device.New(nvmllib,
+		// TODO: Do we want to expose this as a config option?
Contributor

@klueska klueska Dec 12, 2025

You mean as a flag / envvar to the CLI / in the config file?
