
Conversation

@ArangoGutierrez (Collaborator):

Improve the device health check system to prevent blocking, enable
graceful shutdown, and provide better error categorization. These
changes address stability issues in production environments with
multiple GPUs and bursty XID error scenarios.

@ArangoGutierrez self-assigned this Dec 4, 2025

Add buffered channels (64), non-blocking writes, graceful shutdown,
stats collection, and automatic device recovery detection (30s).

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
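
A minimal sketch of the buffered, non-blocking write pattern this commit describes; the channel size matches the commit message, but the eventResult type and counter names are illustrative, not the PR's actual identifiers.

package main

import "fmt"

// eventResult is a hypothetical stand-in for the value the NVML wait loop
// hands to the consumer.
type eventResult struct {
    xid uint64
}

func main() {
    // Buffered to absorb bursts of XID events without stalling the producer.
    events := make(chan eventResult, 64)
    dropped := 0

    // Non-blocking write: if the buffer is full, count the drop rather than
    // block the health-check loop behind a slow consumer.
    select {
    case events <- eventResult{xid: 79}:
    default:
        dropped++
    }

    fmt.Printf("queued=%d dropped=%d\n", len(events), dropped)
}
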
Encapsulate health checking state into dedicated struct to improve
modularity and testability. This struct groups related data (device
maps, XID filtering, stats) and will enable focused methods for device
registration and event monitoring.

No behavior changes - struct is defined but not yet used.

Inspired by elezar/refactor-health approach.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
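
A sketch of the grouping this commit describes; the first two fields are assumptions based on the commit message, while stats matches the fragment visible later in the diff. The placeholder types exist only so the sketch compiles on its own.

package health

// Placeholder types so the sketch is self-contained (hypothetical).
type Device struct{ UUID string }

type healthCheckStats struct {
    eventsReceived uint64
    eventsDropped  uint64
}

// nvmlHealthProvider groups the health-checking state: device maps for
// lookups by UUID, the XID filter, and the stats counters.
type nvmlHealthProvider struct {
    devices     map[string]*Device // assumed: device map keyed by UUID
    ignoredXids map[uint64]bool    // assumed: XID filtering state
    stats       *healthCheckStats
}
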
Separate device registration logic into a focused method on
nvmlHealthProvider. This improves testability by allowing device
registration to be tested independently from the event monitoring loop.

The method handles:
- Getting device handles by UUID
- Checking supported event types
- Registering events with the event set
- Marking devices unhealthy on registration failures

Inspired by elezar/refactor-health (a6a9f18).

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
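
A sketch of the four steps listed above, assuming go-nvml's event API (DeviceGetHandleByUUID, GetSupportedEventTypes, RegisterEvents); the markUnhealthy callback is hypothetical and error handling is simplified.

package health

import "github.com/NVIDIA/go-nvml/pkg/nvml"

// registerDeviceEvents subscribes each device to XID events, marking a
// device unhealthy whenever any registration step fails.
func registerDeviceEvents(uuids []string, eventSet nvml.EventSet, markUnhealthy func(uuid string)) {
    for _, uuid := range uuids {
        device, ret := nvml.DeviceGetHandleByUUID(uuid)
        if ret != nvml.SUCCESS {
            markUnhealthy(uuid) // cannot resolve a handle: treat as unhealthy
            continue
        }
        // Check which event types the device supports before subscribing.
        supported, ret := device.GetSupportedEventTypes()
        if ret != nvml.SUCCESS || supported&nvml.EventTypeXidCriticalError == 0 {
            markUnhealthy(uuid)
            continue
        }
        // Register the events with the shared event set.
        if ret := device.RegisterEvents(nvml.EventTypeXidCriticalError, eventSet); ret != nvml.SUCCESS {
            markUnhealthy(uuid)
        }
    }
}
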
Separate the event monitoring loop into a focused method on
nvmlHealthProvider. This preserves all robustness features:
- Context-based shutdown coordination
- Buffered event channel with goroutine receiver
- Granular error handling via callback
- Stats tracking for observability
- XID filtering
- MIG device support

The method is now testable independently from NVML initialization
and device registration. Error handling is injected as a callback
to maintain flexibility.

Inspired by elezar/refactor-health (a6a9f18).

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
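
A sketch of the loop shape this commit preserves, assuming go-nvml's EventSet.Wait; the eventResult type, channel size, and timeout are illustrative, and the error handling is injected as a callback, matching the commit's description.

package health

import (
    "context"

    "github.com/NVIDIA/go-nvml/pkg/nvml"
)

type eventResult struct {
    event nvml.EventData
    ret   nvml.Return
}

// monitorEvents runs a receiver goroutine that blocks in eventSet.Wait and
// forwards results over a buffered channel; the outer loop applies the
// injected callback and exits on context cancellation.
func monitorEvents(ctx context.Context, eventSet nvml.EventSet, onEvent func(eventResult)) {
    eventChan := make(chan eventResult, 64)

    go func() {
        for {
            e, ret := eventSet.Wait(5000) // bounded wait so shutdown is observed
            select {
            case <-ctx.Done():
                return
            case eventChan <- eventResult{event: e, ret: ret}:
            }
        }
    }()

    for {
        select {
        case <-ctx.Done():
            return // context-based shutdown coordination
        case result := <-eventChan:
            onEvent(result) // granular error handling via callback
        }
    }
}
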
Improve documentation of checkHealth to clarify its role as the main
orchestrator that coordinates:
- NVML initialization and resource management
- Device placement mapping (MIG support)
- Health provider creation and configuration
- Event registration and monitoring
- Shutdown coordination and stats reporting

The function is now much more readable with clear delegation to
focused methods. All functionality preserved.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
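
A compressed sketch of that orchestration order, continuing the hypothetical package from the sketches above (with fmt added to its imports) and reusing registerDeviceEvents and monitorEvents; everything except the checkHealth name and the NVML calls is an assumption.

// checkHealth orchestrates the pieces: NVML lifecycle, event set creation,
// device registration, and monitoring until the context is cancelled.
func checkHealth(ctx context.Context, uuids []string, onEvent func(eventResult)) error {
    if ret := nvml.Init(); ret != nvml.SUCCESS {
        return fmt.Errorf("failed to initialize NVML: %v", nvml.ErrorString(ret))
    }
    defer nvml.Shutdown() // resource management

    eventSet, ret := nvml.EventSetCreate()
    if ret != nvml.SUCCESS {
        return fmt.Errorf("failed to create event set: %v", nvml.ErrorString(ret))
    }
    defer eventSet.Free()

    registerDeviceEvents(uuids, eventSet, func(string) {}) // event registration
    monitorEvents(ctx, eventSet, onEvent)                  // runs until ctx is done
    return nil
}
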
Add comprehensive unit test coverage for XID filtering:

Test Coverage:
- XID parsing logic (newHealthCheckXIDs) - 10 test cases
- XID filtering with environment variables - 5 test cases
- Default ignored XIDs validation
- Environment variable override behavior

Key Features:
- Tests XID filtering (13, 31, 43, 45, 68, 109 filtered by default)
- Validates 'all' and 'xids' keywords
- Verifies that enabled XIDs take precedence over disabled ones
- All tests pass with -race flag

Inspired by elezar/refactor-health (dab53b9).

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
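
A hypothetical reconstruction of the parsing behavior these tests describe: comma-separated XID numbers, with "all" and "xids" accepted as keywords. The function name follows the commit message; everything else is assumed.

package health

import (
    "strconv"
    "strings"
)

// Default ignored XIDs, as listed in the commit message; these are treated
// as application-level errors rather than device failures (assumed rationale).
var defaultIgnoredXids = []uint64{13, 31, 43, 45, 68, 109}

// newHealthCheckXIDs parses a DP_DISABLE_HEALTHCHECKS / DP_ENABLE_HEALTHCHECKS
// style value into explicit XIDs plus an "all"/"xids" flag (assumed semantics).
func newHealthCheckXIDs(value string) (xids []uint64, all bool) {
    for _, field := range strings.Split(value, ",") {
        switch field = strings.TrimSpace(field); field {
        case "":
            // ignore empty fields
        case "all", "xids":
            all = true
        default:
            if xid, err := strconv.ParseUint(field, 10, 64); err == nil {
                xids = append(xids, xid)
            }
        }
    }
    return xids, all
}
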
Comment on lines +98 to +104
// CheckDeviceHealth performs a simple health check on a single device by
// verifying it can be accessed via NVML and responds to basic queries.
// This is used for recovery detection - if a previously unhealthy device
// passes this check, it's considered recovered. We intentionally keep this
// simple and don't try to classify XIDs as recoverable vs permanent - that's
// controlled via DP_DISABLE_HEALTHCHECKS / DP_ENABLE_HEALTHCHECKS env vars.
func (r *nvmlResourceManager) CheckDeviceHealth(d *Device) (bool, error) {
Member:

I don't agree with this mechanism for transitioning the device back to healthy. This is an oversimplification and will lead to unhealthy devices being considered healthy.

For example, if a device becomes unhealthy due to repeated ECC memory errors, it is LIKELY that query functions such as the device name will continue to succeed and result in the device being marked as healthy when it needs a RESET.

Before we add this logic to the device plugin, let us properly define and agree upon how we are detecting health.

Furthermore, although the XID-based health checking is a means to an end, our ideal state is that some other component decides whether a device is healthy and the device plugin responds to these signals. Defining the unhealthy -> healthy transition here goes against this premise.

return &x
}

func TestTriggerDeviceListUpdate_Phase2(t *testing.T) {
Member:

As a matter of interest, what is Phase2? (Were these tests generated?)

// nvmlHealthProvider encapsulates the state and logic for NVML-based GPU
// health monitoring. This struct groups related data and provides focused
// methods for device registration and event monitoring.
type nvmlHealthProvider struct {
Member:

Question: Why is the refactoring done AFTER the functional changes in this PR?

stats *healthCheckStats
}

// registerDeviceEvents registers NVML event handlers for all devices in the
Member:

How is this actually different from the changes proposed in a6a9f18?

Comment on lines +198 to +200
if result.ret == nvml.ERROR_TIMEOUT {
continue
}
Member:

Why do we even send the event in the case of a timeout?
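
For illustration only, one way to act on this would be to drop timeouts in the producer before they ever reach the channel; a sketch reusing the eventResult type and imports from the monitorEvents sketch above, not the PR's actual fix:

// waitLoop drops timeout results at the source, so the consumer never has
// to special-case nvml.ERROR_TIMEOUT.
func waitLoop(ctx context.Context, eventSet nvml.EventSet, eventChan chan<- eventResult) {
    for {
        e, ret := eventSet.Wait(5000)
        if ret == nvml.ERROR_TIMEOUT {
            continue // nothing happened; poll again without sending
        }
        select {
        case <-ctx.Done():
            return
        case eventChan <- eventResult{event: e, ret: ret}:
        }
    }
}
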

Comment on lines +175 to +180
// Try to send event result, but respect context cancellation
select {
case <-ctx.Done():
return
case eventChan <- eventResult{event: e, ret: ret}:
}
Member:

This seems like the wrong way to ensure that the context has not been closed before sending to the event channel. What are we concerned about here? Is there a better way to ensure that this goroutine terminates when the context is cancelled and doesn't block permanently on the send?
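
One possible answer, offered as an assumption rather than the PR's approach: the select above only runs between Wait calls, so a stronger guarantee comes from bounding each Wait and re-checking the context on every iteration. A sketch, again reusing the eventResult type from the earlier monitorEvents sketch:

// boundedWaitLoop re-checks ctx at least once per Wait timeout, so the
// goroutine can never park indefinitely on either Wait or the send.
func boundedWaitLoop(ctx context.Context, eventSet nvml.EventSet, eventChan chan<- eventResult) {
    for ctx.Err() == nil {
        e, ret := eventSet.Wait(1000) // 1s upper bound on each blocking call
        if ret == nvml.ERROR_TIMEOUT {
            continue
        }
        select {
        case <-ctx.Done():
            return
        case eventChan <- eventResult{event: e, ret: ret}:
        }
    }
}
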

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The commit message mentions adding tests, but I only see code being removed here.

