
Conversation

@ArangoGutierrez (Collaborator):

Improve the device health check system to prevent blocking, enable
graceful shutdown, and provide better error categorization. These
changes address stability issues in production environments with
multiple GPUs and bursty XID error scenarios.

@ArangoGutierrez self-assigned this Dec 4, 2025

Add buffered channels (64), non-blocking writes, graceful shutdown,
stats collection, and automatic device recovery detection (30s).

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
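
A minimal sketch of the buffered, non-blocking write pattern this commit describes; the channel size matches the commit message, but the eventResult type and counter names are illustrative, not the PR's actual identifiers.

package main

import "fmt"

// eventResult is a hypothetical stand-in for the value the NVML wait loop
// hands to the consumer.
type eventResult struct {
    xid uint64
}

func main() {
    // Buffered to absorb bursts of XID events without stalling the producer.
    events := make(chan eventResult, 64)
    dropped := 0

    // Non-blocking write: if the buffer is full, count the drop rather than
    // block the health-check loop behind a slow consumer.
    select {
    case events <- eventResult{xid: 79}:
    default:
        dropped++
    }

    fmt.Printf("queued=%d dropped=%d\n", len(events), dropped)
}
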
Encapsulate health checking state into dedicated struct to improve
modularity and testability. This struct groups related data (device
maps, XID filtering, stats) and will enable focused methods for device
registration and event monitoring.

No behavior changes - struct is defined but not yet used.

Inspired by elezar/refactor-health approach.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
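
A sketch of the grouping this commit describes; the first two fields are assumptions based on the commit message, while stats matches the fragment visible later in the diff. The placeholder types exist only so the sketch compiles on its own.

package health

// Placeholder types so the sketch is self-contained (hypothetical).
type Device struct{ UUID string }

type healthCheckStats struct {
    eventsReceived uint64
    eventsDropped  uint64
}

// nvmlHealthProvider groups the health-checking state: device maps for
// lookups by UUID, the XID filter, and the stats counters.
type nvmlHealthProvider struct {
    devices     map[string]*Device // assumed: device map keyed by UUID
    ignoredXids map[uint64]bool    // assumed: XID filtering state
    stats       *healthCheckStats
}
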
Separate device registration logic into a focused method on
nvmlHealthProvider. This improves testability by allowing device
registration to be tested independently from the event monitoring loop.

The method handles:
- Getting device handles by UUID
- Checking supported event types
- Registering events with the event set
- Marking devices unhealthy on registration failures

Inspired by elezar/refactor-health (a6a9f18).

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
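
A sketch of the four steps listed above, assuming go-nvml's event API (DeviceGetHandleByUUID, GetSupportedEventTypes, RegisterEvents); the markUnhealthy callback is hypothetical and error handling is simplified.

package health

import "github.com/NVIDIA/go-nvml/pkg/nvml"

// registerDeviceEvents subscribes each device to XID events, marking a
// device unhealthy whenever any registration step fails.
func registerDeviceEvents(uuids []string, eventSet nvml.EventSet, markUnhealthy func(uuid string)) {
    for _, uuid := range uuids {
        device, ret := nvml.DeviceGetHandleByUUID(uuid)
        if ret != nvml.SUCCESS {
            markUnhealthy(uuid) // cannot resolve a handle: treat as unhealthy
            continue
        }
        // Check which event types the device supports before subscribing.
        supported, ret := device.GetSupportedEventTypes()
        if ret != nvml.SUCCESS || supported&nvml.EventTypeXidCriticalError == 0 {
            markUnhealthy(uuid)
            continue
        }
        // Register the events with the shared event set.
        if ret := device.RegisterEvents(nvml.EventTypeXidCriticalError, eventSet); ret != nvml.SUCCESS {
            markUnhealthy(uuid)
        }
    }
}
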
Separate the event monitoring loop into a focused method on
nvmlHealthProvider. This preserves all robustness features:
- Context-based shutdown coordination
- Buffered event channel with goroutine receiver
- Granular error handling via callback
- Stats tracking for observability
- XID filtering
- MIG device support

The method is now testable independently from NVML initialization
and device registration. Error handling is injected as a callback
to maintain flexibility.

Inspired by elezar/refactor-health (a6a9f18).

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
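
A sketch of the loop shape this commit preserves, assuming go-nvml's EventSet.Wait; the eventResult type, channel size, and timeout are illustrative, and the error handling is injected as a callback, matching the commit's description.

package health

import (
    "context"

    "github.com/NVIDIA/go-nvml/pkg/nvml"
)

type eventResult struct {
    event nvml.EventData
    ret   nvml.Return
}

// monitorEvents runs a receiver goroutine that blocks in eventSet.Wait and
// forwards results over a buffered channel; the outer loop applies the
// injected callback and exits on context cancellation.
func monitorEvents(ctx context.Context, eventSet nvml.EventSet, onEvent func(eventResult)) {
    eventChan := make(chan eventResult, 64)

    go func() {
        for {
            e, ret := eventSet.Wait(5000) // bounded wait so shutdown is observed
            select {
            case <-ctx.Done():
                return
            case eventChan <- eventResult{event: e, ret: ret}:
            }
        }
    }()

    for {
        select {
        case <-ctx.Done():
            return // context-based shutdown coordination
        case result := <-eventChan:
            onEvent(result) // granular error handling via callback
        }
    }
}
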
Improve documentation of checkHealth to clarify its role as the main
orchestrator that coordinates:
- NVML initialization and resource management
- Device placement mapping (MIG support)
- Health provider creation and configuration
- Event registration and monitoring
- Shutdown coordination and stats reporting

The function is now much more readable with clear delegation to
focused methods. All functionality preserved.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
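
A compressed sketch of that orchestration order, continuing the hypothetical package from the sketches above (with fmt added to its imports) and reusing registerDeviceEvents and monitorEvents; everything except the checkHealth name and the NVML calls is an assumption.

// checkHealth orchestrates the pieces: NVML lifecycle, event set creation,
// device registration, and monitoring until the context is cancelled.
func checkHealth(ctx context.Context, uuids []string, onEvent func(eventResult)) error {
    if ret := nvml.Init(); ret != nvml.SUCCESS {
        return fmt.Errorf("failed to initialize NVML: %v", nvml.ErrorString(ret))
    }
    defer nvml.Shutdown() // resource management

    eventSet, ret := nvml.EventSetCreate()
    if ret != nvml.SUCCESS {
        return fmt.Errorf("failed to create event set: %v", nvml.ErrorString(ret))
    }
    defer eventSet.Free()

    registerDeviceEvents(uuids, eventSet, func(string) {}) // event registration
    monitorEvents(ctx, eventSet, onEvent)                  // runs until ctx is done
    return nil
}
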
Add comprehensive unit test coverage for XID filtering:

Test Coverage:
- XID parsing logic (newHealthCheckXIDs) - 10 test cases
- XID filtering with environment variables - 5 test cases
- Default ignored XIDs validation
- Environment variable override behavior

Key Features:
- Tests XID filtering (13, 31, 43, 45, 68, 109 filtered by default)
- Validates 'all' and 'xids' keywords
- Verifies that enabled XIDs take precedence over disabled ones
- All tests pass with -race flag

Inspired by elezar/refactor-health (dab53b9).

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
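
A hypothetical reconstruction of the parsing behavior these tests describe: comma-separated XID numbers, with "all" and "xids" accepted as keywords. The function name follows the commit message; everything else is assumed.

package health

import (
    "strconv"
    "strings"
)

// Default ignored XIDs, as listed in the commit message; these are treated
// as application-level errors rather than device failures (assumed rationale).
var defaultIgnoredXids = []uint64{13, 31, 43, 45, 68, 109}

// newHealthCheckXIDs parses a DP_DISABLE_HEALTHCHECKS / DP_ENABLE_HEALTHCHECKS
// style value into explicit XIDs plus an "all"/"xids" flag (assumed semantics).
func newHealthCheckXIDs(value string) (xids []uint64, all bool) {
    for _, field := range strings.Split(value, ",") {
        switch field = strings.TrimSpace(field); field {
        case "":
            // ignore empty fields
        case "all", "xids":
            all = true
        default:
            if xid, err := strconv.ParseUint(field, 10, 64); err == nil {
                xids = append(xids, xid)
            }
        }
    }
    return xids, all
}
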
Comment on lines +98 to +104
// CheckDeviceHealth performs a simple health check on a single device by
// verifying it can be accessed via NVML and responds to basic queries.
// This is used for recovery detection - if a previously unhealthy device
// passes this check, it's considered recovered. We intentionally keep this
// simple and don't try to classify XIDs as recoverable vs permanent - that's
// controlled via DP_DISABLE_HEALTHCHECKS / DP_ENABLE_HEALTHCHECKS env vars.
func (r *nvmlResourceManager) CheckDeviceHealth(d *Device) (bool, error) {
Member:

I don't agree with this mechanism for transitioning the device back to healthy. This is an oversimplification and will lead to unhealthy devices being considered healthy.

For example, if a device becomes unhealthy due to repeated ECC memory errors, it is LIKELY that query functions such as the device name will continue to succeed and result in the device being marked as healthy when it needs a RESET.

Before we add this logic to the device plugin, let us properly define and agree upon how we are detecting health.

Furthermore, although the XID-based health checking is a means to an end, our ideal state is that some other component decides whether a device is healthy and the device plugin responds to these signals. Defining the unhealthy -> healthy transition here goes against this premise.

return &x
}

func TestTriggerDeviceListUpdate_Phase2(t *testing.T) {
Member:

As a matter of interest, what is Phase2? (Were these tests generated?)

// nvmlHealthProvider encapsulates the state and logic for NVML-based GPU
// health monitoring. This struct groups related data and provides focused
// methods for device registration and event monitoring.
type nvmlHealthProvider struct {
Member:

Question: Why is the refactoring done AFTER the functional changes in this PR?

stats *healthCheckStats
}

// registerDeviceEvents registers NVML event handlers for all devices in the
Member:

How is this actually different from the changes proposed in a6a9f18?

Comment on lines +198 to +200
if result.ret == nvml.ERROR_TIMEOUT {
continue
}
Member:

Why do we even send the event in the case of a timeout?
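
For illustration only, one way to act on this would be to drop timeouts in the producer before they ever reach the channel; a sketch reusing the eventResult type and imports from the monitorEvents sketch above, not the PR's actual fix:

// waitLoop drops timeout results at the source, so the consumer never has
// to special-case nvml.ERROR_TIMEOUT.
func waitLoop(ctx context.Context, eventSet nvml.EventSet, eventChan chan<- eventResult) {
    for {
        e, ret := eventSet.Wait(5000)
        if ret == nvml.ERROR_TIMEOUT {
            continue // nothing happened; poll again without sending
        }
        select {
        case <-ctx.Done():
            return
        case eventChan <- eventResult{event: e, ret: ret}:
        }
    }
}
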

Comment on lines +175 to +180
// Try to send event result, but respect context cancellation
select {
case <-ctx.Done():
return
case eventChan <- eventResult{event: e, ret: ret}:
}
Member:

This seems like the wrong way to ensure that the context has not been closed before sending to the event channel. What are we concerned about here? Is there a better way to ensure that this goroutine terminates when the context is cancelled and doesn't block permanently on the send?
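
One possible answer, offered as an assumption rather than the PR's approach: the select above only runs between Wait calls, so a stronger guarantee comes from bounding each Wait and re-checking the context on every iteration. A sketch, again reusing the eventResult type from the earlier monitorEvents sketch:

// boundedWaitLoop re-checks ctx at least once per Wait timeout, so the
// goroutine can never park indefinitely on either Wait or the send.
func boundedWaitLoop(ctx context.Context, eventSet nvml.EventSet, eventChan chan<- eventResult) {
    for ctx.Err() == nil {
        e, ret := eventSet.Wait(1000) // 1s upper bound on each blocking call
        if ret == nvml.ERROR_TIMEOUT {
            continue
        }
        select {
        case <-ctx.Done():
            return
        case eventChan <- eventResult{event: e, ret: ret}:
        }
    }
}
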

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The commit message mentions adding tests, but I only see code being removed here.

