[Enhancement] Support for GPU state anomaly detection

### What is the problem you're trying to solve
At present, the device plugin only reports the GPU resources on a node at startup. If a GPU on the node becomes abnormal afterwards (for example, the card drops out or enters the draining state), the device plugin does not automatically update the amount of GPU resources it reports.
For common or known exceptions, I think we can let the device plugin report the latest available GPU resources.

### Describe the solution you'd like
When certain pre-identified anomalies are detected, the device plugin could re-trigger its GPU resource report so that the reported amount reflects the GPUs that are actually usable.

Is this issue worth addressing? If anyone has alternative proposals or insights, I would welcome further discussion.
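
As a rough illustration of the proposal, the sketch below polls each GPU, marks it unhealthy when a pre-identified anomaly is seen, and re-reports the full device list only when something has changed. It is not taken from this project's code: `GpuHealthWatcher`, `KNOWN_ANOMALIES`, the anomaly names, and the `probe`/`report` callbacks are all hypothetical. In a real Kubernetes device plugin, the `report` step would most likely correspond to sending an updated device list on the `ListAndWatch` stream, and the probe would be backed by the GPU driver or a monitoring library.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional

HEALTHY = "Healthy"
UNHEALTHY = "Unhealthy"

# Hypothetical names for the pre-identified anomalies mentioned in this issue.
KNOWN_ANOMALIES = {"card_dropout", "draining"}


@dataclass(frozen=True)
class Device:
    id: str
    health: str


class GpuHealthWatcher:
    """Re-reports the device list whenever a GPU's health changes."""

    def __init__(
        self,
        gpu_ids: Iterable[str],
        probe: Callable[[str], Optional[str]],
        report: Callable[[List[Device]], None],
    ) -> None:
        self._gpu_ids = list(gpu_ids)
        self._probe = probe    # assumed: returns an anomaly name, or None if the GPU looks fine
        self._report = report  # assumed: pushes the full device list to the kubelet
        self._last: Optional[List[Device]] = None

    def snapshot(self) -> List[Device]:
        """Probe every GPU and map known anomalies to an unhealthy state."""
        devices = []
        for gpu_id in self._gpu_ids:
            anomaly = self._probe(gpu_id)
            health = UNHEALTHY if anomaly in KNOWN_ANOMALIES else HEALTHY
            devices.append(Device(gpu_id, health))
        return devices

    def poll_once(self) -> bool:
        """Report the device list if it changed; return True when a report was sent."""
        current = self.snapshot()
        if current == self._last:
            return False
        self._last = current
        self._report(current)
        return True


if __name__ == "__main__":
    # Simulated GPU state standing in for a real driver query.
    state = {"gpu-0": None, "gpu-1": None}
    watcher = GpuHealthWatcher(state, state.get, print)
    watcher.poll_once()               # initial report at startup
    state["gpu-1"] = "card_dropout"   # simulate a card dropping out
    watcher.poll_once()               # health changed, so the list is reported again
```

In practice `poll_once` would run on a timer or be driven by driver events. Anomalies that are not in the known set are deliberately ignored here, which matches the idea of acting only on pre-identified cases.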


### Additional context
_No response_

[Enhancement] Support for GPU state anomaly detection #78
