Open
Description
What is the problem you're trying to solve
Currently, the device plugin reports the GPU resources on a node only once, at startup. If a GPU on the node later becomes abnormal (for example, the card drops out or enters the draining state), the plugin does not automatically update the amount of GPU resources it reports.
For common or known exceptions, I think the device plugin should report the latest available GPU resources.
Describe the solution you'd like
When certain pre-identified anomalies occur, the device plugin could re-trigger its GPU resource report.
Is this issue worth our attention and resolution? If anyone has alternative proposals or insights, I would welcome an in-depth discussion.
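To make the proposal concrete, here is a minimal, language-agnostic sketch of the idea, written in Python for brevity. It assumes the plugin can periodically probe GPU health and can push a fresh device list to the node agent (in a Kubernetes-style device plugin, that would presumably mean sending an updated list on the `ListAndWatch` stream). The names `Gpu`, `HealthReporter`, `probe`, and `report` are all hypothetical and are not taken from the current plugin code.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Gpu:
    """Hypothetical view of one GPU as seen by a health probe."""
    uuid: str
    healthy: bool  # False for dropped-out or draining cards (assumption)


def healthy_ids(gpus: List[Gpu]) -> List[str]:
    """Return the sorted UUIDs of GPUs that are currently usable."""
    return sorted(g.uuid for g in gpus if g.healthy)


class HealthReporter:
    """Re-reports the available GPUs whenever the healthy set changes."""

    def __init__(
        self,
        probe: Callable[[], List[Gpu]],       # hypothetical health probe
        report: Callable[[List[str]], None],  # hypothetical "send update" hook
    ) -> None:
        self._probe = probe
        self._report = report
        self._last: Optional[List[str]] = None  # nothing reported yet

    def poll_once(self) -> bool:
        """Probe the GPUs; report and return True only if the set changed."""
        current = healthy_ids(self._probe())
        if current == self._last:
            return False
        self._last = current
        self._report(current)
        return True
```

A plugin could call `poll_once()` on a timer, or from an event handler that fires on known error conditions. Reporting only on change keeps the first call equivalent to today's startup report while avoiding redundant updates afterwards.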
Additional context
No response