[Enhancement] Support for GPU state anomaly detection #78

@XbaoWu

Description

What is the problem you're trying to solve

At present, the device plugin reports the GPU resources on a node only once, at startup. If a GPU on the node later becomes abnormal (for example, a card drops out or enters the draining state), the plugin does not automatically update the amount of GPU resources it reports.
For common or known failure modes, I think the device plugin should report the latest available GPU resources.

Describe the solution you'd like

When certain pre-identified anomalies occur, the device plugin could re-trigger its GPU resource report so that the node advertises only the GPUs that are still usable.

Is this issue worth our attention? If anyone has alternative proposals or insights, I would welcome a detailed discussion.
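To make the proposal more concrete, here is a minimal sketch of the re-reporting logic in Python. It is only an illustration, not this plugin's code: `HealthReporter`, `observe`, and the `publish` callback are hypothetical names, and the probe results are assumed to come from some external health check (for instance a periodic driver query). The `Healthy` / `Unhealthy` strings follow the health values used by the Kubernetes device plugin API, where an updated device list would normally be sent over the `ListAndWatch` stream.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

HEALTHY = "Healthy"
UNHEALTHY = "Unhealthy"


@dataclass(frozen=True)
class Device:
    """One GPU as advertised to the kubelet (hypothetical stand-in type)."""
    id: str
    health: str


class HealthReporter:
    """Tracks per-GPU health and re-publishes the device list on change.

    `publish` is an assumed callback; in a real plugin it would push the
    list to the kubelet instead of being called directly like this.
    """

    def __init__(self, device_ids: Iterable[str],
                 publish: Callable[[List[Device]], None]) -> None:
        self._health: Dict[str, str] = {d: HEALTHY for d in device_ids}
        self._publish = publish
        # Initial report, equivalent to what happens at startup today.
        self._publish(self.snapshot())

    def snapshot(self) -> List[Device]:
        """Return the current device list, sorted by device ID."""
        return [Device(i, h) for i, h in sorted(self._health.items())]

    def observe(self, probe_results: Dict[str, bool]) -> bool:
        """Apply one round of probe results; re-report only if health changed.

        A GPU missing from `probe_results` is treated as dropped out and
        therefore unhealthy (an assumption of this sketch).
        """
        changed = False
        for dev_id in self._health:
            ok = probe_results.get(dev_id, False)
            new_state = HEALTHY if ok else UNHEALTHY
            if self._health[dev_id] != new_state:
                self._health[dev_id] = new_state
                changed = True
        if changed:
            self._publish(self.snapshot())
        return changed
```

For example, with `reports = []` and `r = HealthReporter(["gpu-0", "gpu-1"], reports.append)`, calling `r.observe({"gpu-0": True})` marks `gpu-1` as unhealthy and publishes a new list, while a later probe that sees both GPUs again marks it healthy and publishes once more. Publishing only on a state change avoids sending redundant updates on every probe interval.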

Additional context

No response
