-
Notifications
You must be signed in to change notification settings - Fork 765
[GFD] Prevent GFD from crashing when device is unhealthy #1510
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
elezar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is what you're doing here different to just ignoring errors during labelling?
1461168 to
db81e51
Compare
PTAL |
db81e51 to
8769dd8
Compare
Make Merge() resilient to individual labeler failures by logging errors as warnings and continuing with remaining labelers. This prevents GFD from crashing when devices go unhealthy (e.g., XID errors) and allows partial label generation. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
8769dd8 to
df34cb8
Compare
| labels, err := labeler.Labels() | ||
| if err != nil { | ||
| return nil, fmt.Errorf("error generating labels: %v", err) | ||
| klog.Warningf("Labeler at index %d failed, skipping: %v", i, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a nit: is the index useful in this case? Should we add a Name() or String() function to the labeler?
| labels, err := labeler.Labels() | ||
| if err != nil { | ||
| return nil, fmt.Errorf("error generating labels: %v", err) | ||
| klog.Warningf("Labeler at index %d failed, skipping: %v", i, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, does it make sense to use:
| klog.Warningf("Labeler at index %d failed, skipping: %v", i, err) | |
| klog.ErrorS(err, "skipping failed labeller", "index", i) |
insread?
| ) | ||
|
|
||
| // mockLabeler is a test double for the Labeler interface | ||
| type mockLabeler struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: Why not use moq to generatre an fully-featured mock?
elezar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that since the errors are thrown when the NVML labeller is constructed this is not sufficient to address the issue that we are seeing. To confirm, use a mock NVML that raises errors in any of the functions that are used.
|
Closing in favor of #1561 |
Modified Merge() to log and skip failed labelers instead of aborting the entire pipeline. When a device fails (XID error, GPU lost, etc.), GFD continues with healthy devices and generates partial labels.