Open
Labels
component/controller (Related to the operator controller), enhancement (New feature or request), tech-debt (Technical debt and code cleanup)
Description
Context
internal/controller/model_controller.go has a checkAcceleratorAvailability() method that always returns true with a TODO comment:
// TODO: implement actual GPU/Metal availability checking
func (r *ModelReconciler) checkAcceleratorAvailability(hardware *HardwareSpec) bool {
	if hardware == nil {
		return true
	}
	return true
}
Problem
status.acceleratorReady is always true regardless of whether the requested accelerator (CUDA, Metal, ROCm) is actually available on the target node. This can mislead users into thinking GPU acceleration is active when it isn't.
Proposed Solution
- For CUDA: Check if the nvidia.com/gpu resource is available on nodes (via node capacity or the NVIDIA device plugin)
- For Metal: Check if the Metal agent is reachable
- For ROCm: Check if the amd.com/gpu resource is available
- Set status.acceleratorReady = false with a condition when unavailable
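The checks listed above could be sketched roughly as follows. This is a minimal, dependency-free illustration and not the operator's actual code: the HardwareSpec field, the accelerator name constants, the NodeResources map, the metalReachable probe, and the reason strings are all hypothetical stand-ins. In the real controller the per-node quantities would presumably be read from each node's status (capacity or allocatable), and the result would be translated into status.acceleratorReady plus a condition.

```go
package main

import "fmt"

// Hypothetical accelerator identifiers; the issue only names CUDA, Metal, and ROCm.
const (
	AcceleratorCUDA  = "cuda"
	AcceleratorMetal = "metal"
	AcceleratorROCm  = "rocm"
)

// HardwareSpec is a stand-in for the operator's real type (fields assumed).
type HardwareSpec struct {
	Accelerator string
}

// NodeResources maps a resource name (e.g. "nvidia.com/gpu") to the quantity
// one node advertises. Assumed to be filled from node capacity or allocatable.
type NodeResources map[string]int64

// AvailabilityResult carries the ready flag plus a reason and message that
// could back a status condition when the accelerator is unavailable.
type AvailabilityResult struct {
	Ready   bool
	Reason  string
	Message string
}

// checkAcceleratorAvailability reports whether the requested accelerator
// appears usable. metalReachable is a hypothetical probe for the Metal agent.
func checkAcceleratorAvailability(hw *HardwareSpec, nodes []NodeResources, metalReachable func() bool) AvailabilityResult {
	if hw == nil || hw.Accelerator == "" {
		// Nothing requested, so there is nothing to be unavailable.
		return AvailabilityResult{Ready: true, Reason: "NoAcceleratorRequested"}
	}
	switch hw.Accelerator {
	case AcceleratorCUDA:
		return checkResource(nodes, "nvidia.com/gpu")
	case AcceleratorROCm:
		return checkResource(nodes, "amd.com/gpu")
	case AcceleratorMetal:
		if metalReachable != nil && metalReachable() {
			return AvailabilityResult{Ready: true, Reason: "MetalAgentReachable"}
		}
		return AvailabilityResult{Reason: "MetalAgentUnreachable", Message: "Metal agent is not reachable"}
	default:
		return AvailabilityResult{
			Reason:  "UnknownAccelerator",
			Message: fmt.Sprintf("unsupported accelerator %q", hw.Accelerator),
		}
	}
}

// checkResource returns Ready when at least one node advertises a positive
// quantity of the named resource.
func checkResource(nodes []NodeResources, name string) AvailabilityResult {
	for _, n := range nodes {
		if n[name] > 0 {
			return AvailabilityResult{Ready: true, Reason: "ResourceAvailable"}
		}
	}
	return AvailabilityResult{
		Reason:  "ResourceUnavailable",
		Message: fmt.Sprintf("no node advertises %s", name),
	}
}
```

Returning a reason and message instead of a bare bool keeps enough detail to populate the condition mentioned in the last bullet. The reconciler would set status.acceleratorReady from Ready and record Reason and Message on the condition.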
Location
internal/controller/model_controller.go:450