Implement GPU/Metal availability checking in model controller #230

@Defilan

Description

Context

The checkAcceleratorAvailability() method in internal/controller/model_controller.go is a stub: it always returns true and carries a TODO comment:

// TODO: implement actual GPU/Metal availability checking
func (r *ModelReconciler) checkAcceleratorAvailability(hardware *HardwareSpec) bool {
    if hardware == nil {
        return true
    }
    return true
}

Problem

status.acceleratorReady is always true regardless of whether the requested accelerator (CUDA, Metal, ROCm) is actually available on the target node. This can mislead users into thinking GPU acceleration is active when it isn't.

Proposed Solution

  • For CUDA: Check whether the nvidia.com/gpu resource is available on nodes (via node capacity or the NVIDIA device plugin)
  • For Metal: Check whether the Metal agent is reachable
  • For ROCm: Check whether the amd.com/gpu resource is available on nodes
  • When the accelerator is unavailable, set status.acceleratorReady = false and add a condition explaining why
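
The checks listed above could be sketched roughly as follows. This is a minimal illustration, not the controller's actual code: nodeResources, metalProbe, checkAccelerator, and the accelerator strings "cuda", "rocm", and "metal" are all hypothetical names, since the issue does not show the fields of HardwareSpec or how the Metal agent is reached. In the real controller, the per-node resource counts would presumably come from each node's allocatable resources rather than a plain map.

```go
package main

import (
	"fmt"
	"strings"
)

// Extended resource names advertised by the vendor device plugins.
const (
	resourceNvidiaGPU = "nvidia.com/gpu"
	resourceAMDGPU    = "amd.com/gpu"
)

// nodeResources is a hypothetical stand-in for one node's allocatable
// extended resources (resource name -> count).
type nodeResources map[string]int64

// metalProbe is a hypothetical hook that returns nil when the Metal agent
// answers a health check. How the agent is contacted is an assumption.
type metalProbe func() error

// checkAccelerator reports whether the requested accelerator is usable and
// returns a message that could be placed in a status condition.
func checkAccelerator(accelerator string, nodes []nodeResources, probe metalProbe) (bool, string) {
	switch strings.ToLower(accelerator) {
	case "":
		// Nothing requested, so there is nothing to verify.
		return true, "no accelerator requested"
	case "cuda":
		return hasResource(nodes, resourceNvidiaGPU)
	case "rocm":
		return hasResource(nodes, resourceAMDGPU)
	case "metal":
		if probe == nil {
			return false, "no Metal agent probe configured"
		}
		if err := probe(); err != nil {
			return false, fmt.Sprintf("Metal agent unreachable: %v", err)
		}
		return true, "Metal agent reachable"
	default:
		return false, fmt.Sprintf("unknown accelerator %q", accelerator)
	}
}

// hasResource returns true if at least one node advertises a positive
// quantity of the named extended resource.
func hasResource(nodes []nodeResources, name string) (bool, string) {
	for _, n := range nodes {
		if n[name] > 0 {
			return true, fmt.Sprintf("%s available on at least one node", name)
		}
	}
	return false, fmt.Sprintf("no node advertises %s", name)
}
```

Returning a message alongside the boolean lets the reconciler set status.acceleratorReady and fill in the condition's reason from a single call. Unknown accelerator names are treated as unavailable here; whether that is the right default is a design choice for the implementer.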

Location

internal/controller/model_controller.go:450

Metadata

Assignees

No one assigned

    Labels

    component/controller (Related to the operator controller), enhancement (New feature or request), tech-debt (Technical debt and code cleanup)
