Incorrect GPU Specification and Machine Type Mapping for A100 in Vertex API #37

@jeffhernandez1995

Description

Hello,

I'd like to express my appreciation for the xmanager tool! However, I've noticed three issues with how the A100 GPU and its associated machine types are specified for the Vertex API, which I'd like to bring to your attention:

  1. GPU Naming Discrepancy:
    According to the Google Cloud resource documentation, the correct name for the A100 GPU with 80GB is A100_80GB, not A100_80GIB. This naming inconsistency leads to an error when requesting this resource. Reference: Google Cloud Documentation. Additionally, I've attached an image from the documentation.
    [Image: documentation screenshot]

  2. Incorrect API Call Formation:
    When A100_80GIB is referenced in the Vertex API, it results in the string 'NVIDIA_TESLA_A100_80GIB', whereas it should be 'NVIDIA_A100_80GB'. I believe this error stems from the line accelerator_type = 'NVIDIA_TESLA_' + str(resource).upper() in the vertex.py script.

  3. Machine Type Mismatch:
    The A100_80GB GPU should be associated with machine types such as 'a2-ultragpu-1g', 'a2-ultragpu-2g', 'a2-ultragpu-4g', and 'a2-ultragpu-8g'. However, the current specification only attempts to map A100 GPUs to the following machine types:

_A100_GPUS_TO_MACHINE_TYPE = {
    1: 'a2-highgpu-1g',
    2: 'a2-highgpu-2g',
    4: 'a2-highgpu-4g',
    8: 'a2-highgpu-8g',
    16: 'a2-megagpu-16g',
}
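To make the three points above concrete, here is a minimal sketch of one possible fix. It is not the actual xmanager code: the names `_A100_80GB_GPUS_TO_MACHINE_TYPE`, `accelerator_type_for`, and `machine_type_for` are hypothetical, and the sketch assumes the ultragpu machine types listed above map one-to-one to GPU counts 1, 2, 4, and 8. It also accepts the old A100_80GIB spelling as an alias, which is only an assumption about how backward compatibility might be handled.

```python
# Hypothetical sketch; names below are illustrative, not xmanager's API.

# Existing mapping quoted in this issue (40GB A100).
_A100_GPUS_TO_MACHINE_TYPE = {
    1: 'a2-highgpu-1g',
    2: 'a2-highgpu-2g',
    4: 'a2-highgpu-4g',
    8: 'a2-highgpu-8g',
    16: 'a2-megagpu-16g',
}

# Assumed mapping for the 80GB A100, built from the machine types
# listed in point 3.
_A100_80GB_GPUS_TO_MACHINE_TYPE = {
    1: 'a2-ultragpu-1g',
    2: 'a2-ultragpu-2g',
    4: 'a2-ultragpu-4g',
    8: 'a2-ultragpu-8g',
}

# Accept both spellings; treating A100_80GIB as an alias is an assumption.
_A100_80GB_NAMES = frozenset({'A100_80GB', 'A100_80GIB'})


def accelerator_type_for(resource_name: str) -> str:
    """Returns the Vertex accelerator type string for a GPU resource name."""
    name = resource_name.upper()
    if name in _A100_80GB_NAMES:
        # The 80GB A100 does not follow the 'NVIDIA_TESLA_' prefix rule.
        return 'NVIDIA_A100_80GB'
    return 'NVIDIA_TESLA_' + name


def machine_type_for(resource_name: str, count: int) -> str:
    """Returns the a2 machine type for an A100 variant and GPU count."""
    name = resource_name.upper()
    if name in _A100_80GB_NAMES:
        table = _A100_80GB_GPUS_TO_MACHINE_TYPE
    elif name == 'A100':
        table = _A100_GPUS_TO_MACHINE_TYPE
    else:
        raise ValueError(f'Not an A100 variant: {resource_name}')
    if count not in table:
        raise ValueError(f'Unsupported {name} GPU count: {count}')
    return table[count]
```

With something along these lines, `accelerator_type_for('A100_80GB')` would produce the string from point 2, and `machine_type_for('A100_80GB', 4)` would select an ultragpu machine type instead of a highgpu one.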

Thank you for your attention to this matter.
