Instance types are an Azure Machine Learning concept that allows targeting certain types of
compute nodes for training and inference workloads. For an Azure VM, an example of an
instance type is STANDARD_D2_V3.
In Kubernetes clusters, instance types are represented by two elements: `nodeSelector` and `resources`.
In short, a `nodeSelector` lets us specify which node a pod should run on; the node must have a
corresponding label. In the `resources` section, we can set the compute resources (CPU, memory, and
Nvidia GPU) for the pod.
Instance types are represented in a custom resource definition (CRD) that is installed with the Azure Machine Learning extension. To create a new instance type, create a new custom resource for the instance type CRD. For example:
```
kubectl apply -f my_instance_type.yaml
```

With `my_instance_type.yaml`:
```yaml
apiVersion: amlarc.azureml.com/v1alpha1
kind: InstanceType
metadata:
  name: myinstancetypename
spec:
  nodeSelector:
    mylabel: mylabelvalue
  resources:
    limits:
      cpu: "1"
      nvidia.com/gpu: 1
      memory: "2Gi"
    requests:
      cpu: "700m"
      memory: "1500Mi"
```

This creates an instance type with the following behavior:
- Pods will be scheduled only on nodes with label `mylabel: mylabelvalue`.
- Pods will be assigned resource requests of `700m` CPU and `1500Mi` memory.
- Pods will be assigned resource limits of `1` CPU, `2Gi` memory, and `1` Nvidia GPU.
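For the `nodeSelector` to match, at least one node in the cluster must carry the corresponding label. A node can be labeled with kubectl; `<node-name>` below is a placeholder for an actual node name in your cluster:

```shell
# Add the label expected by the instance type's nodeSelector to a node.
kubectl label nodes <node-name> mylabel=mylabelvalue

# Confirm the label is present.
kubectl get nodes --show-labels | grep mylabelvalue
```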
Note:
- Nvidia GPU resources are specified only in the `limits` section, as integer values. For more information, please refer to the Kubernetes documentation.
- CPU and memory resources are string values.
- CPU can be specified in millicores, for example `100m`, or in full numbers, for example `"1"`, which is equivalent to `1000m`.
- Memory can be specified as a full number plus suffix, for example `1024Mi` for 1024 MiB.
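As a quick illustration of the millicore notation (plain shell arithmetic, independent of any cluster), the example instance type above requests `700m` CPU against a limit of `"1"` CPU, which is 1000 millicores:

```shell
# 1 CPU == 1000 millicores; "700m" means 700 millicores.
cpu_limit_m=1000    # limit: cpu "1"
cpu_request_m=700   # request: cpu "700m"

# The request expressed as a percentage of the limit.
echo $(( cpu_request_m * 100 / cpu_limit_m ))   # prints 70
```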
It is also possible to create multiple instance types at once:
```
kubectl apply -f my_instance_type_list.yaml
```

With `my_instance_type_list.yaml`:
```yaml
apiVersion: amlarc.azureml.com/v1alpha1
kind: InstanceTypeList
items:
  - metadata:
      name: cpusmall
    spec:
      resources:
        requests:
          cpu: "100m"
          memory: "100Mi"
        limits:
          cpu: "1"
          nvidia.com/gpu: 0
          memory: "1Gi"
  - metadata:
      name: defaultinstancetype
    spec:
      resources:
        requests:
          cpu: "1"
          memory: "1Gi"
        limits:
          cpu: "1"
          nvidia.com/gpu: 0
          memory: "1Gi"
```

The above example creates two instance types: `cpusmall` and `defaultinstancetype`. The latter
is explained in more detail in the following section.
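After applying the list, the created instance types can be verified with kubectl (assuming the Azure Machine Learning extension's CRD is installed in the cluster):

```shell
# List all instance types registered in the cluster.
kubectl get instancetype

# Inspect a single instance type in full.
kubectl get instancetype cpusmall -o yaml
```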
If a training or inference workload is submitted without an instance type, it uses the default
instance type. To specify a default instance type for a Kubernetes cluster, create an instance
type with name defaultinstancetype. It will automatically be recognized as the default.
If no default instance type was defined, the following default behavior applies:
- No nodeSelector is applied, meaning the pod can get scheduled on any node.
- The workload's pods are assigned default resources with 0.6 CPU cores, 1536Mi memory and 0 GPU:

```yaml
resources:
  requests:
    cpu: "0.6"
    memory: "1536Mi"
  limits:
    cpu: "0.6"
    memory: "1536Mi"
    nvidia.com/gpu: null
```

- This default instance type will not appear as an InstanceType custom resource in the cluster when running the command `kubectl get instancetype`, but it will appear in all clients (UI, CLI, SDK).
Note: The default instance type purposefully uses minimal resources. To ensure all ML workloads run with appropriate resources, for example GPU resources, it is highly recommended to create custom instance types.
To select an instance type for a training job using CLI (V2), specify its name as part of the
`resources` section of the job YAML. For example:
```yaml
command: python -c "print('Hello world!')"
environment:
  image: library/python:latest
compute: azureml:<compute_target_name>
resources:
  instance_type: <instance_type_name>
```

In the above example, replace `<compute_target_name>` with the name of your Kubernetes compute
target and `<instance_type_name>` with the name of the instance type you wish to select.
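Assuming the job YAML above is saved as `job.yaml` (a hypothetical file name) and the Azure CLI with the `ml` extension is installed, the job can then be submitted with:

```shell
# Submit the training job; the instance type in the resources section
# determines the pod's nodeSelector and resource requests/limits.
az ml job create --file job.yaml \
  --resource-group <resource-group> \
  --workspace-name <workspace-name>
```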
To select an instance type for a model deployment using CLI (V2), specify its name in the deployment YAML. For example:
```yaml
name: blue
app_insights_enabled: true
endpoint_name: <endpoint name>
model:
  path: ./model/sklearn_mnist_model.pkl
code_configuration:
  code: ./script/
  scoring_script: score.py
instance_type: <instance type name>
environment:
  conda_file: file:./model/conda.yml
  image: mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:20210727.v1
```
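Assuming the deployment YAML above is saved as `deployment.yaml` (a hypothetical file name), the deployment can be created with the Azure CLI `ml` extension:

```shell
# Create the online deployment; pods for this deployment are sized
# according to the referenced instance type.
az ml online-deployment create --file deployment.yaml \
  --resource-group <resource-group> \
  --workspace-name <workspace-name>
```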