@@ -114,6 +114,7 @@ metadata:
name: rayjob-gpu-config
spec:
provisioningClassName: queued-provisioning.gke.io
podSetMergePolicy: IdenticalWorkloadSchedulingRequirements
managedResources:
- nvidia.com/gpu
---
@@ -155,6 +156,12 @@ kubectl apply -f kueue-resources.yaml
This example configures Kueue to orchestrate the gang scheduling of GPUs. However, you can use other resources such as CPU and memory.
:::

:::{note}
Google Kubernetes Engine's queued provisioning feature currently supports only a single PodSet per request. To work around this limitation,
this example sets `podSetMergePolicy: IdenticalWorkloadSchedulingRequirements` in the `ProvisioningRequestConfig`. When the head node and the
worker nodes have the same resource requirements, affinities, and tolerations, Kueue merges them into a single PodSet in the `ProvisioningRequest`.
:::
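
To illustrate what the merge policy expects, here is a minimal, hypothetical sketch of a RayJob whose head and worker pod templates declare the same resources and no differing affinities or tolerations; the names, image tag, replica count, and resource values below are placeholders rather than those used in the example manifest.

```yaml
# Hypothetical excerpt of a RayJob manifest; the image tag, replica count, and
# resource values are placeholders, not the ones used by the example manifest.
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: pytorch-text-classifier
spec:
  entrypoint: python ./train.py   # placeholder entrypoint
  rayClusterSpec:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray:2.9.0
            resources:
              requests:
                cpu: "4"
                memory: 16Gi
                nvidia.com/gpu: "1"
              limits:
                nvidia.com/gpu: "1"
    workerGroupSpecs:
    - groupName: gpu-workers
      replicas: 3
      rayStartParams: {}
      template:
        spec:
          containers:
          - name: ray-worker
            image: rayproject/ray:2.9.0
            resources:
              requests:            # identical to the head group, so Kueue can
                cpu: "4"           # merge head and worker pods into one PodSet
                memory: 16Gi
                nvidia.com/gpu: "1"
              limits:
                nvidia.com/gpu: "1"
```

If the two templates diverge, Kueue keeps them as separate PodSets, and GKE's queued provisioning, which supports only one PodSet per request, can't satisfy the resulting `ProvisioningRequest`.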
Comment on lines +159 to +163
medium

The explanation of the workaround is helpful, but the instructions could be more explicit to ensure users can successfully follow the example. The note mentions that head and worker nodes need identical scheduling requirements, but the steps for deploying the RayJob don't guide the user on how to verify or configure this.

If the downloaded `ray-job.pytorch-distributed-training.yaml` doesn't have identical specs for the head and worker groups, the example will fail for users.

To improve clarity and ensure the example is robust, consider making the instructions more direct. For example, you could update the note like this:

:::{note}
Google Kubernetes Engine's queued provisioning feature currently supports only a single PodSet per request. To use a RayJob with both head and worker nodes, you must configure Kueue to merge the pods into a single PodSet.

This requires two things:
1.  Set `podSetMergePolicy: IdenticalWorkloadSchedulingRequirements` in the `ProvisioningRequestConfig`, as shown in this example.
2.  **Crucially, you must ensure your RayJob's head and worker group specs have identical resource requirements, affinities, and tolerations.** The example `ray-job.pytorch-distributed-training.yaml` must be configured this way for the job to be admitted by Kueue.
:::

This change would make the requirements clearer to the user.


## Deploy a RayJob

Download the RayJob that executes all the steps documented in [Fine-tune a PyTorch Lightning Text Classifier](https://docs.ray.io/en/master/train/examples/lightning/lightning_cola_advanced.html). The [source code](https://github.com/ray-project/kuberay/tree/master/ray-operator/config/samples/pytorch-text-classifier) is also in the KubeRay repository.
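
For example, assuming the sample keeps its current location in the KubeRay repository, you can fetch and inspect the manifest as sketched below; the raw URL is inferred from the source-code link above and may change if the sample moves.

```bash
# The raw URL below is inferred from the KubeRay source link above; adjust the
# path if the sample has moved.
curl -LO https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/pytorch-text-classifier/ray-job.pytorch-distributed-training.yaml

# Check that the head and worker pod templates declare identical resource
# requests, affinities, and tolerations, as required by the note above.
grep -A 6 "resources:" ray-job.pytorch-distributed-training.yaml
```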