Doc: Explain how to use a RayJob with Kueue and ProvisioningRequest despite GKE's single PodSet limitation #59070
…espite's GKE single PodSet limitation Signed-off-by: Fabio M. Graetz, Ph.D. <fabiograetz@googlemail.com>
Code Review
This pull request adds important documentation for running RayJobs with Kueue on GKE, providing a workaround for GKE's single PodSet limitation. The addition of podSetMergePolicy and the explanatory note are good. My feedback aims to make the documentation more explicit about the requirement for identical head and worker node specifications to ensure the example is complete and works for users.
:::{note}
Google Kubernetes Engine's queued provisioning feature currently supports only single PodSet per request. To circumvent this issue, we
set `podSetMergePolicy: IdenticalWorkloadSchedulingRequirements` in the `ProvisioningRequestConfig`. When giving the head node and the
worker nodes the same resource requirements, affinities, and tolerations, Kueue merges them into a single PodSet in the `ProvisioningRequest`.
:::
The explanation of the workaround is helpful, but the instructions could be more explicit to ensure users can successfully follow the example. The note mentions that head and worker nodes need identical scheduling requirements, but the steps for deploying the RayJob don't guide the user on how to verify or configure this.
If the downloaded ray-job.pytorch-distributed-training.yaml doesn't have identical specs for the head and worker groups, the example will fail for users.
To improve clarity and ensure the example is robust, consider making the instructions more direct. For example, you could update the note like this:
:::{note}
Google Kubernetes Engine's queued provisioning feature currently supports only a single PodSet per request. To use a RayJob with both head and worker nodes, you must configure Kueue to merge the pods into a single PodSet.
This requires two things:
1. Set `podSetMergePolicy: IdenticalWorkloadSchedulingRequirements` in the `ProvisioningRequestConfig`, as shown in this example.
2. **Crucially, you must ensure your RayJob's head and worker group specs have identical resource requirements, affinities, and tolerations.** The example `ray-job.pytorch-distributed-training.yaml` must be configured this way for the job to be admitted by Kueue.
:::
This change would make the requirements clearer to the user.
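One way a user could check the second requirement locally before submitting the job is sketched below. It is an illustration only: the helper names are hypothetical, and the set of compared fields (`nodeSelector`, `tolerations`, `affinity`, and per-container resource requests) is an assumption about what counts as a scheduling requirement, not Kueue's actual merge logic. It expects the RayJob manifest already parsed into a dict.

```python
# Hypothetical helper: the compared fields are an assumption about what counts
# as a "scheduling requirement", not Kueue's actual merge logic.
SCHEDULING_FIELDS = ("nodeSelector", "tolerations", "affinity")


def scheduling_requirements(pod_template: dict) -> dict:
    """Extract the parts of a pod template this sketch treats as scheduling-relevant."""
    spec = pod_template.get("spec", {})
    summary = {field: spec.get(field) for field in SCHEDULING_FIELDS}
    # Compare per-container resource requests, ignoring names, images, and commands.
    summary["requests"] = [
        container.get("resources", {}).get("requests", {})
        for container in spec.get("containers", [])
    ]
    return summary


def mismatched_worker_groups(rayjob: dict) -> list:
    """Return names of worker groups whose requirements differ from the head group's."""
    cluster = rayjob["spec"]["rayClusterSpec"]
    head = scheduling_requirements(cluster["headGroupSpec"]["template"])
    return [
        group.get("groupName", "<unnamed>")
        for group in cluster.get("workerGroupSpecs", [])
        if scheduling_requirements(group["template"]) != head
    ]
```

An empty list from `mismatched_worker_groups` means the head and every worker group agree on the compared fields. The manifest could be loaded with any YAML parser, for example PyYAML's `yaml.safe_load`, before being passed in.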


Description
The existing documentation explains how to run RayJobs with Kueue and queued provisioning on GKE. However, the documented manifests work only when the RayJob has a head node and no workers. If workers are added, GKE rejects the ProvisioningRequest because it currently supports only a single PodSet per request.
This PR documents how to circumvent this issue.
Related issues
Closes #59068 57839
Additional information
Created a feature request in GKE's issue tracker to allow multiple PodSets per request: https://issuetracker.google.com/issues/452882313