@fg91 fg91 commented Nov 29, 2025

Description

The documentation explains how to run RayJobs with Kueue and queued provisioning on GKE. The documented manifests work only when the RayJob has a head node and no workers. If workers are added, GKE rejects the ProvisioningRequest because it currently supports only a single PodSet per request.

This PR documents how to circumvent this issue.

Related issues

Closes #59068 57839

Additional information

Created a feature request in GKE's issue tracker to allow multiple PodSets per request: https://issuetracker.google.com/issues/452882313

…espite's GKE single PodSet limitation

Signed-off-by: Fabio M. Graetz, Ph.D. <fabiograetz@googlemail.com>
@fg91 fg91 requested review from a team as code owners November 29, 2025 08:10

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds important documentation for running RayJobs with Kueue on GKE, providing a workaround for GKE's single PodSet limitation. The addition of podSetMergePolicy and the explanatory note are good. My feedback aims to make the documentation more explicit about the requirement for identical head and worker node specifications to ensure the example is complete and works for users.

Comment on lines +159 to +163
:::{note}
Google Kubernetes Engine's queued provisioning feature currently supports only a single PodSet per request. To circumvent this issue, we
set `podSetMergePolicy: IdenticalWorkloadSchedulingRequirements` in the `ProvisioningRequestConfig`. When giving the head node and the
worker nodes the same resource requirements, affinities, and tolerations, Kueue merges them into a single PodSet in the `ProvisioningRequest`.
:::

medium

The explanation of the workaround is helpful, but the instructions could be more explicit to ensure users can successfully follow the example. The note mentions that head and worker nodes need identical scheduling requirements, but the steps for deploying the RayJob don't guide the user on how to verify or configure this.

If the downloaded ray-job.pytorch-distributed-training.yaml doesn't have identical specs for the head and worker groups, the example will fail for users.

To improve clarity and ensure the example is robust, consider making the instructions more direct. For example, you could update the note like this:

:::{note}
Google Kubernetes Engine's queued provisioning feature currently supports only a single PodSet per request. To use a RayJob with both head and worker nodes, you must configure Kueue to merge the pods into a single PodSet.

This requires two things:
1.  Set `podSetMergePolicy: IdenticalWorkloadSchedulingRequirements` in the `ProvisioningRequestConfig`, as shown in this example.
2.  **Crucially, you must ensure your RayJob's head and worker group specs have identical resource requirements, affinities, and tolerations.** The example `ray-job.pytorch-distributed-training.yaml` must be configured this way for the job to be admitted by Kueue.
:::

This change would make the requirements clearer to the user.
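
As a rough illustration of the second requirement, the following Python sketch compares the head group's pod spec with each worker group's pod spec in a RayJob manifest that has already been loaded into a dict. It only looks at the three fields named in the note (container resources, affinity, tolerations) and is an approximation for pre-flight checking, not Kueue's actual merge logic. The helper names `scheduling_fingerprint`, `mismatched_worker_groups`, and `make_pod_spec`, as well as the example resource and toleration values, are hypothetical.

```python
import copy
from typing import Any


def scheduling_fingerprint(pod_spec: dict[str, Any]) -> dict[str, Any]:
    """Collect the fields the note lists: resources, affinity, tolerations.

    Container names and images are ignored on purpose, because the head and
    worker containers are expected to differ there.
    """
    return {
        "resources": [c.get("resources", {}) for c in pod_spec.get("containers", [])],
        "affinity": pod_spec.get("affinity", {}),
        "tolerations": pod_spec.get("tolerations", []),
    }


def mismatched_worker_groups(ray_job: dict[str, Any]) -> list[str]:
    """Return the names of worker groups whose fingerprint differs from the head's."""
    cluster = ray_job["spec"]["rayClusterSpec"]
    head = scheduling_fingerprint(cluster["headGroupSpec"]["template"]["spec"])
    return [
        group.get("groupName", f"worker-group-{i}")
        for i, group in enumerate(cluster.get("workerGroupSpecs", []))
        if scheduling_fingerprint(group["template"]["spec"]) != head
    ]


def make_pod_spec(container_name: str) -> dict[str, Any]:
    """Build an illustrative pod spec (values are assumptions, not from the PR)."""
    return {
        "containers": [
            {"name": container_name, "resources": {"limits": {"nvidia.com/gpu": "1"}}}
        ],
        "tolerations": [
            {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
        ],
    }


# A RayJob whose head and worker groups share the same scheduling requirements.
example_job = {
    "spec": {
        "rayClusterSpec": {
            "headGroupSpec": {"template": {"spec": make_pod_spec("ray-head")}},
            "workerGroupSpecs": [
                {"groupName": "workers", "template": {"spec": make_pod_spec("ray-worker")}}
            ],
        }
    }
}

# The same job, but with the worker tolerations removed so the groups diverge.
broken_job = copy.deepcopy(example_job)
broken_job["spec"]["rayClusterSpec"]["workerGroupSpecs"][0]["template"]["spec"]["tolerations"] = []

print(mismatched_worker_groups(example_job))  # []
print(mismatched_worker_groups(broken_job))   # ['workers']
```

An empty result suggests the groups match on these three fields; any listed group name is a candidate for the mismatch that would keep the pods from being merged into a single PodSet.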

@Future-Outlier Future-Outlier left a comment

@Future-Outlier Future-Outlier added the go add ONLY when ready to merge, run all tests label Nov 29, 2025
@ray-gardener ray-gardener bot added docs An issue or change related to documentation core Issues that should be addressed in Ray Core community-contribution Contributed by the community labels Nov 29, 2025