From c0bce5c6579d1c55789c0b7c08a54cedeb6da8f7 Mon Sep 17 00:00:00 2001
From: Ania Borowiec
Date: Mon, 6 Oct 2025 10:03:52 +0000
Subject: [PATCH 1/5] prr questionnaire updated

---
 .../5278-nominated-node-name-for-expectation/README.md | 6 +++---
 .../5278-nominated-node-name-for-expectation/kep.yaml  | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/keps/sig-scheduling/5278-nominated-node-name-for-expectation/README.md b/keps/sig-scheduling/5278-nominated-node-name-for-expectation/README.md
index 6d9a13f4400..31be7be3194 100644
--- a/keps/sig-scheduling/5278-nominated-node-name-for-expectation/README.md
+++ b/keps/sig-scheduling/5278-nominated-node-name-for-expectation/README.md
@@ -630,7 +630,7 @@ Pods that are processed by Permit or PreBind plugins get NominatedNodeName durin
 
 ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
 
 Yes.
-The feature can be disabled in Alpha version by restarting the kube-scheduler and kube-apiserver with the feature-gate off.
+The feature can be disabled in Beta version by restarting the kube-scheduler and kube-apiserver with the feature-gate off.
 
 ###### What happens if we reenable the feature if it was previously rolled back?
@@ -752,8 +752,8 @@ No.
 
 ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
 
 Yes - but it should be negligible impact.
-The memory usage in kube-scheduler is supposed to increase by external components starting to use this
-because when `NominatedNodeName` is added on the pods, the scheduler's internal component called `nominator` has to record them so that scheduling cycles can refer to them as necessary.
+The memory usage in kube-scheduler is supposed to increase because when `NominatedNodeName` is added to the pods, the scheduler's
+internal component called `nominator` has to record them so that scheduling cycles can refer to them as necessary.
 
 ###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
diff --git a/keps/sig-scheduling/5278-nominated-node-name-for-expectation/kep.yaml b/keps/sig-scheduling/5278-nominated-node-name-for-expectation/kep.yaml
index 212a884713f..66154d24781 100644
--- a/keps/sig-scheduling/5278-nominated-node-name-for-expectation/kep.yaml
+++ b/keps/sig-scheduling/5278-nominated-node-name-for-expectation/kep.yaml
@@ -17,7 +17,7 @@ approvers:
 
 stage: alpha
 
-latest-milestone: "v1.34"
+latest-milestone: "v1.35"
 
 milestone:
   alpha: "v1.34"

From e0345cd755773ece7e4ac3d6da1f93404dfd777b Mon Sep 17 00:00:00 2001
From: Ania Borowiec
Date: Tue, 7 Oct 2025 09:14:36 +0000
Subject: [PATCH 2/5] prr approver updated

---
 keps/prod-readiness/sig-scheduling/5278.yaml | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/keps/prod-readiness/sig-scheduling/5278.yaml b/keps/prod-readiness/sig-scheduling/5278.yaml
index b3de9d3e79c..292cbad97c4 100644
--- a/keps/prod-readiness/sig-scheduling/5278.yaml
+++ b/keps/prod-readiness/sig-scheduling/5278.yaml
@@ -1,3 +1,5 @@
 kep-number: 5278
 alpha:
   approver: "@soltysh"
+beta:
+  approver: "@soltysh"

From 177e10577c395649769ea65b65f0a87ac4d20675 Mon Sep 17 00:00:00 2001
From: Ania Borowiec
Date: Wed, 8 Oct 2025 11:03:48 +0000
Subject: [PATCH 3/5] update stage in kep.yaml

---
 .../5278-nominated-node-name-for-expectation/kep.yaml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/keps/sig-scheduling/5278-nominated-node-name-for-expectation/kep.yaml b/keps/sig-scheduling/5278-nominated-node-name-for-expectation/kep.yaml
index 66154d24781..4abac9766ea 100644
--- a/keps/sig-scheduling/5278-nominated-node-name-for-expectation/kep.yaml
+++ b/keps/sig-scheduling/5278-nominated-node-name-for-expectation/kep.yaml
@@ -15,7 +15,7 @@ approvers:
   - "@macsko"
   - "@dom4ha"
 
-stage: alpha
+stage: beta
 
 latest-milestone: "v1.35"

From 7d504ebb5dcd4ea5ec0c1780857e596fd35f4a5c Mon Sep 17 00:00:00 2001
From: Ania Borowiec
Date: Thu, 9 Oct 2025 10:40:01 +0000
Subject: [PATCH 4/5] update kep to narrow down the scope

---
 .../README.md | 406 +++++++++++-------
 1 file changed, 248 insertions(+), 158 deletions(-)

diff --git a/keps/sig-scheduling/5278-nominated-node-name-for-expectation/README.md b/keps/sig-scheduling/5278-nominated-node-name-for-expectation/README.md
index 31be7be3194..f811ffce52b 100644
--- a/keps/sig-scheduling/5278-nominated-node-name-for-expectation/README.md
+++ b/keps/sig-scheduling/5278-nominated-node-name-for-expectation/README.md
@@ -81,7 +81,6 @@ tags, and then generate with `hack/update-toc.sh`.
 - [Summary](#summary)
 - [Motivation](#motivation)
   - [External components need to know where the pod is going to be bound](#external-components-need-to-know-where-the-pod-is-going-to-be-bound)
-  - [External components want to specify a preferred pod placement](#external-components-want-to-specify-a-preferred-pod-placement)
   - [Retain the scheduling decision](#retain-the-scheduling-decision)
   - [Goals](#goals)
   - [Non-Goals](#non-goals)
 - [Proposal](#proposal)
   - [User Stories (Optional)](#user-stories-optional)
     - [Story 1: Prevent inappropriate scale downs by Cluster Autoscaler](#story-1-prevent-inappropriate-scale-downs-by-cluster-autoscaler)
     - [Story 2: Scheduler can resume its work after restart](#story-2-scheduler-can-resume-its-work-after-restart)
-    - [Story 3: ClusterAutoscaler or Karpenter can influence scheduling decisions](#story-3-clusterautoscaler-or-karpenter-can-influence-scheduling-decisions)
-    - [Story 4: Kueue specifies NominatedNodeName to indicate where it prefers pods being scheduled to](#story-4-kueue-specifies-nominatednodename-to-indicate-where-it-prefers-pods-being-scheduled-to)
   - [Risks and Mitigations](#risks-and-mitigations)
-    - [NominatedNodeName can already be set by other components now.](#nominatednodename-can-already-be-set-by-other-components-now)
+    - [NominatedNodeName can be set by other components now.](#nominatednodename-can-be-set-by-other-components-now)
     - [Confusing semantics of NominatedNodeName](#confusing-semantics-of-nominatednodename)
     - [Increasing the load to kube-apiserver](#increasing-the-load-to-kube-apiserver)
-    - [Race condition](#race-condition)
     - [Confusion if NominatedNodeName is different from NodeName after all](#confusion-if-nominatednodename-is-different-from-nodename-after-all)
-    - [What if there are multiple components that could set NominatedNodeName on the same pod](#what-if-there-are-multiple-components-that-could-set-nominatednodename-on-the-same-pod)
-    - [Invalid NominatedNodeName prevents the pod from scheduling](#invalid-nominatednodename-prevents-the-pod-from-scheduling)
 - [Design Details](#design-details)
   - [The scheduler puts NominatedNodeName](#the-scheduler-puts-nominatednodename)
   - [External components put NominatedNodeName](#external-components-put-nominatednodename)
   - [The scheduler's cache for NominatedNodeName](#the-schedulers-cache-for-nominatednodename)
-  - [The scheduler only modifies NominatedNodeName, does not clear it in any case](#the-scheduler-only-modifies-nominatednodename-does-not-clear-it-in-any-case)
+  - [The scheduler clears NominatedNodeName after scheduling failure](#the-scheduler-clears-nominatednodename-after-scheduling-failure)
   - [Kube-apiserver clears NominatedNodeName when receiving binding requests](#kube-apiserver-clears-nominatednodename-when-receiving-binding-requests)
   - [Test Plan](#test-plan)
     - [Prerequisite testing updates](#prerequisite-testing-updates)
@@ -126,6 +120,14 @@ tags, and then generate with `hack/update-toc.sh`.
 - [Drawbacks](#drawbacks)
 - [Alternatives](#alternatives)
   - [Introduce a new field](#introduce-a-new-field)
+  - [Allow NominatedNodeName to be set by other components](#allow-nominatednodename-to-be-set-by-other-components)
+    - [Motivation: External components want to specify a preferred pod placement](#motivation-external-components-want-to-specify-a-preferred-pod-placement)
+    - [Goals](#goals-1)
+    - [Non-Goals](#non-goals-1)
+    - [User stories](#user-stories)
+    - [Risks and Mitigations](#risks-and-mitigations-1)
+    - [Design Details](#design-details-1)
+    - [Test plan: Integration tests](#test-plan-integration-tests)
 - [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
@@ -192,10 +194,9 @@ updates.
 [documentation style guide]: https://github.com/kubernetes/community/blob/master/contributors/guide/style-guide.md
 -->
 
-Use `NominatedNodeName` to express pod placement, expected by the scheduler or expected by other components.
+Use `NominatedNodeName` to express the pod placement expected by the scheduler.
 Besides of using `NominatedNodeName` to indicate ongoing preemption, the scheduler can specify it at the beginning of a binding cycle to show an expected pod placement to other components.
-And, also other components can put `NominatedNodeName` on pending pods to indicate the pod is preferred to be scheduled on a specific node.
 
 ## Motivation
@@ -212,16 +213,6 @@ misunderstands the node is low-utilized (because the scheduler keeps the place o
 We can expose those internal reservations with `NominatedNodeName` so that external components can take a more appropriate action based on the expected pod placement.
 
-### External components want to specify a preferred pod placement
-
-The ClusterAutoscaler or Karpenter internally calculate the pod placement,
-and create new nodes or un-gate pods based on the calculation result.
-The shape and count of newly added nodes assumes some particular pod placement
-and the pods may not fit or satisfy scheduling constraints if placed differently.
-
-By specifying their expectation on `NominatedNodeName`, the scheduler can first check
-whether the pod can go to the nominated node, reducing end-to-end scheduling time.
-
 ### Retain the scheduling decision
 
 At the binding cycle (e.g., PreBind), some plugins could handle something (e.g., volumes, devices) based on the pod's scheduling result.
 
@@ -234,13 +225,12 @@ and the PreBind plugins can restart their work from where they were before the r
 ### Goals
 
 - The scheduler will use `NominatedNodeName` to express where the pod is going to go before actually binding them.
-- Make sure external components can use `NominatedNodeName` to express where they prefer the pod is going to.
-  - Probably, you can do this with a today's scheduler as well. This proposal wants to discuss/make sure if it actually works, and then add tests etc.
 
 ### Non-Goals
 
-- External components can enforce the scheduler to pick up a specific node via `NominatedNodeName`.
-  - `NominatedNodeName` is just a hint for scheduler and doesn't represent a hard requirement
+- External components can suggest a specific node to kube-scheduler using `NominatedNodeName`.
+  - This is not in scope of this feature for the time being. See the alternatives section for more details.
+
 
 ## Proposal
@@ -249,8 +239,6 @@ Here is the all use cases of NominatedNodeNames that we're taking into consideration:
 - The scheduler puts it after the preemption (already implemented)
 - The scheduler puts it at the beginning of binding cycles (only if the binding cycles involve PreBind or WaitOnPermit phase)
-- The ClusterAutoscaler or Karpenter puts it after creating a new node for pending pod(s) so that the scheduler
-  can utilize the result of scheduling simulations already made by those components
 
 (Possibly, our future initiative around the workload scheduling (including gang scheduling) can also utilize it, but we don't discuss it here because it's not yet concrete.)
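+
+For illustration, this is how a consumer of the field (e.g. ClusterAutoscaler evaluating a
+scale down, as in Story 1 below) could treat a nomination as a capacity reservation. A minimal
+sketch, assuming pending pods come from a standard informer; names are illustrative only:
+
+```go
+package example
+
+import v1 "k8s.io/api/core/v1"
+
+// nodeHasNominatedPods reports whether any still-unbound pod is nominated
+// to the given node, in which case the node's free capacity should be
+// treated as reserved rather than reclaimable.
+func nodeHasNominatedPods(nodeName string, pendingPods []*v1.Pod) bool {
+	for _, p := range pendingPods {
+		if p.Spec.NodeName == "" && p.Status.NominatedNodeName == nodeName {
+			return true
+		}
+	}
+	return false
+}
+```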
@@ -274,33 +262,15 @@ the work that has already been done and increase the end-to-end pod startup late
 We need a mechanism to be able to resume the already started work in majority of such situations.
 
-#### Story 3: ClusterAutoscaler or Karpenter can influence scheduling decisions
-
-ClusterAutoscaler or Karpenter perform scheduling simulations to decide what nodes should be
-added to make pending pods schedulable. Their decisions assume a certain placement - if pending
-pods are placed differently, they may not fit on the newly added nodes or may not satisfy their
-scheduling constraints.
-
-In order to improve the end-to-end pod startup latency when cluster scale-up is needed, we need a
-mechanism to communicate the results of scheduling simulations from ClusterAutoscaler or Karpenter
-to scheduler.
-
-#### Story 4: Kueue specifies `NominatedNodeName` to indicate where it prefers pods being scheduled to
-
-Kueue supports scheduling features that are not (yet) supported in core scheduling, such as topology-aware scheduling.
-When it determines the optimal placement, it needs a mechanism to pass that information to the scheduler.
-Currently it is using NodeSelector to enforce placement of pods and only then ungates the pods. Scheduler doesn't take that information into account until pods are ungated and can schedule other pods in those places in the meantime.
-It would be beneficial to pass that information to scheduler sooner, as well as allow scheduler to change the decision if the topology constraints are just the soft ones.
 
 ### Risks and Mitigations
 
-#### NominatedNodeName can already be set by other components now.
+#### NominatedNodeName can be set by other components now.
 
-There aren't any guardrails preventing other components from setting NominatedNodeName now.
+There aren't any guardrails preventing other components from setting `NominatedNodeName` now.
 In such cases, the semantic is not well defined now and the outcome of it may not match user
 expectations.
 
-This KEP is a step towards clarifying this semantic instead of maintaining status-quo.
+This KEP is a step towards clarifying this semantic and the scheduler's behavior instead of maintaining the status quo.
 
 #### Confusing semantics of `NominatedNodeName`
 
@@ -308,54 +278,35 @@ Up until now, `NominatedNodeName` was expressing the decision made by scheduler
 pod on a given node, while waiting for the preemption. The decision could be changed later so
 it didn't have to be a final decision, but it was describing the "current plan of record".
 
-If we put more components into the picture (e.g. ClusterAutoscaler and Karpenter), we effectively
-get a more complex state machine, with the following states:
+If we add the case of delayed binding, we effectively get a state machine with the following states:
 
 1. pending pod
-2. pod proposed to node (by external component) [not approved by scheduler]
-3. pod nominated to node (based on external proposal) and waiting for node (e.g. being created & ready)
-4. pod nominated to node and waiting for preemption
-5. pod allocated to node and waiting for binding
-6. pod bound
+2. pod nominated to node and waiting for preemption
+3. pod allocated to node and waiting for binding
+4. pod bound
 
-The important part is that if we decide to use `NominatedNodeName` to store all that information,
+The important part is that if we decide to use `NominatedNodeName` to store information for both (2) and (3),
 we're effectively losing the ability to distinguish between those states.
 
 We may argue that as long as the decision was made by the scheduler, the exact reason and state
 probably isn't that important - the content of `NominatedNodeName` can be interpreted as
 "current plan of record for this pod from scheduler perspective".
 
-But the `pod proposed to node` state is visibly different. In particular external components
-may overallocate the pods on the node, those pods may not match scheduling constraints etc.
-We can't claim that it's a current plan of record of the scheduler. It's a hint that we want
-scheduler to take into account.
-
-In other words, from state machine perspective, there is visible difference in who sets the
-`NominatedNodeName`. If it was scheduler, it may mean that there is already ongoing preemption.
-If it was an external component, it's just a hint that may even be ignored.
-However, if we look from consumption point of view - these are effectively the same. We want
+If we look from the consumption point of view - these are effectively the same. We want
 to expose the information, that as of now a given node is considered as a potential placement
 for a given pod. It may change, but for now that's what considered.
 
-Eventually, we may introduce some state machine, where external components could also approve
-schedulers decisions by exposing these states more concretely via the API. But we will be
-able to achieve it in an additive way by exposing the information about the state.
-
-However, we don't need this state machine now, so we just introduce the following rules:
-- Any component can set `NominatedNodeName` if it is currently unset.
+On top of the simple state machine above, we introduce the following rules:
 - Scheduler is allowed to overwrite `NominatedNodeName` at any time in case of preemption or
 the beginning of the binding cycle.
-- No external components can overwrite `NominatedNodeName` set by a different component.
-- If `NominatedNodeName` is set, the component who set it is responsible for updating or
-clearing it if its plans were changed (using PUT or APPLY to ensure it won't conflict with
-potential update from scheduler) to reflect the new hint.
+- No external components are expected to overwrite `NominatedNodeName` set by the scheduler (although technically there are no guardrails).
 
 Moreover:
 - Regardless of who set `NominatedNodeName`, its readers should always take that into
 consideration (e.g. ClusterAutoscaler or Karpenter when trying to scale down nodes).
-- In case of faulty components (e.g. overallocation the nodes), these decisions will
-simply be rejected by the scheduler (although the `NominatedNodeName` will remain set
-for the unschedulability period).
+- In case of faulty components (e.g. overallocation of nodes), these decisions will
+simply be rejected by the scheduler (and the `NominatedNodeName` will be cleared before
+moving the rejected pod to unschedulable).
 
 #### Increasing the load to kube-apiserver
 
@@ -372,14 +323,6 @@ For cases with delayed binding, we make an argument that the additional calls ar
 there are other calls related to those operations (e.g. PV creation, PVC binding, etc.)
 - so the overhead of setting `NNN` is a smaller percentage of the whole e2e pod startup flow.
 
-#### Race condition
-
-If an external component adds `NominatedNodeName` to the pod that is going through a scheduling cycle,
-`NominatedNodeName` isn't taken into account (of course), and the pod could be scheduled onto a different node.
-
-But, this should be fine because, either way, we're not saying `NominatedNodeName` is something
-forcing the scheduler to pick up the node, rather it's just a preference.
-
 #### Confusion if `NominatedNodeName` is different from `NodeName` after all
 
 If an external component adds `NominatedNodeName`, but the scheduler picks up a different node,
@@ -393,42 +336,6 @@ We will update the logic so that `NominatedNodeName` field is cleared during `bi
 We believe that ensuring that `NominatedNodeName` can't be set after the pod is already bound is
 niche enough feature that doesn't justify an attempt to strengthening the validation.
 
-#### What if there are multiple components that could set `NominatedNodeName` on the same pod
-
-It's not something newly introduced by this KEP because anyone can set NominatedNodeName today,
-but discuss here to form our suggestion.
-
-Multiple controllers might keep overwriting NominatedNodeName that is set by the others.
-Of course, we can regard that just as user's fault though, that'd be undesired situation.
-
-There could be several ideas to mitigate, or even completely solve by adding a new API.
-But, we wouldn't like to introduce any complexity right now because we're not sure how many users would start using this,
-and hit this problem.
-
-So, for now, we'll just document it somewhere as a risk, unrecommended situation,
-and in the future, we'll consider something
-if we actually observe this problem getting bigger by many people starting using it.
-
-#### Invalid `NominatedNodeName` prevents the pod from scheduling
-
-Currently, `NominatedNodeName` field is cleared at the end of failed scheduling cycle if it found the nominated node
-unschedulable for the pod. However, in order to make it work for ClusterAutoscaler and Karpenter, we will remove this
-logic, and `NominatedNodeName` could stay on the node forever, despite not being a valid suggestions anymore.
-As an example, imagine a scenario, where ClusterAutoscaler created a new node a nominated a pod to it, but
-before that pod was scheduled, a new higher-priority pod appeared and used the space on that newly created node.
-In such a case, it all worked as expected, but we ended up with `NominatedNodeName` set uncorrectly.
-
-As a mitigation:
-- an external component that originally set the `NominatedNodeName` is responsible for clearing or updating
-the field to reflect the state
-- if it won't happen, given that `NominatedNodeName` is just a hint for scheduler, it will continue to processing
-the pod just having a minor performance hit (trying to process a node set via `NNN` first, but falling back to
-all nodes anyway). We claim that the additional cost of checking `NominatedNodeName` first is acceptable (even
-for big clusters where the performance is critical) because it's just one iteration of Filter plugins
-(e.g., if you have 1000 nodes and 16 parallelism (default value), the scheduler needs around 62 iterations of
-Filter plugins, approximately. So, adding one iteration on top of that doesn't matter).
-
-
 ## Design Details
 #### Introduce a new field
 Instead of using `NominatedNodeName` to let external components to hint scheduler, we considered
@@ -799,6 +676,219 @@ external components it doesn't eliminate the potential races either. If in the f
 that distinsuighing that is needed, we believe that we can model such state muchine
 with an additional field in a purely additive way.
 
+#### Allow NominatedNodeName to be set by other components
+
+In v1.35 this feature is being narrowed down to one-way communication: only kube-scheduler is allowed to set `NominatedNodeName`,
+while for other components this field should be read-only.
+
+The alternative to consider for future releases is that other components can set `NominatedNodeName` in pending pods to indicate
+that the pod is preferred to be scheduled on a specific node.
+
+##### Motivation: External components want to specify a preferred pod placement
+
+The ClusterAutoscaler or Karpenter internally calculate the pod placement,
+and create new nodes or un-gate pods based on the calculation result.
+The shape and count of newly added nodes assumes some particular pod placement
+and the pods may not fit or satisfy scheduling constraints if placed differently.
+
+By specifying their expectation on `NominatedNodeName`, the scheduler can first check
+whether the pod can go to the nominated node, reducing end-to-end scheduling time.
+
+##### Goals
+
+- Make sure external components can use `NominatedNodeName` to express where they prefer the pod to go.
+  - Probably, you can do this with today's scheduler as well. This proposal wants to discuss/make sure if it actually works, and then add tests etc.
+
+##### Non-Goals
+
+- External components being able to force the scheduler to pick a specific node via `NominatedNodeName`.
+  - `NominatedNodeName` is just a hint for the scheduler and doesn't represent a hard requirement.
+
+##### User stories
+
+The use case supported by this feature is:
+- The ClusterAutoscaler or Karpenter sets `NominatedNodeName` after creating a new node for pending pod(s), so that the scheduler
+  can utilize the result of scheduling simulations already calculated by those components
+
+###### Story 1: ClusterAutoscaler or Karpenter can influence scheduling decisions
+
+ClusterAutoscaler or Karpenter perform scheduling simulations to decide what nodes should be
+added to make pending pods schedulable. Their decisions assume a certain placement - if pending
+pods are placed differently, they may not fit on the newly added nodes or may not satisfy their
+scheduling constraints.
+
+In order to improve the end-to-end pod startup latency when cluster scale-up is needed, we need a
+mechanism to communicate the results of scheduling simulations from ClusterAutoscaler or Karpenter
+to the scheduler.
+
+###### Story 2: Kueue specifies `NominatedNodeName` to indicate where it prefers pods being scheduled to
+
+Kueue supports scheduling features that are not (yet) supported in core scheduling, such as topology-aware scheduling.
+When it determines the optimal placement, it needs a mechanism to pass that information to the scheduler.
+Currently it is using NodeSelector to enforce placement of pods and only then ungates them. The scheduler doesn't take that information into account until the pods are ungated, and can schedule other pods into those places in the meantime.
+It would be beneficial to pass that information to the scheduler sooner, as well as allow the scheduler to change the decision if the topology constraints are just the soft ones.
The decision could be changed later so +it didn't have to be a final decision, but it was describing the "current plan of record". + +If we put more components into the picture (e.g. ClusterAutoscaler and Karpenter), we effectively +get a more complex state machine, with the following states: + +1. pending pod +2. pod proposed to node (by external component) [not approved by scheduler] +3. pod nominated to node (based on external proposal) and waiting for node (e.g. being created & ready) +4. pod nominated to node and waiting for preemption +5. pod allocated to node and waiting for binding +6. pod bound + +The important part is that if we decide to use `NominatedNodeName` to store all that information, +we're effectively losing the ability to distinguish between those states. + +We may argue that as long as the decision was made by the scheduler, the exact reason and state +probably isn't that important - the content of `NominatedNodeName` can be interpreted as +"current plan of record for this pod from scheduler perspective". + +But the `pod proposed to node` state is visibly different. In particular external components +may overallocate the pods on the node, those pods may not match scheduling constraints etc. +We can't claim that it's a current plan of record of the scheduler. It's a hint that we want +scheduler to take into account. + +In other words, from state machine perspective, there is visible difference in who sets the +`NominatedNodeName`. If it was scheduler, it may mean that there is already ongoing preemption. +If it was an external component, it's just a hint that may even be ignored. +However, if we look from consumption point of view - these are effectively the same. We want +to expose the information, that as of now a given node is considered as a potential placement +for a given pod. It may change, but for now that's what considered. + +Eventually, we may introduce some state machine, where external components could also approve +schedulers decisions by exposing these states more concretely via the API. But we will be +able to achieve it in an additive way by exposing the information about the state. + +However, we don't need this state machine now, so we just introduce the following rules: +- Any component can set `NominatedNodeName` if it is currently unset. +- Scheduler is allowed to overwrite `NominatedNodeName` at any time in case of preemption or +the beginning of the binding cycle. +- No external components can overwrite `NominatedNodeName` set by a different component. +- If `NominatedNodeName` is set, the component who set it is responsible for updating or +clearing it if its plans were changed (using PUT or APPLY to ensure it won't conflict with +potential update from scheduler) to reflect the new hint. + +Moreover: +- Regardless of who set `NominatedNodeName`, its readers should always take that into +consideration (e.g. ClusterAutoscaler or Karpenter when trying to scale down nodes). +- In case of faulty components (e.g. overallocation the nodes), these decisions will +simply be rejected by the scheduler (although the `NominatedNodeName` will remain set +for the unschedulability period). + +###### Race condition + +If an external component adds `NominatedNodeName` to the pod that is going through a scheduling cycle, +`NominatedNodeName` isn't taken into account (of course), and the pod could be scheduled onto a different node. 
+ +But, this should be fine because, either way, we're not saying `NominatedNodeName` is something +forcing the scheduler to pick up the node, rather it's just a preference. + + +###### What if there are multiple components that could set `NominatedNodeName` on the same pod + +It's not something newly introduced by this KEP because anyone can set NominatedNodeName today, +but discuss here to form our suggestion. + +Multiple controllers might keep overwriting NominatedNodeName that is set by the others. +Of course, we can regard that just as user's fault though, that'd be undesired situation. + +There could be several ideas to mitigate, or even completely solve by adding a new API. +But, we wouldn't like to introduce any complexity right now because we're not sure how many users would start using this, +and hit this problem. + +So, for now, we'll just document it somewhere as a risk, unrecommended situation, +and in the future, we'll consider something +if we actually observe this problem getting bigger by many people starting using it. + +###### Invalid `NominatedNodeName` prevents the pod from scheduling + +Currently, `NominatedNodeName` field is cleared at the end of failed scheduling cycle if it found the nominated node +unschedulable for the pod. However, in order to make it work for ClusterAutoscaler and Karpenter, we will remove this +logic, and `NominatedNodeName` could stay on the node forever, despite not being a valid suggestions anymore. +As an example, imagine a scenario, where ClusterAutoscaler created a new node and nominated a pod to it, but +before that pod was scheduled, a new higher-priority pod appeared and used the space on that newly created node. +In such a case, it all worked as expected, but we ended up with `NominatedNodeName` set uncorrectly. + +As a mitigation: +- an external component that originally set the `NominatedNodeName` is responsible for clearing or updating +the field to reflect the state +- if it won't happen, given that `NominatedNodeName` is just a hint for scheduler, it will continue to processing +the pod just having a minor performance hit (trying to process a node set via `NNN` first, but falling back to +all nodes anyway). We claim that the additional cost of checking `NominatedNodeName` first is acceptable (even +for big clusters where the performance is critical) because it's just one iteration of Filter plugins +(e.g., if you have 1000 nodes and 16 parallelism (default value), the scheduler needs around 62 iterations of +Filter plugins, approximately. So, adding one iteration on top of that doesn't matter). + +##### Design Details + +If we take into account external components setting `NominatedNodeName`, the design needs to be extended as following: + +###### External components put `NominatedNodeName` + +There aren't any restrictions preventing other components from setting NominatedNodeName as of now. +However, we don't have any validation of how that currently works. +To support the usecases mentioned above we will adjust the scheduler to do the following: +- if NominatedNodeName is set, but corresponding Node doesn't exist, kube-scheduler will NOT clear it when the pod is unschedulable [assuming that a node might appear soon] +- We will rely on the fact that a pod with NominatedNodeName set is resulting in the in-memory reservation for requested resources. +Higher-priority pods can ignore it, but pods with equal or lower priority don't have access to these resources. 
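+The core assertion, roughly (a sketch only; the real tests would build on the existing
+scheduler integration test utilities rather than raw polling):
+
+```go
+package example
+
+import (
+	"context"
+	"time"
+
+	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+	"k8s.io/apimachinery/pkg/util/wait"
+	"k8s.io/client-go/kubernetes"
+)
+
+// waitForPodOnNominatedNode verifies that a pod nominated by a simulated
+// external component eventually gets bound to exactly that node.
+func waitForPodOnNominatedNode(ctx context.Context, cs kubernetes.Interface, ns, name, node string) error {
+	return wait.PollUntilContextTimeout(ctx, time.Second, 30*time.Second, false,
+		func(ctx context.Context) (bool, error) {
+			pod, err := cs.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
+			if err != nil {
+				return false, err
+			}
+			return pod.Spec.NodeName == node, nil
+		})
+}
+```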
+
+###### The scheduler only modifies `NominatedNodeName`, does not clear it in any case
+
+As of now, the scheduler clears the `NominatedNodeName` field at the end of a failed scheduling cycle if it
+found the nominated node unschedulable for the pod. However, this won't work if ClusterAutoscaler or Karpenter
+were to set it during scale up.
+
+In the most basic case, the node may not yet exist, so clearly it would be unschedulable for the pod.
+However, the potential mitigation of ignoring non-existing nodes wouldn't work either in the following case:
+
+1. Pods are unschedulable. For simplicity, let's say all of them are rejected by the NodeResourceFit plugin. (i.e., no node has enough CPU/memory for the pod's request)
+2. CA finds them, calculates the nodes necessary to be created
+3. CA puts `NominatedNodeName` on each pod
+4. The scheduler keeps trying to schedule those pending pods; here, let's say they stay unschedulable (no cluster event happens that could make the pods schedulable) until the node is created.
+5. The nodes are created, and registered to kube-apiserver. Let's say, at this point, the nodes have un-ready taints.
+6. The scheduler observes the `Node/Create` event, the `NodeResourceFit` plugin QHint returns `Queue`, and those pending pods are requeued to activeQ.
+7. The scheduling cycle starts handling those pending pods.
+8. However, because the nodes have un-ready taints, the pods are rejected by the `TaintToleration` plugin.
+9. The scheduler clears `NominatedNodeName` because it finds the nominated node (= the new node) unschedulable.
+
+In order to avoid the above scenarios, we simply remove the clearing logic. This means that the scheduler
+will never clear the `NominatedNodeName` - it may update it though if, based on its scheduling algorithm,
+it decides to ignore the current value of `NominatedNodeName` and put it on a different node (either to
+signal the preemption, or record the decision before binding as described in the above sections).
+
+##### Test plan: Integration tests
+
+We're going to add these integration tests:
+- The scheduler doesn't clear NominatedNodeName when the nominated node isn't available and the pod is unschedulable.
+  - And, once the node appears, the pod with NNN set is scheduled there (even if there are other equal-priority pending pods).
+
+Also, with [scheduler-perf](https://github.com/kubernetes/kubernetes/tree/master/test/integration/scheduler_perf), we'll make sure the scheduling throughputs for pods that go through Permit or PreBind don't regress too much.
+We need to accept a small regression to some extent since there'll be a new API call to set NominatedNodeName.
+But, as discussed, assuming PreBind already makes some API calls for the pods, the regression there should be small.
+
 
 ## Infrastructure Needed (Optional)