From c0bce5c6579d1c55789c0b7c08a54cedeb6da8f7 Mon Sep 17 00:00:00 2001
From: Ania Borowiec
Date: Mon, 6 Oct 2025 10:03:52 +0000
Subject: [PATCH 1/5] prr questionnaire updated

---
 .../5278-nominated-node-name-for-expectation/README.md | 6 +++---
 .../5278-nominated-node-name-for-expectation/kep.yaml  | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/keps/sig-scheduling/5278-nominated-node-name-for-expectation/README.md b/keps/sig-scheduling/5278-nominated-node-name-for-expectation/README.md
index 6d9a13f4400..31be7be3194 100644
--- a/keps/sig-scheduling/5278-nominated-node-name-for-expectation/README.md
+++ b/keps/sig-scheduling/5278-nominated-node-name-for-expectation/README.md
@@ -630,7 +630,7 @@ Pods that are processed by Permit or PreBind plugins get NominatedNodeName durin
 
 ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
 
 Yes.
-The feature can be disabled in Alpha version by restarting the kube-scheduler and kube-apiserver with the feature-gate off.
+The feature can be disabled in Beta version by restarting the kube-scheduler and kube-apiserver with the feature-gate off.
 
 ###### What happens if we reenable the feature if it was previously rolled back?
@@ -752,8 +752,8 @@ No.
 
 ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
 
 Yes - but it should be negligible impact.
-The memory usage in kube-scheduler is supposed to increase by external components starting to use this
-because when `NominatedNodeName` is added on the pods, the scheduler's internal component called `nominator` has to record them so that scheduling cycles can refer to them as necessary.
+The memory usage in kube-scheduler is supposed to increase because when `NominatedNodeName` is added to the pods, the scheduler's
+internal component called `nominator` has to record them so that scheduling cycles can refer to them as necessary.
 
 ###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
diff --git a/keps/sig-scheduling/5278-nominated-node-name-for-expectation/kep.yaml b/keps/sig-scheduling/5278-nominated-node-name-for-expectation/kep.yaml
index 212a884713f..66154d24781 100644
--- a/keps/sig-scheduling/5278-nominated-node-name-for-expectation/kep.yaml
+++ b/keps/sig-scheduling/5278-nominated-node-name-for-expectation/kep.yaml
@@ -17,7 +17,7 @@ approvers:
 
 stage: alpha
 
-latest-milestone: "v1.34"
+latest-milestone: "v1.35"
 
 milestone:
   alpha: "v1.34"

From e0345cd755773ece7e4ac3d6da1f93404dfd777b Mon Sep 17 00:00:00 2001
From: Ania Borowiec
Date: Tue, 7 Oct 2025 09:14:36 +0000
Subject: [PATCH 2/5] prr approver updated

---
 keps/prod-readiness/sig-scheduling/5278.yaml | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/keps/prod-readiness/sig-scheduling/5278.yaml b/keps/prod-readiness/sig-scheduling/5278.yaml
index b3de9d3e79c..292cbad97c4 100644
--- a/keps/prod-readiness/sig-scheduling/5278.yaml
+++ b/keps/prod-readiness/sig-scheduling/5278.yaml
@@ -1,3 +1,5 @@
 kep-number: 5278
 alpha:
   approver: "@soltysh"
+beta:
+  approver: "@soltysh"

From 177e10577c395649769ea65b65f0a87ac4d20675 Mon Sep 17 00:00:00 2001
From: Ania Borowiec
Date: Wed, 8 Oct 2025 11:03:48 +0000
Subject: [PATCH 3/5] update stage in kep.yaml

---
 .../5278-nominated-node-name-for-expectation/kep.yaml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/keps/sig-scheduling/5278-nominated-node-name-for-expectation/kep.yaml b/keps/sig-scheduling/5278-nominated-node-name-for-expectation/kep.yaml
index 66154d24781..4abac9766ea 100644
--- a/keps/sig-scheduling/5278-nominated-node-name-for-expectation/kep.yaml
+++ b/keps/sig-scheduling/5278-nominated-node-name-for-expectation/kep.yaml
@@ -15,7 +15,7 @@ approvers:
   - "@macsko"
   - "@dom4ha"
 
-stage: alpha
+stage: beta
 
 latest-milestone: "v1.35"

From 7d504ebb5dcd4ea5ec0c1780857e596fd35f4a5c Mon Sep 17 00:00:00 2001
From: Ania Borowiec
Date: Thu, 9 Oct 2025 10:40:01 +0000
Subject: [PATCH 4/5] update kep to narrow down the scope

---
 .../README.md | 406 +++++++++++-------
 1 file changed, 248 insertions(+), 158 deletions(-)

diff --git a/keps/sig-scheduling/5278-nominated-node-name-for-expectation/README.md b/keps/sig-scheduling/5278-nominated-node-name-for-expectation/README.md
index 31be7be3194..f811ffce52b 100644
--- a/keps/sig-scheduling/5278-nominated-node-name-for-expectation/README.md
+++ b/keps/sig-scheduling/5278-nominated-node-name-for-expectation/README.md
@@ -81,7 +81,6 @@ tags, and then generate with `hack/update-toc.sh`.
 - [Summary](#summary)
 - [Motivation](#motivation)
   - [External components need to know where the pod is going to be bound](#external-components-need-to-know-where-the-pod-is-going-to-be-bound)
-  - [External components want to specify a preferred pod placement](#external-components-want-to-specify-a-preferred-pod-placement)
   - [Retain the scheduling decision](#retain-the-scheduling-decision)
   - [Goals](#goals)
   - [Non-Goals](#non-goals)
 - [Proposal](#proposal)
   - [User Stories (Optional)](#user-stories-optional)
     - [Story 1: Prevent inappropriate scale downs by Cluster Autoscaler](#story-1-prevent-inappropriate-scale-downs-by-cluster-autoscaler)
     - [Story 2: Scheduler can resume its work after restart](#story-2-scheduler-can-resume-its-work-after-restart)
-    - [Story 3: ClusterAutoscaler or Karpenter can influence scheduling decisions](#story-3-clusterautoscaler-or-karpenter-can-influence-scheduling-decisions)
-    - [Story 4: Kueue specifies NominatedNodeName to indicate where it prefers pods being scheduled to](#story-4-kueue-specifies-nominatednodename-to-indicate-where-it-prefers-pods-being-scheduled-to)
   - [Risks and Mitigations](#risks-and-mitigations)
-    - [NominatedNodeName can already be set by other components now.](#nominatednodename-can-already-be-set-by-other-components-now)
+    - [NominatedNodeName can be set by other components now.](#nominatednodename-can-be-set-by-other-components-now)
     - [Confusing semantics of NominatedNodeName](#confusing-semantics-of-nominatednodename)
     - [Increasing the load to kube-apiserver](#increasing-the-load-to-kube-apiserver)
-    - [Race condition](#race-condition)
     - [Confusion if NominatedNodeName is different from NodeName after all](#confusion-if-nominatednodename-is-different-from-nodename-after-all)
-    - [What if there are multiple components that could set NominatedNodeName on the same pod](#what-if-there-are-multiple-components-that-could-set-nominatednodename-on-the-same-pod)
-    - [Invalid NominatedNodeName prevents the pod from scheduling](#invalid-nominatednodename-prevents-the-pod-from-scheduling)
 - [Design Details](#design-details)
   - [The scheduler puts NominatedNodeName](#the-scheduler-puts-nominatednodename)
   - [External components put NominatedNodeName](#external-components-put-nominatednodename)
   - [The scheduler's cache for NominatedNodeName](#the-schedulers-cache-for-nominatednodename)
-  - [The scheduler only modifies NominatedNodeName, does not clear it in any case](#the-scheduler-only-modifies-nominatednodename-does-not-clear-it-in-any-case)
+  - [The scheduler clears NominatedNodeName after scheduling failure](#the-scheduler-clears-nominatednodename-after-scheduling-failure)
   - [Kube-apiserver clears NominatedNodeName when receiving binding requests](#kube-apiserver-clears-nominatednodename-when-receiving-binding-requests)
   - [Test Plan](#test-plan)
     - [Prerequisite testing updates](#prerequisite-testing-updates)
@@ -126,6 +120,14 @@ tags, and then generate with `hack/update-toc.sh`.
 - [Drawbacks](#drawbacks)
 - [Alternatives](#alternatives)
   - [Introduce a new field](#introduce-a-new-field)
+  - [Allow NominatedNodeName to be set by other components](#allow-nominatednodename-to-be-set-by-other-components)
+    - [Motivation: External components want to specify a preferred pod placement](#motivation-external-components-want-to-specify-a-preferred-pod-placement)
+    - [Goals](#goals-1)
+    - [Non-Goals](#non-goals-1)
+    - [User stories](#user-stories)
+    - [Risks and Mitigations](#risks-and-mitigations-1)
+    - [Design Details](#design-details-1)
+    - [Test plan: Integration tests](#test-plan-integration-tests)
 - [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
@@ -192,10 +194,9 @@ updates.
 [documentation style guide]: https://github.com/kubernetes/community/blob/master/contributors/guide/style-guide.md
 -->
 
-Use `NominatedNodeName` to express pod placement, expected by the scheduler or expected by other components.
+Use `NominatedNodeName` to express the pod placement expected by the scheduler.
 Besides of using `NominatedNodeName` to indicate ongoing preemption, the scheduler can specify it at the beginning of a binding cycle to show an expected pod placement to other components.
-And, also other components can put `NominatedNodeName` on pending pods to indicate the pod is preferred to be scheduled on a specific node.
 
 ## Motivation
@@ -212,16 +213,6 @@ misunderstands the node is low-utilized (because the scheduler keeps the place o
 We can expose those internal reservations with `NominatedNodeName` so that external components can take a more appropriate action based on the expected pod placement.
 
-### External components want to specify a preferred pod placement
-
-The ClusterAutoscaler or Karpenter internally calculate the pod placement,
-and create new nodes or un-gate pods based on the calculation result.
-The shape and count of newly added nodes assumes some particular pod placement
-and the pods may not fit or satisfy scheduling constraints if placed differently.
-
-By specifying their expectation on `NominatedNodeName`, the scheduler can first check
-whether the pod can go to the nominated node, reducing end-to-end scheduling time.
-
 ### Retain the scheduling decision
 
 At the binding cycle (e.g., PreBind), some plugins could handle something (e.g., volumes, devices) based on the pod's scheduling result.
 
@@ -234,13 +225,12 @@ and the PreBind plugins can restart their work from where they were before the r
 ### Goals
 
 - The scheduler will use `NominatedNodeName` to express where the pod is going to go before actually binding them.
-- Make sure external components can use `NominatedNodeName` to express where they prefer the pod is going to.
-  - Probably, you can do this with a today's scheduler as well. This proposal wants to discuss/make sure if it actually works, and then add tests etc.
 
 ### Non-Goals
 
-- External components can enforce the scheduler to pick up a specific node via `NominatedNodeName`.
-  - `NominatedNodeName` is just a hint for scheduler and doesn't represent a hard requirement
+- External components can suggest a specific node to kube-scheduler using `NominatedNodeName`.
+  - This is not in scope of this feature for the time being. See the alternatives section for more details.
+
 
 ## Proposal
@@ -249,8 +239,6 @@ Here is the all use cases of NominatedNodeNames that we're taking into consideration:
 - The scheduler puts it after the preemption (already implemented)
 - The scheduler puts it at the beginning of binding cycles (only if the binding cycles involve PreBind or WaitOnPermit phase)
-- The ClusterAutoscaler or Karpenter puts it after creating a new node for pending pod(s) so that the scheduler
-  can utilize the result of scheduling simulations already made by those components
 
 (Possibly, our future initiative around the workload scheduling (including gang scheduling) can also utilize it, but we don't discuss it here because it's not yet concrete.)
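+
+For illustration, this is how a consumer of the field (e.g. ClusterAutoscaler evaluating a
+scale down, as in Story 1 below) could treat a nomination as a capacity reservation. A minimal
+sketch, assuming pending pods come from a standard informer; names are illustrative only:
+
+```go
+package example
+
+import v1 "k8s.io/api/core/v1"
+
+// nodeHasNominatedPods reports whether any still-unbound pod is nominated
+// to the given node, in which case the node's free capacity should be
+// treated as reserved rather than reclaimable.
+func nodeHasNominatedPods(nodeName string, pendingPods []*v1.Pod) bool {
+	for _, p := range pendingPods {
+		if p.Spec.NodeName == "" && p.Status.NominatedNodeName == nodeName {
+			return true
+		}
+	}
+	return false
+}
+```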
@@ -274,33 +262,15 @@ the work that has already been done and increase the end-to-end pod startup late
 We need a mechanism to be able to resume the already started work in majority of such situations.
 
-#### Story 3: ClusterAutoscaler or Karpenter can influence scheduling decisions
-
-ClusterAutoscaler or Karpenter perform scheduling simulations to decide what nodes should be
-added to make pending pods schedulable. Their decisions assume a certain placement - if pending
-pods are placed differently, they may not fit on the newly added nodes or may not satisfy their
-scheduling constraints.
-
-In order to improve the end-to-end pod startup latency when cluster scale-up is needed, we need a
-mechanism to communicate the results of scheduling simulations from ClusterAutoscaler or Karpenter
-to scheduler.
-
-#### Story 4: Kueue specifies `NominatedNodeName` to indicate where it prefers pods being scheduled to
-
-Kueue supports scheduling features that are not (yet) supported in core scheduling, such as topology-aware scheduling.
-When it determines the optimal placement, it needs a mechanism to pass that information to the scheduler.
-Currently it is using NodeSelector to enforce placement of pods and only then ungates the pods. Scheduler doesn't take that information into account until pods are ungated and can schedule other pods in those places in the meantime.
-It would be beneficial to pass that information to scheduler sooner, as well as allow scheduler to change the decision if the topology constraints are just the soft ones.
 
 ### Risks and Mitigations
 
-#### NominatedNodeName can already be set by other components now.
+#### NominatedNodeName can be set by other components now.
 
-There aren't any guardrails preventing other components from setting NominatedNodeName now.
+There aren't any guardrails preventing other components from setting `NominatedNodeName` now.
 In such cases, the semantic is not well defined now and the outcome of it may not match user
 expectations.
 
-This KEP is a step towards clarifying this semantic instead of maintaining status-quo.
+This KEP is a step towards clarifying this semantic and the scheduler's behavior instead of maintaining the status quo.
 
 #### Confusing semantics of `NominatedNodeName`
 
@@ -308,54 +278,35 @@ Up until now, `NominatedNodeName` was expressing the decision made by scheduler
 pod on a given node, while waiting for the preemption. The decision could be changed later so
 it didn't have to be a final decision, but it was describing the "current plan of record".
 
-If we put more components into the picture (e.g. ClusterAutoscaler and Karpenter), we effectively
-get a more complex state machine, with the following states:
+If we add the case of delayed binding, we effectively get a state machine with the following states:
 
 1. pending pod
-2. pod proposed to node (by external component) [not approved by scheduler]
-3. pod nominated to node (based on external proposal) and waiting for node (e.g. being created & ready)
-4. pod nominated to node and waiting for preemption
-5. pod allocated to node and waiting for binding
-6. pod bound
+2. pod nominated to node and waiting for preemption
+3. pod allocated to node and waiting for binding
+4. pod bound
 
-The important part is that if we decide to use `NominatedNodeName` to store all that information,
+The important part is that if we decide to use `NominatedNodeName` to store information for both (2) and (3),
 we're effectively losing the ability to distinguish between those states.
 
 We may argue that as long as the decision was made by the scheduler, the exact reason and state
 probably isn't that important - the content of `NominatedNodeName` can be interpreted as
 "current plan of record for this pod from scheduler perspective".
 
-But the `pod proposed to node` state is visibly different. In particular external components
-may overallocate the pods on the node, those pods may not match scheduling constraints etc.
-We can't claim that it's a current plan of record of the scheduler. It's a hint that we want
-scheduler to take into account.
-
-In other words, from state machine perspective, there is visible difference in who sets the
-`NominatedNodeName`. If it was scheduler, it may mean that there is already ongoing preemption.
-If it was an external component, it's just a hint that may even be ignored.
-However, if we look from consumption point of view - these are effectively the same. We want
+If we look from the consumption point of view - these are effectively the same. We want
 to expose the information, that as of now a given node is considered as a potential placement
 for a given pod. It may change, but for now that's what considered.
 
-Eventually, we may introduce some state machine, where external components could also approve
-schedulers decisions by exposing these states more concretely via the API. But we will be
-able to achieve it in an additive way by exposing the information about the state.
-
-However, we don't need this state machine now, so we just introduce the following rules:
-- Any component can set `NominatedNodeName` if it is currently unset.
+On top of the simple state machine above, we introduce the following rules:
 - Scheduler is allowed to overwrite `NominatedNodeName` at any time in case of preemption or
 the beginning of the binding cycle.
-- No external components can overwrite `NominatedNodeName` set by a different component.
-- If `NominatedNodeName` is set, the component who set it is responsible for updating or
-clearing it if its plans were changed (using PUT or APPLY to ensure it won't conflict with
-potential update from scheduler) to reflect the new hint.
+- No external components are expected to overwrite `NominatedNodeName` set by the scheduler (although technically there are no guardrails).
 
 Moreover:
 - Regardless of who set `NominatedNodeName`, its readers should always take that into
 consideration (e.g. ClusterAutoscaler or Karpenter when trying to scale down nodes).
-- In case of faulty components (e.g. overallocation the nodes), these decisions will
-simply be rejected by the scheduler (although the `NominatedNodeName` will remain set
-for the unschedulability period).
+- In case of faulty components (e.g. overallocation of nodes), these decisions will
+simply be rejected by the scheduler (and the `NominatedNodeName` will be cleared before
+moving the rejected pod to unschedulable).
 
 #### Increasing the load to kube-apiserver
 
@@ -372,14 +323,6 @@ For cases with delayed binding, we make an argument that the additional calls ar
 there are other calls related to those operations (e.g. PV creation, PVC binding, etc.)
 - so the overhead of setting `NNN` is a smaller percentage of the whole e2e pod startup flow.
 
-#### Race condition
-
-If an external component adds `NominatedNodeName` to the pod that is going through a scheduling cycle,
-`NominatedNodeName` isn't taken into account (of course), and the pod could be scheduled onto a different node.
-
-But, this should be fine because, either way, we're not saying `NominatedNodeName` is something
-forcing the scheduler to pick up the node, rather it's just a preference.
-
 #### Confusion if `NominatedNodeName` is different from `NodeName` after all
 
 If an external component adds `NominatedNodeName`, but the scheduler picks up a different node,
@@ -393,42 +336,6 @@ We will update the logic so that `NominatedNodeName` field is cleared during `bi
 We believe that ensuring that `NominatedNodeName` can't be set after the pod is already bound is
 niche enough feature that doesn't justify an attempt to strengthening the validation.
 
-#### What if there are multiple components that could set `NominatedNodeName` on the same pod
-
-It's not something newly introduced by this KEP because anyone can set NominatedNodeName today,
-but discuss here to form our suggestion.
-
-Multiple controllers might keep overwriting NominatedNodeName that is set by the others.
-Of course, we can regard that just as user's fault though, that'd be undesired situation.
-
-There could be several ideas to mitigate, or even completely solve by adding a new API.
-But, we wouldn't like to introduce any complexity right now because we're not sure how many users would start using this,
-and hit this problem.
-
-So, for now, we'll just document it somewhere as a risk, unrecommended situation,
-and in the future, we'll consider something
-if we actually observe this problem getting bigger by many people starting using it.
-
-#### Invalid `NominatedNodeName` prevents the pod from scheduling
-
-Currently, `NominatedNodeName` field is cleared at the end of failed scheduling cycle if it found the nominated node
-unschedulable for the pod. However, in order to make it work for ClusterAutoscaler and Karpenter, we will remove this
-logic, and `NominatedNodeName` could stay on the node forever, despite not being a valid suggestions anymore.
-As an example, imagine a scenario, where ClusterAutoscaler created a new node a nominated a pod to it, but
-before that pod was scheduled, a new higher-priority pod appeared and used the space on that newly created node.
-In such a case, it all worked as expected, but we ended up with `NominatedNodeName` set uncorrectly.
-
-As a mitigation:
-- an external component that originally set the `NominatedNodeName` is responsible for clearing or updating
-the field to reflect the state
-- if it won't happen, given that `NominatedNodeName` is just a hint for scheduler, it will continue to processing
-the pod just having a minor performance hit (trying to process a node set via `NNN` first, but falling back to
-all nodes anyway). We claim that the additional cost of checking `NominatedNodeName` first is acceptable (even
-for big clusters where the performance is critical) because it's just one iteration of Filter plugins
-(e.g., if you have 1000 nodes and 16 parallelism (default value), the scheduler needs around 62 iterations of
-Filter plugins, approximately. So, adding one iteration on top of that doesn't matter).
-
-
 ## Design Details
 #### Introduce a new field
 Instead of using `NominatedNodeName` to let external components to hint scheduler, we considered
@@ -799,6 +676,219 @@ external components it doesn't eliminate the potential races either. If in the f
 that distinsuighing that is needed, we believe that we can model such state muchine
 with an additional field in a purely additive way.
 
+#### Allow NominatedNodeName to be set by other components
+
+In v1.35 this feature is being narrowed down to one-way communication: only kube-scheduler is allowed to set `NominatedNodeName`,
+while for other components this field should be read-only.
+
+The alternative to consider for future releases is that other components can set `NominatedNodeName` in pending pods to indicate
+that the pod is preferred to be scheduled on a specific node.
+
+##### Motivation: External components want to specify a preferred pod placement
+
+The ClusterAutoscaler or Karpenter internally calculate the pod placement,
+and create new nodes or un-gate pods based on the calculation result.
+The shape and count of newly added nodes assumes some particular pod placement
+and the pods may not fit or satisfy scheduling constraints if placed differently.
+
+By specifying their expectation on `NominatedNodeName`, the scheduler can first check
+whether the pod can go to the nominated node, reducing end-to-end scheduling time.
+
+##### Goals
+
+- Make sure external components can use `NominatedNodeName` to express where they prefer the pod to go.
+  - Probably, you can do this with today's scheduler as well. This proposal wants to discuss/make sure if it actually works, and then add tests etc.
+
+##### Non-Goals
+
+- External components being able to force the scheduler to pick a specific node via `NominatedNodeName`.
+  - `NominatedNodeName` is just a hint for the scheduler and doesn't represent a hard requirement.
+
+##### User stories
+
+The use case supported by this feature is:
+- The ClusterAutoscaler or Karpenter sets `NominatedNodeName` after creating a new node for pending pod(s), so that the scheduler
+  can utilize the result of scheduling simulations already calculated by those components
+
+###### Story 1: ClusterAutoscaler or Karpenter can influence scheduling decisions
+
+ClusterAutoscaler or Karpenter perform scheduling simulations to decide what nodes should be
+added to make pending pods schedulable. Their decisions assume a certain placement - if pending
+pods are placed differently, they may not fit on the newly added nodes or may not satisfy their
+scheduling constraints.
+
+In order to improve the end-to-end pod startup latency when cluster scale-up is needed, we need a
+mechanism to communicate the results of scheduling simulations from ClusterAutoscaler or Karpenter
+to the scheduler.
+
+###### Story 2: Kueue specifies `NominatedNodeName` to indicate where it prefers pods being scheduled to
+
+Kueue supports scheduling features that are not (yet) supported in core scheduling, such as topology-aware scheduling.
+When it determines the optimal placement, it needs a mechanism to pass that information to the scheduler.
+Currently it is using NodeSelector to enforce placement of pods and only then ungates them. The scheduler doesn't take that information into account until the pods are ungated, and can schedule other pods into those places in the meantime.
+It would be beneficial to pass that information to the scheduler sooner, as well as allow the scheduler to change the decision if the topology constraints are just the soft ones.
The decision could be changed later so +it didn't have to be a final decision, but it was describing the "current plan of record". + +If we put more components into the picture (e.g. ClusterAutoscaler and Karpenter), we effectively +get a more complex state machine, with the following states: + +1. pending pod +2. pod proposed to node (by external component) [not approved by scheduler] +3. pod nominated to node (based on external proposal) and waiting for node (e.g. being created & ready) +4. pod nominated to node and waiting for preemption +5. pod allocated to node and waiting for binding +6. pod bound + +The important part is that if we decide to use `NominatedNodeName` to store all that information, +we're effectively losing the ability to distinguish between those states. + +We may argue that as long as the decision was made by the scheduler, the exact reason and state +probably isn't that important - the content of `NominatedNodeName` can be interpreted as +"current plan of record for this pod from scheduler perspective". + +But the `pod proposed to node` state is visibly different. In particular external components +may overallocate the pods on the node, those pods may not match scheduling constraints etc. +We can't claim that it's a current plan of record of the scheduler. It's a hint that we want +scheduler to take into account. + +In other words, from state machine perspective, there is visible difference in who sets the +`NominatedNodeName`. If it was scheduler, it may mean that there is already ongoing preemption. +If it was an external component, it's just a hint that may even be ignored. +However, if we look from consumption point of view - these are effectively the same. We want +to expose the information, that as of now a given node is considered as a potential placement +for a given pod. It may change, but for now that's what considered. + +Eventually, we may introduce some state machine, where external components could also approve +schedulers decisions by exposing these states more concretely via the API. But we will be +able to achieve it in an additive way by exposing the information about the state. + +However, we don't need this state machine now, so we just introduce the following rules: +- Any component can set `NominatedNodeName` if it is currently unset. +- Scheduler is allowed to overwrite `NominatedNodeName` at any time in case of preemption or +the beginning of the binding cycle. +- No external components can overwrite `NominatedNodeName` set by a different component. +- If `NominatedNodeName` is set, the component who set it is responsible for updating or +clearing it if its plans were changed (using PUT or APPLY to ensure it won't conflict with +potential update from scheduler) to reflect the new hint. + +Moreover: +- Regardless of who set `NominatedNodeName`, its readers should always take that into +consideration (e.g. ClusterAutoscaler or Karpenter when trying to scale down nodes). +- In case of faulty components (e.g. overallocation the nodes), these decisions will +simply be rejected by the scheduler (although the `NominatedNodeName` will remain set +for the unschedulability period). + +###### Race condition + +If an external component adds `NominatedNodeName` to the pod that is going through a scheduling cycle, +`NominatedNodeName` isn't taken into account (of course), and the pod could be scheduled onto a different node. 
+ +But, this should be fine because, either way, we're not saying `NominatedNodeName` is something +forcing the scheduler to pick up the node, rather it's just a preference. + + +###### What if there are multiple components that could set `NominatedNodeName` on the same pod + +It's not something newly introduced by this KEP because anyone can set NominatedNodeName today, +but discuss here to form our suggestion. + +Multiple controllers might keep overwriting NominatedNodeName that is set by the others. +Of course, we can regard that just as user's fault though, that'd be undesired situation. + +There could be several ideas to mitigate, or even completely solve by adding a new API. +But, we wouldn't like to introduce any complexity right now because we're not sure how many users would start using this, +and hit this problem. + +So, for now, we'll just document it somewhere as a risk, unrecommended situation, +and in the future, we'll consider something +if we actually observe this problem getting bigger by many people starting using it. + +###### Invalid `NominatedNodeName` prevents the pod from scheduling + +Currently, `NominatedNodeName` field is cleared at the end of failed scheduling cycle if it found the nominated node +unschedulable for the pod. However, in order to make it work for ClusterAutoscaler and Karpenter, we will remove this +logic, and `NominatedNodeName` could stay on the node forever, despite not being a valid suggestions anymore. +As an example, imagine a scenario, where ClusterAutoscaler created a new node and nominated a pod to it, but +before that pod was scheduled, a new higher-priority pod appeared and used the space on that newly created node. +In such a case, it all worked as expected, but we ended up with `NominatedNodeName` set uncorrectly. + +As a mitigation: +- an external component that originally set the `NominatedNodeName` is responsible for clearing or updating +the field to reflect the state +- if it won't happen, given that `NominatedNodeName` is just a hint for scheduler, it will continue to processing +the pod just having a minor performance hit (trying to process a node set via `NNN` first, but falling back to +all nodes anyway). We claim that the additional cost of checking `NominatedNodeName` first is acceptable (even +for big clusters where the performance is critical) because it's just one iteration of Filter plugins +(e.g., if you have 1000 nodes and 16 parallelism (default value), the scheduler needs around 62 iterations of +Filter plugins, approximately. So, adding one iteration on top of that doesn't matter). + +##### Design Details + +If we take into account external components setting `NominatedNodeName`, the design needs to be extended as following: + +###### External components put `NominatedNodeName` + +There aren't any restrictions preventing other components from setting NominatedNodeName as of now. +However, we don't have any validation of how that currently works. +To support the usecases mentioned above we will adjust the scheduler to do the following: +- if NominatedNodeName is set, but corresponding Node doesn't exist, kube-scheduler will NOT clear it when the pod is unschedulable [assuming that a node might appear soon] +- We will rely on the fact that a pod with NominatedNodeName set is resulting in the in-memory reservation for requested resources. +Higher-priority pods can ignore it, but pods with equal or lower priority don't have access to these resources. 
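+The core assertion, roughly (a sketch only; the real tests would build on the existing
+scheduler integration test utilities rather than raw polling):
+
+```go
+package example
+
+import (
+	"context"
+	"time"
+
+	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+	"k8s.io/apimachinery/pkg/util/wait"
+	"k8s.io/client-go/kubernetes"
+)
+
+// waitForPodOnNominatedNode verifies that a pod nominated by a simulated
+// external component eventually gets bound to exactly that node.
+func waitForPodOnNominatedNode(ctx context.Context, cs kubernetes.Interface, ns, name, node string) error {
+	return wait.PollUntilContextTimeout(ctx, time.Second, 30*time.Second, false,
+		func(ctx context.Context) (bool, error) {
+			pod, err := cs.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
+			if err != nil {
+				return false, err
+			}
+			return pod.Spec.NodeName == node, nil
+		})
+}
+```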
+
+###### The scheduler only modifies `NominatedNodeName`, does not clear it in any case
+
+As of now, the scheduler clears the `NominatedNodeName` field at the end of a failed scheduling cycle if it
+found the nominated node unschedulable for the pod. However, this won't work if ClusterAutoscaler or Karpenter
+were to set it during scale up.
+
+In the most basic case, the node may not yet exist, so clearly it would be unschedulable for the pod.
+However, the potential mitigation of ignoring non-existing nodes wouldn't work either in the following case:
+
+1. Pods are unschedulable. For simplicity, let's say all of them are rejected by the NodeResourceFit plugin. (i.e., no node has enough CPU/memory for the pod's request)
+2. CA finds them, calculates the nodes necessary to be created
+3. CA puts `NominatedNodeName` on each pod
+4. The scheduler keeps trying to schedule those pending pods; here, let's say they stay unschedulable (no cluster event happens that could make the pods schedulable) until the node is created.
+5. The nodes are created, and registered to kube-apiserver. Let's say, at this point, the nodes have un-ready taints.
+6. The scheduler observes the `Node/Create` event, the `NodeResourceFit` plugin QHint returns `Queue`, and those pending pods are requeued to activeQ.
+7. The scheduling cycle starts handling those pending pods.
+8. However, because the nodes have un-ready taints, the pods are rejected by the `TaintToleration` plugin.
+9. The scheduler clears `NominatedNodeName` because it finds the nominated node (= the new node) unschedulable.
+
+In order to avoid the above scenarios, we simply remove the clearing logic. This means that the scheduler
+will never clear the `NominatedNodeName` - it may update it though if, based on its scheduling algorithm,
+it decides to ignore the current value of `NominatedNodeName` and put it on a different node (either to
+signal the preemption, or record the decision before binding as described in the above sections).
+
+##### Test plan: Integration tests
+
+We're going to add these integration tests:
+- The scheduler doesn't clear NominatedNodeName when the nominated node isn't available and the pod is unschedulable.
+  - And, once the node appears, the pod with NNN set is scheduled there (even if there are other equal-priority pending pods).
+
+Also, with [scheduler-perf](https://github.com/kubernetes/kubernetes/tree/master/test/integration/scheduler_perf), we'll make sure the scheduling throughputs for pods that go through Permit or PreBind don't regress too much.
+We need to accept a small regression to some extent since there'll be a new API call to set NominatedNodeName.
+But, as discussed, assuming PreBind already makes some API calls for the pods, the regression there should be small.
+
 
 ## Infrastructure Needed (Optional)