
Conversation

@elmiko elmiko commented Sep 10, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR adds a new lister for ready but unschedulable nodes and connects that lister to a new parameter in the node info processors' Process function. This change enables the autoscaler to use unschedulable, but otherwise ready, nodes as a last resort when creating node templates for scheduling simulation.
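For illustration, a minimal self-contained sketch of the proposed shape (types and the Process signature are heavily simplified here; the real cluster-autoscaler processor takes the autoscaling context and full *apiv1.Node objects, among other arguments):

```go
package sketch

// Node is a simplified stand-in for *apiv1.Node; the real processor works on
// the full Kubernetes node object.
type Node struct {
	Name          string
	Ready         bool
	Unschedulable bool
}

// NodeInfoProcessor sketches the idea behind the first version of this PR:
// ready-but-unschedulable nodes are passed separately so they can be used as
// a last resort when a node group has no ready, schedulable node to serve as
// a template source. The parameter names are illustrative.
type NodeInfoProcessor interface {
	Process(readyNodes, readyUnschedulableNodes []Node) map[string]Node
}
```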

Which issue(s) this PR fixes:

Fixes #8380

Special notes for your reviewer:

I'm not sure if this is the best way to solve this problem, but I am proposing this for further discussion and design.

Does this PR introduce a user-facing change?

Node groups where all the nodes are ready but unschedulable will be processed as potential candidates for scaling when simulating cluster scheduling.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. do-not-merge/needs-area cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Sep 10, 2025
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: elmiko
Once this PR has been reviewed and has the lgtm label, please assign aleksandra-malinowska for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed do-not-merge/needs-area labels Sep 10, 2025

elmiko commented Sep 10, 2025

I'm working on adding more unit tests for this behavior, but I wanted to share this solution so we could start talking about it.

@elmiko elmiko force-pushed the unschedulable-nodes-fix branch from a0ebb28 to 3270172 on October 2, 2025 20:50
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 2, 2025

elmiko commented Oct 2, 2025

I've rewritten this patch to use all nodes as the secondary value instead of using a new list of ready unschedulable nodes.

@elmiko elmiko changed the title WIP update to include unschedulable nodes update node info processors to include unschedulable nodes Oct 2, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 2, 2025

elmiko commented Oct 2, 2025

I need to do a little more testing on this locally, but I think this is fine for review.

-	// Last resort - unready/unschedulable nodes.
-	for _, node := range nodes {
+	// we want to check not only the ready nodes, but also ready unschedulable nodes.
+	for _, node := range append(nodes, allNodes...) {
Contributor Author

I'm not sure that it's appropriate to append these; theoretically, allNodes should already contain nodes. I'm going to test this out using just allNodes.

Contributor Author

Due to the filtering that happens in obtainNodeLists, we need to combine both lists of nodes here.

@elmiko elmiko force-pushed the unschedulable-nodes-fix branch from 3270172 to cb2649a on October 3, 2025 16:37

elmiko commented Oct 3, 2025

I updated the argument names in the Process function to make the source of the nodes clearer. I also changed the mixed node info processor so that it does not double-count the nodes in the unschedulable/unready detection clause.
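For illustration only, one way to avoid that double counting when combining the ready and all-node lists is to de-duplicate by node name; this is a hypothetical helper, not code from this patch:

```go
package sketch

import apiv1 "k8s.io/api/core/v1"

// combineNodeLists is a hypothetical helper: it merges readyNodes and allNodes
// so that a node appearing in both lists is only counted once.
func combineNodeLists(readyNodes, allNodes []*apiv1.Node) []*apiv1.Node {
	seen := make(map[string]bool, len(readyNodes)+len(allNodes))
	combined := make([]*apiv1.Node, 0, len(readyNodes)+len(allNodes))
	for _, list := range [][]*apiv1.Node{readyNodes, allNodes} {
		for _, node := range list {
			if seen[node.Name] {
				continue
			}
			seen[node.Name] = true
			combined = append(combined, node)
		}
	}
	return combined
}
```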


elmiko commented Oct 3, 2025

It seems like the update to the mixed node processor needs a little more investigation.

@elmiko elmiko force-pushed the unschedulable-nodes-fix branch from cb2649a to fd53c0b on October 3, 2025 16:59

elmiko commented Oct 3, 2025

It looks like we need both the readyNodes and allNodes lists due to the filtering that happens in the core.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 7, 2025
@elmiko elmiko force-pushed the unschedulable-nodes-fix branch from fd53c0b to 906a939 on October 8, 2025 18:44
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 8, 2025

elmiko commented Oct 8, 2025

Rebased.


elmiko commented Oct 14, 2025

@jackfrancis @towca any chance of a review here?


elmiko commented Oct 16, 2025

In any case, IMO the most readable change would be to:

  • Start passing allNodes instead of readyNodes to TemplateNodeInfoProvider.Process() without changing the signature. This is what the interface definition suggests anyway.
  • At the beginning of MixedTemplateNodeInfoProvider.Process(), group the passed allNodes into good and bad candidates utilizing isNodeGoodTemplateCandidate(). Then iterate over the good ones in the first loop, and over the bad ones in the last loop.

I can put together a patch like this and give it some tests.
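A rough, self-contained sketch of that grouping, with simplified types (the real code works on *apiv1.Node, and the existing isNodeGoodTemplateCandidate helper checks more than readiness and schedulability):

```go
package sketch

import "time"

// Node is a simplified stand-in for *apiv1.Node.
type Node struct {
	Ready         bool
	Unschedulable bool
}

// isNodeGoodTemplateCandidate mirrors the intent of the helper mentioned
// above; this simplified version only checks readiness and schedulability.
func isNodeGoodTemplateCandidate(n Node, now time.Time) bool {
	return n.Ready && !n.Unschedulable
}

// splitTemplateCandidates groups the passed nodes into good and bad template
// candidates: the first loop in Process would iterate over the good ones, and
// the last-resort loop over the bad ones.
func splitTemplateCandidates(allNodes []Node, now time.Time) (good, bad []Node) {
	for _, n := range allNodes {
		if isNodeGoodTemplateCandidate(n, now) {
			good = append(good, n)
		} else {
			bad = append(bad, n)
		}
	}
	return good, bad
}
```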

This change ensures that a sanitized node has its .spec.unschedulable
field set to false.
This change passes all the nodes to the mixed node info provider
processor that is called from `RunOnce`. The change is to allow
unschedulable and unready nodes to be processed as bad candidates during
the node info template generation.

The Process function has been updated to separate nodes into good and
bad candidates to make the filtering match the original intent.
@elmiko elmiko force-pushed the unschedulable-nodes-fix branch from 906a939 to 5244a8f on October 22, 2025 15:30

elmiko commented Oct 22, 2025

Rebased and updated with the requested changes.

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 22, 2025
 	}
 	newNode.Labels[apiv1.LabelHostname] = newName
 
+	newNode.Spec.Unschedulable = false
Collaborator

This is a bit scary: we're essentially making the Unschedulable field work like a status taint without explicitly configuring it like we do for taints. Are we sure this is a good idea in the general case? Maybe we should extend the TaintConfig so that the behavior can be configured explicitly?

At minimum I think we should add a comment explaining that:

  1. Normally when we pick the base Node to sanitize we pick a healthy, schedulable Node and assume that a new one will be similarly healthy.
  2. If there are no healthy, schedulable Nodes to pick and NodeGroup.TemplateNodeInfo() returns cloudprovider.ErrNotImplemented, we'll try sanitizing an unhealthy one that might have the Unschedulable field set to true. In this case we clear the field during sanitization on the assumption that the unschedulability is transient/specific to a given Node, and a new Node will not have the field set.

Is this something we should be assuming? Does anyone know how likely it is to happen in practice that we have a NodeGroup in which all new Nodes appear with the Unschedulable field set for an extended period of time?

I'm mostly worried about this breaking backwards compatibility for cloud providers that don't implement NodeGroup.TemplateNodeInfo():

  1. Assume such a cloud provider can have a NodeGroup where all Nodes are expected to have the Unschedulable field set for extended periods of time. If CA scales the group up, a new Node will also have the field set and the Pod won't be able to schedule.
  2. Before this change, CA would see all Nodes from the NodeGroup as "bad" candidates, so it wouldn't be able to create a nodeInfo for the NodeGroup, so it wouldn't attempt to scale it up. If there is another NodeGroup with healthy Nodes that can work for the pending Pods, CA would scale that one instead.
  3. After this change, CA would sanitize one of the bad candidates, and possibly attempt to scale up a NodeGroup that doesn't actually work.

@x13n @jackfrancis WDYT?
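As a concrete starting point, the requested comment could read roughly as follows next to the sanitization line (wording illustrative, not committed code):

```go
// Normally the base Node chosen for sanitization is healthy and schedulable,
// and a newly created Node is assumed to look the same. If no such Node
// exists and NodeGroup.TemplateNodeInfo() returns
// cloudprovider.ErrNotImplemented, an unhealthy Node may be used instead; its
// Unschedulable field is cleared here on the assumption that the
// unschedulability is transient/specific to that Node, so a new Node will not
// have the field set.
newNode.Spec.Unschedulable = false
```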

Contributor

Does anyone know how likely it is to happen in practice that we have a NodeGroup in which all new Nodes appear with the Unschedulable field set for an extended period of time?

For the "all new Nodes" scenario it would be an external cloudprovider scenario where we rely upon the cloudprovider to set Unschedulable to False after certain conditions are met. That sounds plausible and not merely theoretical to me?

I think to address your concern @towca we have to have some confidence that we can ignore the Unschedulable field value rather than simply overwrite it every time. A couple of paths forward:

  • as you point out, determine with some confidence the distinction between (1) "node that is Unschedulable for a reason that has something do with its group", i.e., if we replicate out more nodes in that group they will inherit that Unschedulable outcome and (2) node that is Unschedulable for reasons that are unique and non-replicable
  • just add a new opt-in feature flag "IncludeUnschedulableNodeCandidates" (or something like that) -- perhaps the conditions that would lead a user to include Unschedulable nodes in node template candidacy are environmental and not easily discerned by standard k8s API signals, in which case we leave it to the user to determine that they want to use this strategy (sketched below)
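A hypothetical sketch of how such an opt-in flag could be wired; the flag name and help text are illustrative and not an existing cluster-autoscaler option:

```go
package main

import (
	"flag"
	"fmt"
)

// includeUnschedulableNodeCandidates is the hypothetical opt-in flag discussed
// above; it is not an existing cluster-autoscaler option.
var includeUnschedulableNodeCandidates = flag.Bool(
	"include-unschedulable-node-candidates",
	false,
	"If true, ready-but-unschedulable nodes may be used as node template candidates when a node group has no schedulable nodes.")

func main() {
	flag.Parse()
	fmt.Println("include unschedulable node candidates:", *includeUnschedulableNodeCandidates)
}
```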

Contributor Author
@elmiko elmiko Oct 22, 2025

The Unschedulable field was key to the bug we are seeing, as the user is essentially setting Unschedulable: true for all the nodes in a node group as part of their upgrade process. New nodes would not have entered with Unschedulable: true, so this created a situation where the autoscaler could not process those nodes.

Contributor

So in the concrete situation you're describing, there is a known (temporary) Unschedulable: true node state during regular upgrades that can occasionally intersect w/ operational capacity needs, and when those intersect, new infra fails to be provisioned in a timely way?

Contributor Author

Right. For example, I want to use the autoscaler to force an expansion of a node group: I manually set all the nodes to .spec.unschedulable = true, I taint the nodes, then I start evicting workloads.

In theory, this will cause the autoscaler to see the new pending pods and create more nodes.

But if all the nodes in the node group are marked as unschedulable, then the autoscaler will not be able to produce a valid template from the observable nodes.

Contributor

The more we discuss this, the more I think this belongs behind a specific feature flag rather than being something we'd want to do by default. @towca wdyt?

Contributor Author

I'll add this to the agenda for tomorrow's SIG meeting.


Labels

area/cluster-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.


Development

Successfully merging this pull request may close these issues.

CA potential for skipped node template info when a node group contains only non-ready nodes

4 participants