Skip to content

Conversation

MartynaGrotek
Copy link

  • One-line PR description: Cluster Autoscaler Pod Condition
  • Other comments:

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 6, 2025
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. labels Oct 6, 2025
@k8s-ci-robot
Copy link
Contributor

Welcome @MartynaGrotek!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 6, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @MartynaGrotek. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Oct 6, 2025
@MartynaGrotek MartynaGrotek force-pushed the ca_pod_conditions branch 2 times, most recently from ba2c173 to fe05733 Compare October 8, 2025 08:42
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: MartynaGrotek
Once this PR has been reviewed and has the lgtm label, please assign johnbelamaric for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jonathan-innis
Copy link

cc: @jmdeal @engedaam @DerekFrank

Copy link

@jonathan-innis jonathan-innis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd personally like this KEP to be generalized to all autoscalers and provide generic guidance for when/how this condition type gets set so that current or new autoscalers outside of CAS can also take the guidance to implement the condition type


As a user, I want to have an easy and reliable way to investigate why my pods are stuck in the Pending phase.

#### Story 2

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should consider expanding these user stories out with use-cases from scheduling. I'd imagine there is more that can be had here beyond just observability. For instance, I wonder if there are any integrations with the current Workload Scheduling work that's happening or interactions that we could consider here with nominatedNodeName to make it so that the scheduler is aware that a node is actively launching for this pod and the scheduler should wait to schedule it.


How will security be reviewed, and by whom?

How will UX be reviewed, and by whom?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we should expand out on UX review a bit -- this is particularly thorny since it has to do with pod conditions which are going to be everywhere.

proposal will be implemented, this is the place to discuss them.
-->

We would emit them from the same place as corresponding k8s events, which is `EventingScaleUpStatusProcessor`.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This strikes me as pretty specific to CA -- I think we mentioned that this would be aligned across all autoscalers, not just CAS. We can probably leave the implementation details as an exercise to the implementer and just talk about the API and UX

ScaleUpResult | Pod condition type & status | Pod condition reason
:---------------------------- | :-------------------------------- | :-------------------
ScaleUpSuccessful | NodeProvisioningInProgress: True |
ScaleUpError | NodeProvisioningInProgress: True | ...Error

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the world of CA, what are some examples of errors?

ScaleUpError | NodeProvisioningInProgress: True | ...Error
ScaleUpNoOptionsAvailable | NodeProvisioningInProgress: False | NoOptionsAvailable
ScaleUpNotTried | NodeProvisioningInProgress: False | NotTried
ScaleUpInCooldown | NodeProvisioningInProgress: False | InCooldown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cooldown isn't a concept for Karpenter -- I do think that we can be flexible on the reason that different autoscalers provide, though. I'd be more interested in the type here and the overall meaning of the condition type being in false or true

-->

* Provide information regarding scale up for particular pod, which can be consumed by automations and other components (eg. scheduler: https://github.com/kubernetes/enhancements/issues/3990).
* Improve observability and debuggability for human operators.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be curious to know the specifics that you are looking to add here -- in general, we've seen that events have been enough for most people so I'd be curious for you to explain what's been unreliable about the current mechanism and why you need this

Copy link

@GnatorX GnatorX Oct 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think (at least for me) status provides a concrete "why" vs events often only shows autoscaler attempted do X but doesn't necessarily provide a final state of the pod being stuck (for example)


ScaleUpResult | Pod condition type & status | Pod condition reason
:---------------------------- | :-------------------------------- | :-------------------
ScaleUpSuccessful | NodeProvisioningInProgress: True |
Copy link

@DerekFrank DerekFrank Oct 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In karpenter we could set the pod condition reason to the nodeclaim we launched for the pod, that might be nice observability

- `EventsOnly` - current state,
- `EventsAndConditions` - produce both - we could turn it on scalability tests and assess the impact on the overall performance,
- `ConditionsOnly` - produce only new pod conditions.
- Components depending on the feature gate: `cluster-autoscaler`.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+karpenter

ScaleUpSuccessful | NodeProvisioningInProgress: True |
ScaleUpError | NodeProvisioningInProgress: True | ...Error
ScaleUpNoOptionsAvailable | NodeProvisioningInProgress: False | NoOptionsAvailable
ScaleUpNotTried | NodeProvisioningInProgress: False | NotTried

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

qq: In CAS whats the difference between NotTried and NoOptionsAvailable?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants