KEP-2371: update to beta #5632

haircommander · 2025-10-07T20:28:45Z

One-line PR description:

Issue link: cAdvisor-less, CRI-full Container and Pod Stats #2371

Other comments:

k8s-ci-robot · 2025-10-07T20:28:54Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: haircommander
Once this PR has been reviewed and has the lgtm label, please assign johnbelamaric for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

keps/prod-readiness/OWNERS
keps/sig-node/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

kannon92 · 2025-10-08T13:18:02Z

keps/sig-node/2371-cri-pod-container-stats/README.md

 - Conduct research to find the set of metrics from `/metrics/cadvisor` that compliant CRI implementations must expose.

-#### Alpha -> Beta Graduation
+#### Beta


Will these be done before feature gate is turned on?

The requirements for PRR have changed so ideally most of the work is complete when this is promoted to beta.

yeah that's my hope. The containerd side of the verification may need to be done on a prerelease version, but since a lot of the testing will be manual I think that will be okay cc @akhilerm

Yes @haircommander . Will have to manually verify and I am trying to get it merged in before the next beta of containerd 2.2

> Conformance tests for the fields in `/metrics/cadvisor` should be created.

That sounds like the tests will be automated.

> - Conformance tests for the fields in `/metrics/cadvisor` should be created.
> - Validate performance impact of this feature is within allowable margin (or non-existent, ideally).
>   - The CRI stats implementation should perform better than they did with CRI+cAdvisor.
> - cAdvisor stats provider will be marked as deprecated, as well as the cAdvisor providing the metrics endpoint `/metrics/cadvisor`.
> - Write migration documentation for entities relying on metrics from `/metrics/cadvisor`.
> - Windows stats and metrics will be added.

This seems like a lot to do for beta promotion. Is the plan to do all of this in 1.35 cycle?

It sounds like performance impact of this is also container runtime dependent so I'd expect performance numbers for CRI-O / Containerd.

btw we do have an e2e_node test for it now, https://github.com/kubernetes/kubernetes/blob/b393d87d16f225f873f72a79734b3409323b4a05/test/e2e_node/container_metrics_test.go#L39, so we'd have to find a way to transfer it to conformance.
I'm dropping "Write migration documentation for entities relying on metrics from /metrics/cadvisor." as we have changed the implementation to still use /metrics/cadvisor.
I am also dropping the Windows piece; SIG Windows can do a follow-up KEP for that.

I guess it's not clear to me what this actually means.

https://github.com/kubernetes/kubernetes/blob/b393d87d16f225f873f72a79734b3409323b4a05/test/e2e_node/container_metrics_test.go#L36

This test is marked as "NodeConformance" in test/e2e_node. Are you proposing that we move this test to test/e2e/ and move it to the conformance tests for a k8s distribution?

keps/sig-node/2371-cri-pod-container-stats/README.md

kannon92

Out of diff, but is this still an open question?

https://github.com/kubernetes/enhancements/blob/f04c6991969c25f117686f065bd761493e404d08/keps/sig-node/2371-cri-pod-container-stats/README.md#open-questions

kannon92 · 2025-10-09T15:16:07Z

keps/sig-node/2371-cri-pod-container-stats/README.md

 Ideally all components will rely on summary API thereby alleviating need for cAdvisor for container and pod level stats.
 This is also a requirement to be able to disable cAdvisor container metrics collection.

+To make clear to cluster admins when metrics are coming from CRI, rather than cadvisor, a new metric `kubelet_metrics_provider` will be used, with `provider` label either `cri` or `cadvisor`.


If there are issues with the kubelet metrics provider, do you think it's worth exposing this in the metric?

I would advocate that the specific providers report their own error metrics.
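For illustration, a minimal sketch of how an admin-side script might read the proposed `kubelet_metrics_provider` metric out of a kubelet's Prometheus text output. The KEP text quoted above only names the metric and its `provider` label values (`cri` or `cadvisor`); the sample value format (a gauge set to 1) and the helper name `metrics_provider` are assumptions made for this example.

```python
import re

# Matches a sample line such as: kubelet_metrics_provider{provider="cri"} 1
# Assumption: the metric is exposed as a gauge whose non-zero sample marks
# the active provider. Only the metric name and label come from the KEP text.
_PROVIDER_RE = re.compile(
    r'^kubelet_metrics_provider\{[^}]*provider="(?P<provider>[^"]+)"[^}]*\}'
    r'\s+(?P<value>\S+)'
)


def metrics_provider(exposition_text):
    """Return the active provider label, or None if the metric is absent."""
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and HELP/TYPE comments.
        if not line or line.startswith("#"):
            continue
        match = _PROVIDER_RE.match(line)
        if match and float(match.group("value")) != 0:
            return match.group("provider")
    return None
```

A `None` result here would mean the metric was not found in the scraped text, which is a different condition from the provider unexpectedly reading `cadvisor`.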

keps/sig-node/2371-cri-pod-container-stats/README.md

kannon92 · 2025-10-09T15:19:09Z

keps/sig-node/2371-cri-pod-container-stats/README.md

-* **What specific metrics should inform a rollback?**
+###### What specific metrics should inform a rollback?

 The lack of any metrics reported for pods and containers is the worst case scenerio here, and would require either a rollback or for the feature gate to be disabled.


Commented above but if kubelet provider is not working, should we expose a metric or something?

If Kubelet is unable to post metrics on a node, it seems difficult to find this out currently.

I think if the admin attempted to roll out the feature and it failed, the metric saying provider is 'cadvisor' unexpectedly would be the signal that the fallback happened

That makes sense and the metric is exposed per node?

IMO this is still a very difficult thing for someone to detect.

> The lack of any metrics reported for pods and containers is the worst case scenerio here, and would require either a rollback or for the feature gate to be disabled.

So the only way someone would find this out is if a kubelet on a node stopped posting metrics and the pods/containers on that node were not found in Prometheus.

That seems very complicated to tell if I had 5000 nodes.

It's worth calling out that a failed rollout would fall back to cadvisor, but if the metrics are not being posted at all, what is the best way to find that out? How does one find the bad node via metrics or monitoring?
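One way to picture the detection problem raised here is a fleet-wide comparison rather than a per-node inspection. The sketch below is only an illustration under stated assumptions: the three inputs (the full node list, a node-to-provider mapping, and the set of nodes that currently have container series) are hypothetical and would have to come from whatever monitoring system the cluster uses; nothing in the KEP defines them.

```python
def nodes_needing_attention(all_nodes, provider_by_node,
                            nodes_with_container_metrics,
                            expected_provider="cri"):
    """Split nodes into those with no container metrics at all and those
    that report metrics from a provider other than the expected one.

    All inputs are hypothetical, pre-collected monitoring data.
    """
    report = {"missing_metrics": [], "unexpected_provider": []}
    for node in sorted(all_nodes):
        if node not in nodes_with_container_metrics:
            # Worst case from the KEP text: nothing is reported for this node.
            report["missing_metrics"].append(node)
        elif provider_by_node.get(node) != expected_provider:
            # Metrics exist, but the provider is not the expected one
            # (for example, a fallback to cadvisor) or is unknown.
            report["unexpected_provider"].append(node)
    return report
```

With this shape, a 5000-node cluster reduces to two short lists, although it still depends on the monitoring system knowing the full node inventory independently of the kubelet metrics themselves.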

keps/sig-node/2371-cri-pod-container-stats/README.md

kannon92 · 2025-10-09T15:21:41Z

Please address the verify job failure.

PRR shadow:

I left some comments but overall I think it is close.

Signed-off-by: Peter Hunt <pehunt@redhat.com>

haircommander · 2025-10-09T17:22:58Z

thanks @kannon92 updated!

kannon92 · 2025-10-10T17:33:54Z

keps/sig-node/2371-cri-pod-container-stats/README.md

+- e2e_node tests for the fields in `/metrics/cadvisor` should be created.
 - Validate performance impact of this feature is within allowable margin (or non-existent, ideally).
 	- The CRI stats implementation should perform better than they did with CRI+cAdvisor.
 - cAdvisor stats provider will be marked as deprecated, as well as the cAdvisor providing the metrics endpoint `/metrics/cadvisor`.


How will the deprecation be announced and exposed to the cluster admin?

> cAdvisor stats provider support will be dropped

My concern is that we will have to give proper notice to remove the cadvisor stats provider.

I am curious whether we really need to put removal of the cadvisor stats provider in this KEP. I heard that the way the dockershim removal was done caused a lot of problems for end users, so I am not sure dockershim is the right path to follow here.

kannon92 · 2025-10-10T17:40:44Z

keps/sig-node/2371-cri-pod-container-stats/README.md

+###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

 A piece of work for Beta is moving the source of the contents of `/metrics/cadvisor`. If users toggle the feature gate,
 prometheus collectors will have to move the URL. However, it's an expressed intention of the implementation to have the CRI


I think you should answer yes here. You want to deprecate and remove cadvisor stats provider, no?

kannon92 · 2025-10-10T17:42:51Z

keps/sig-node/2371-cri-pod-container-stats/README.md

-* **What steps should be taken if SLOs are not being met to determine the problem?**
+###### What steps should be taken if SLOs are not being met to determine the problem?
+
+A cluster admin can investigate the CRI implementation, or revert the feature gate to fallback to cAdvisor.


This would only be valid for Beta though. This answer would not be possible to do if this feature went GA.

If there is a performance issue or someone wants to turn this feature off, does that need to be supported?

or is the goal to make sure that this feature should never be disabled when GA?

> A cluster admin can investigate the CRI implementation

I am reading this answer from the perspective of a cluster admin who has little knowledge of CRI implementations. What exactly would they do here if this feature misbehaved?

kannon92 · 2025-10-10T17:44:58Z

keps/sig-node/2371-cri-pod-container-stats/README.md

    - Usage description:
      - Impact of its outage on the feature: The feature, as well as many other pieces of Kubernetes, would not work, as the CRI implementation is vital to the creation and running of Pods.
      - Impact of its degraded performance or high-error rates on the feature: All Kuberetes operations will slow down if the CRI spends too much energy in getting the stats.
+      - All supported CRI-O versions have this feature, but containerd must be version 2.2 or later.


How will the rollout of this feature go for containerd implementations that do not support this?

Will these clusters fall back to cadvisor stats provider even if the feature gate is enabled?
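The thread does not say how the kubelet itself handles an older containerd, so the following is only a sketch of an admin-side preflight check based on the "containerd must be version 2.2 or later" line in the diff. The helper name and the decision to treat prerelease builds such as `2.2.0-beta.0` as satisfying the minimum are assumptions for this example.

```python
import re

# From the KEP diff above: containerd must be version 2.2 or later.
MIN_CONTAINERD = (2, 2)


def containerd_supports_cri_stats(version):
    """Return True if a containerd version string is at least 2.2.

    Only major.minor is compared, so prerelease suffixes are ignored
    (an assumption; "2.2.0-beta.0" counts as 2.2 here).
    """
    match = re.match(r"^v?(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unrecognized containerd version: {version!r}")
    return (int(match.group(1)), int(match.group(2))) >= MIN_CONTAINERD
```

Comparing integer tuples rather than strings avoids the usual trap where `"2.10"` sorts before `"2.2"`.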

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 7, 2025

k8s-ci-robot requested review from dchen1107 and palnabarun October 7, 2025 20:28

k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Oct 7, 2025

k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 7, 2025

haircommander mentioned this pull request Oct 7, 2025

cAdvisor-less, CRI-full Container and Pod Stats #2371

Open

17 tasks


KEP-2371: update to beta

0c5a6c4

Signed-off-by: Peter Hunt <pehunt@redhat.com>

haircommander force-pushed the 2371-beta-2 branch from f04c699 to 0c5a6c4 Compare October 9, 2025 17:20



4 participants