Skip to content

Conversation

@camilamacedo86
Copy link
Member

@camilamacedo86 camilamacedo86 commented Nov 14, 2025

Problem

When the default e2e suite is run against a project that scaffolds admission
webhooks, the "metrics" test can fail with errors such as:

Error from server (InternalError): Internal error occurred: failed calling webhook:
Post "https://...webhook-service:443/validate--v1-pod": dial tcp ...:443:
connect: connection refused

The failure happens when the test launches the curl-metrics pod to verify the
metrics endpoint.

Root Cause

The original test assumed that, once the controller pod reported Running, the
webhook server was also ready to admit new pods. In reality:

  1. The controller pod reaches Running.
  2. cert-manager is still issuing the serving certificate secret for the webhook.
  3. The webhook server only starts once that secret is created and mounted.
  4. The metrics e2e test creates the curl pod while the webhook server is still initialising.
  5. The validating webhook rejects the pod because it cannot accept connections yet.

Checking the controller logs for "Serving metrics server" does not guarantee the webhook server is listening, so the race persisted.

Solution

Gate the curl pod creation on two readiness checks:

  1. Wait for the controller pod to report the Ready condition.
  2. When webhooks are scaffolded, wait for the webhook service to publish EndpointSlice addresses (the Kubernetes-recommended readiness signal).

Controller readiness:

By("ensuring the controller pod is ready")
verifyControllerPodReady := func(g Gomega) {
    cmd := exec.Command("kubectl", "get", "pod", controllerPodName, "-n", namespace,
        "-o", "jsonpath={.status.conditions[?(@.type=='Ready')].status}")
    output, err := utils.Run(cmd)
    g.Expect(err).NotTo(HaveOccurred())
    g.Expect(output).To(Equal("True"), "Controller pod not ready")
}
Eventually(verifyControllerPodReady, 3*time.Minute, time.Second).Should(Succeed())

Webhook readiness (injected only when the project wires webhooks):

verifyWebhookEndpointsReady := func(g Gomega) {
    cmd := exec.Command("kubectl", "get", "endpointslices.discovery.k8s.io", "-n", namespace,
        "-l", "kubernetes.io/service-name=%s",
        "-o", "jsonpath={range .items[*]}{range .endpoints[*]}{.addresses[*]}{end}{end}")
    output, err := utils.Run(cmd)
    g.Expect(err).NotTo(HaveOccurred(), "Webhook endpoints should exist")
    g.Expect(output).ShouldNot(BeEmpty(), "Webhook endpoints not yet ready")
}
Eventually(verifyWebhookEndpointsReady, 3*time.Minute, time.Second).Should(Succeed())

A new scaffold marker (+kubebuilder:scaffold:e2e-metrics-webhooks-readiness) adds the EndpointSlice check only for webhook-enabled projects, keeping the base template generic.

Why this works

  • The Ready condition requires all containers, volumes (including the webhook
    cert secret), and readiness probes to succeed, so the controller and its webhook server have finished initialising.
  • EndpointSlices provide the modern, recommended signal that a service has routable endpoints, avoiding deprecated Endpoints objects.
  • The previous log-based verification that the metrics server is serving remains in place, and the curl pod still validates the metrics output once the checks pass.

Notes on Kubernetes 1.33+

Kubernetes 1.33+ may exhibit a brief delay before the metrics endpoint becomes
available when controller-runtime's WithAuthenticationAndAuthorization() is
used with self-signed certificates. The readiness sequence above absorbs that
startup delay, so the curl pod is only launched after Kubernetes itself signals
that both the controller pod and webhook service are ready.

Closes: #5138 #5137

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Nov 14, 2025
@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Nov 14, 2025
@camilamacedo86 camilamacedo86 force-pushed the fix-flakes branch 2 times, most recently from 527e712 to e6f2678 Compare November 14, 2025 18:24
@camilamacedo86 camilamacedo86 changed the title WIP: 🐛 Fix e2e test flakes when webhooks are scaffolded 🐛 Fix e2e test flakes when webhooks are scaffolded Nov 14, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 14, 2025
@camilamacedo86 camilamacedo86 changed the title 🐛 Fix e2e test flakes when webhooks are scaffolded WIP 🐛 Fix e2e test flakes when webhooks are scaffolded Nov 14, 2025
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 14, 2025
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Nov 14, 2025
@camilamacedo86 camilamacedo86 changed the title WIP 🐛 Fix e2e test flakes when webhooks are scaffolded 🐛 (go/v4): Fix flaky e2e tests in scaffolded projects in the metrics check Nov 14, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 14, 2025
@camilamacedo86
Copy link
Member Author

Hi @mayuka-c

Could you help review this one?

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 14, 2025
Copy link
Contributor

@mayuka-c mayuka-c left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

This looks good to me since the controller pod readiness guarantees that the webhook server is ready. The approach which I had taken directly checks for webhook readiness which has additional overhead of defining and maintaining markers so that the webhook readiness checks only runs on PROJECT which has webhooks configured.

Thank you so much @camilamacedo86 :)

@k8s-ci-robot
Copy link
Contributor

@mayuka-c: changing LGTM is restricted to collaborators

In response to this:

/lgtm

This looks good to me since the controller pod readiness guarantees that the webhook server is ready. The approach which I had taken directly checks for webhook readiness which has additional overhead of defining and maintaining markers so that the webhook readiness checks only runs on PROJECT which has webhooks configured.

Thank you so much @camilamacedo86 :)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: camilamacedo86, mayuka-c

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Nov 15, 2025
@camilamacedo86 camilamacedo86 changed the title 🐛 (go/v4): Fix flaky e2e tests in scaffolded projects in the metrics check 🐛 (go/v4): Fix flaky metrics e2e tests when webhooks are scaffolded Nov 15, 2025
g.Expect(err).NotTo(HaveOccurred(), "Webhook endpoints should exist")
g.Expect(output).ShouldNot(BeEmpty(), "Webhook endpoints not yet ready")
}
Eventually(verifyWebhookEndpointsReady, 3*time.Minute, time.Second).Should(Succeed())
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mayuka-c that is the code that we must to inject.
Without it we still risking the flakes

Why?

Without waiting for the webhook service to publish endpoints, the curl pod can hit the validating webhook while it’s still initializing, producing the same “connection refused” failure. The EndpointSlice check is what gives us a deterministic signal that the webhook server is actually accepting traffic before we launch the curl metrics pod.

Copy link
Contributor

@mayuka-c mayuka-c Nov 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this would be required. I mean the approach which was taken here (To have the marker to check for webhook readiness).

Sorry, if I caused some misunderstandings here :)

Copy link
Member Author

@camilamacedo86 camilamacedo86 Nov 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not at all — you helped a lot!
You identified the issue, and your effort was absolutely not lost.
I’ve added you as a co-author of this PR to give you proper credit for your contribution. 🙌

Screenshot 2025-11-15 at 16 37 52

@camilamacedo86
Copy link
Member Author

Hi @mayuka-c

Sorry, I changed it to do more tests and forgot to clarify.
I tried many options. The final solution is now in place.
Please, feel free to check

@mayuka-c
Copy link
Contributor

Hi @mayuka-c

Sorry, I changed it to do more tests and forgot to clarify. I tried many options. The final solution is now in place. Please, feel free to check

Yeah this looks perfect for me. Once this is merged, I will be closing this PR here.
Thanks :)

Projects with webhooks may experience flaky e2e tests due to webhook
server not being ready when the metrics test creates the curl-metrics pod.

Co-authored-by: Mayuka Channankaiah <github@users.noreply.github.com>
@camilamacedo86 camilamacedo86 added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 15, 2025
@k8s-ci-robot k8s-ci-robot merged commit 6a41213 into kubernetes-sigs:master Nov 15, 2025
37 checks passed
@camilamacedo86 camilamacedo86 deleted the fix-flakes branch November 15, 2025 21:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants