OCPBUGS-26404: Add retry logic for SNO cluster detection in leader election configuration #1210

jianzhangbjz · 2026-01-28T06:41:50Z

Problem:

During package-server-manager startup, the code attempts to detect if the cluster is SNO (Single Node OpenShift) to use appropriate leader election values. Previously, this used a single 3-second timeout with no retry. If the API server was slow to respond during startup (common in SNO environments), the detection would fail and incorrectly default to HA leader election values.

Solution:

Added retry logic using wait.PollUntilContextTimeout: detection is retried every 2 seconds for up to 30 seconds.
Updated log messages to clarify the intent: detecting SNO cluster topology rather than "getting infrastructure status".
Changed the fallback so that HA values are used only after all retries are exhausted (a safe default, since HA values work on both HA and SNO clusters).

Assisted-By: Claude-Code

openshift-ci-robot · 2026-01-28T06:41:57Z

@jianzhangbjz: This pull request references Jira Issue OCPBUGS-26404, which is invalid:

expected the bug to target the "4.22.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci · 2026-01-28T06:42:22Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jianzhangbjz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~DOWNSTREAM_OWNERS~~ [jianzhangbjz]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2026-01-28T06:42:54Z

@jianzhangbjz: This pull request references Jira Issue OCPBUGS-26404, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.22.0) matches configured target version for branch (4.22.0)
bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (jiazha@redhat.com), skipping review request.

jianzhangbjz · 2026-01-28T06:43:04Z

/jira refresh

openshift-ci-robot · 2026-01-28T06:43:08Z

@jianzhangbjz: This pull request references Jira Issue OCPBUGS-26404, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.22.0) matches configured target version for branch (4.22.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (jiazha@redhat.com), skipping review request.

tmshort · 2026-01-28T14:44:37Z

/retest

jianzhangbjz · 2026-01-29T01:51:23Z

/retest-required

openshift-ci · 2026-01-29T05:07:57Z

@jianzhangbjz: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

jianzhangbjz · 2026-01-29T08:31:12Z

/payload-job periodic-ci-openshift-openshift-tests-private-release-4.22-amd64-nightly-baremetal-sno-ipv4-etcd-encryption-rt-kernel-basecap-f7

openshift-ci · 2026-01-29T08:31:18Z

@jianzhangbjz: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-openshift-tests-private-release-4.22-amd64-nightly-baremetal-sno-ipv4-etcd-encryption-rt-kernel-basecap-f7

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/df5ada00-fcec-11f0-8c55-0e5f6c8f9482-0

jianzhangbjz · 2026-01-29T10:33:31Z

Test passed. Details:

1. Build an OCP cluster with this PR
launch 4.22,openshift/operator-framework-olm#1210 aws,single-node

jiazha-mac:openshift-tests-private jiazha$ oc get nodes
NAME                                        STATUS   ROLES                         AGE   VERSION
ip-10-0-72-237.us-west-1.compute.internal   Ready    control-plane,master,worker   80m   v1.34.2

jiazha-mac:~ jiazha$ oc get clusterversion 
NAME      VERSION                                                AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.22.0-0-2026-01-29-085653-test-ci-ln-592429t-latest   True        False         49m     Cluster version is 4.22.0-0-2026-01-29-085653-test-ci-ln-592429t-latest
jiazha-mac:~ jiazha$ 

2. Run test case
jiazha-mac:openshift-tests-private jiazha$ ./bin/extended-platform-tests run all --dry-run |grep 49352|./bin/extended-platform-tests run -f -
  Jan 29 18:30:24.974: INFO: The --provider flag is not set. Continuing as if --provider=skeleton had been used.
started: (0/1/1) "[sig-operators] OLM should NonHyperShiftHOST-Author:jiazha-Medium-49352-SNO Leader election conventions for cluster topology"

  I0129 18:30:41.144053   59607 trace.go:236] Trace[376040953]: "Reflector ListAndWatch" name:k8s.io/client-go@v0.0.0-20230523190412-013d8779845c/tools/cache/reflector.go:231 (29-Jan-2026 18:30:27.043) (total time: 14100ms):
  Trace[376040953]: ---"Objects listed" error:<nil> 14100ms (18:30:41.143)
  Trace[376040953]: [14.100470708s] [14.100470708s] END
  Jan 29 18:30:31.904: INFO: The --provider flag is not set. Continuing as if --provider=skeleton had been used.
  Jan 29 18:30:35.737: INFO: configPath is now "/var/folders/5n/w9ysf4w93jnfy7k19xxct31c0000gn/T/configfile2179051669"
  Jan 29 18:30:35.737: INFO: The user is now "e2e-test-default-vzge2a6h-wh7cg-user"
  Jan 29 18:30:35.737: INFO: Creating project "e2e-test-default-vzge2a6h-wh7cg"
  Jan 29 18:30:36.043: INFO: Waiting on permissions in project "e2e-test-default-vzge2a6h-wh7cg" ...
  Jan 29 18:30:37.735: INFO: Waiting for ServiceAccount "default" to be provisioned...
  Jan 29 18:30:38.364: INFO: Waiting for ServiceAccount "builder" to be provisioned...
  Jan 29 18:30:38.702: INFO: Waiting for ServiceAccount "deployer" to be provisioned...
  Jan 29 18:30:39.762: INFO: Waiting for RoleBinding "system:image-pullers" to be provisioned...
  Jan 29 18:30:40.573: INFO: Waiting for RoleBinding "system:image-builders" to be provisioned...
  Jan 29 18:30:41.269: INFO: Waiting for RoleBinding "system:deployers" to be provisioned...
  Jan 29 18:30:41.737: INFO: Project "e2e-test-default-vzge2a6h-wh7cg" has been fully provisioned.
  STEP: 1) get the cluster topology 01/29/26 18:30:41.738
  Jan 29 18:30:41.739: INFO: Running 'oc --kubeconfig=/Users/jiazha/bot-kubeconfig get infrastructures cluster -o=jsonpath={.status.controlPlaneTopology}'
  STEP: 2) get the leaseDurationSeconds of the packageserver-controller-lock 01/29/26 18:30:43.898
  Jan 29 18:30:43.898: INFO: Running 'oc --kubeconfig=/Users/jiazha/bot-kubeconfig get lease packageserver-controller-lock -n openshift-operator-lifecycle-manager -o=jsonpath={.spec.leaseDurationSeconds}'
  Jan 29 18:30:45.739: INFO: This is a SNO cluster
  Jan 29 18:30:45.977: INFO: Deleted {user.openshift.io/v1, Resource=users  e2e-test-default-vzge2a6h-wh7cg-user}, err: <nil>
  Jan 29 18:30:46.212: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthclients  e2e-client-e2e-test-default-vzge2a6h-wh7cg}, err: <nil>
  Jan 29 18:30:46.445: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthaccesstokens  sha256~Ul-l_XyKvmxfoWKwvhqzhKOY58puto7qa3eZkYqgIQg}, err: <nil>

passed: (20.3s) 2026-01-29T10:30:47 "[sig-operators] OLM should NonHyperShiftHOST-Author:jiazha-Medium-49352-SNO Leader election conventions for cluster topology"

1 pass, 0 skip (20.3s)
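
The two `oc` steps in the log above amount to a comparison between the reported topology and the lease duration. A rough sketch of that comparison follows; it is not the actual test case 49352. The function `checkLease` and the table `expectedLeaseSeconds` are hypothetical names, and the expected durations are assumptions for illustration that should be checked against the values the operator really configures.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// expectedLeaseSeconds maps a controlPlaneTopology value to the lease
// duration the check expects. Both numbers are assumptions for illustration.
var expectedLeaseSeconds = map[string]int{
	"SingleReplica":   270,
	"HighlyAvailable": 137,
}

// checkLease takes the raw stdout of the two jsonpath queries (topology and
// leaseDurationSeconds) and reports a mismatch as an error.
func checkLease(topologyOut, leaseOut string) error {
	topology := strings.TrimSpace(topologyOut)
	want, ok := expectedLeaseSeconds[topology]
	if !ok {
		return fmt.Errorf("unrecognized control plane topology %q", topology)
	}
	got, err := strconv.Atoi(strings.TrimSpace(leaseOut))
	if err != nil {
		return fmt.Errorf("parsing leaseDurationSeconds %q: %w", leaseOut, err)
	}
	if got != want {
		return fmt.Errorf("topology %s: leaseDurationSeconds is %d, want %d",
			topology, got, want)
	}
	return nil
}

func main() {
	// A lease that matches the assumed SNO value, then one that does not.
	fmt.Println(checkLease("SingleReplica\n", "270\n"))
	fmt.Println(checkLease("SingleReplica\n", "137\n"))
}
```

A mismatch on a SingleReplica cluster is the symptom described in the Problem section: detection failed at startup and the HA values were applied instead.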

jianzhangbjz · 2026-01-29T10:33:45Z

/lgtm
/verified by @jianzhangbjz

openshift-ci · 2026-01-29T10:33:47Z

@jianzhangbjz: you cannot LGTM your own PR.

openshift-ci-robot · 2026-01-29T10:33:58Z

@jianzhangbjz: This PR has been marked as verified by @jianzhangbjz.

Add retry logic for SNO cluster detection in leader election configuration

d8f6687

openshift-ci bot requested review from oceanc80 and rashmigottipati January 28, 2026 06:42

openshift-ci bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) Jan 28, 2026

openshift-ci-robot added the jira/valid-bug label (Indicates that a referenced Jira bug is valid for the branch this PR is targeting.) and removed the jira/invalid-bug label (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.) Jan 28, 2026

openshift-ci-robot added the verified label (Signifies that the PR passed pre-merge verification criteria) Jan 29, 2026
