NO-JIRA: externaloidc: return errors when node statuses cannot be used to determine oidc state #801

liouk · 2025-10-16T09:41:40Z

No description provided.

openshift-ci-robot · 2025-10-16T09:41:44Z

@liouk: This pull request explicitly references no jira issue.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai · 2025-10-16T09:42:08Z

Walkthrough

Added stricter OIDC availability validations and debug logging: require node statuses and non-zero CurrentRevision across nodes, require observed revisions, and add debug logs around auth-config/config-%d ConfigMaps. Also added debug logging around endpointCheckDisabledFunc in the endpoint accessibility controller.

Changes

| Cohort / File(s) | Summary |
|---|---|
| OIDC availability validation: `pkg/controllers/common/external_oidc.go`, `pkg/controllers/common/external_oidc_test.go` | Add `klog/v2` logging and stricter validations: error if `kas.Status.NodeStatuses` is empty; track empty per-node revisions and error if any `CurrentRevision` is 0; error if no observed revisions remain; require kas configmaps informer synced; log debug when expected `auth-config` or `config-%d` ConfigMaps are missing or lack OIDC config. Tests updated: expect an error for "no node statuses observed" and add cases for some/all node revisions being zero. |
| Endpoint accessibility logging: `pkg/libs/endpointaccessible/endpoint_accessible_controller.go` | Import `klog/v2` and add debug logs around `endpointCheckDisabledFunc`: log presence, errors returned by the func, and whether checks are skipped. No control-flow or API changes. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Review pkg/controllers/common/external_oidc.go for correctness of new error conditions and exact error messages.
Verify informer-synced checks and debug logging do not change observable behavior except added logs.
Run and inspect updated tests in pkg/controllers/common/external_oidc_test.go to ensure determinism and that expectations match new error returns.
Confirm added klog statements in pkg/libs/endpointaccessible/endpoint_accessible_controller.go use appropriate levels and avoid logging sensitive data.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description Check | ⚠️ Warning | The pull request contains no description provided by the author. The pass criterion for this check requires that "the description is related in some way to the changeset," but an empty or absent description cannot satisfy this requirement as it provides no information to relate to the changes. Although the check is lenient regarding level of detail, it still requires some description to exist and be connected to the changeset, which is not present in this case. | Add a pull request description that explains the motivation and context for returning errors when node statuses cannot be used to determine OIDC state. Even a brief description that relates to the changeset would satisfy this check; for example, explaining the issue being addressed or the benefit of these error handling improvements would help reviewers understand the PR's purpose and context. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title Check | ✅ Passed | The title "NO-JIRA: externaloidc: return errors when node statuses cannot be used to determine oidc state" directly describes the primary changes in the pull request. The main modifications are to the external OIDC component to add validation that returns errors when node statuses are empty, contain zero revisions, or cannot be reliably used to determine OIDC state. The title is specific, clear, and accurately captures the core objective of the changeset. While secondary logging additions exist in the endpoint_accessible_controller component, they are minor compared to the main focus on OIDC error handling improvements. |

Comment `@coderabbitai help` to get the list of available commands and usage tips.

openshift-ci · 2025-10-16T09:44:40Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: liouk

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [liouk]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

pkg/controllers/common/external_oidc.go (1)
71-78: LGTM! Logic correctly filters invalid revisions.

The conditional insertion ensures only valid (non-zero) revisions are tracked, while counting nodes with empty revisions for error reporting. This approach properly separates valid and invalid data.

One minor style nitpick:
-			numNodesWithEmptyRevision += 1
+			numNodesWithEmptyRevision++
The ++ operator is more idiomatic in Go for simple increments.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 215805c and fc58d2d.

📒 Files selected for processing (2)

pkg/controllers/common/external_oidc.go (1 hunks)
pkg/controllers/common/external_oidc_test.go (1 hunks)

🔇 Additional comments (6)

pkg/controllers/common/external_oidc.go (3)

80-82: Good validation: catch missing node status data early.

Checking for empty node statuses before proceeding prevents downstream logic from operating on incomplete data. The error message clearly identifies the root cause.

84-86: Excellent validation: reject partial/invalid node data.

Including the count of nodes with empty revisions in the error message helps operators diagnose the issue. This check ensures the function fails fast when node data is incomplete.

88-90: Approve defensive check, though technically unreachable.

This check is good defensive programming and guards against future logic changes. However, given the previous validations (lines 80-86), this condition cannot be reached in practice:

- If len(kas.Status.NodeStatuses) == 0, lines 80-82 return early.
- If any node has CurrentRevision <= 0, lines 84-86 return early.
- Otherwise every node has CurrentRevision > 0, so observedRevisions has at least one entry.

The check serves as a safety net and is acceptable to keep, especially in a WIP PR.

pkg/controllers/common/external_oidc_test.go (3)

35-36: LGTM! Test correctly expects error for missing node statuses.

The updated expectation aligns with the new validation in OIDCAvailable() that returns an error when no node statuses are found.

37-47: LGTM! Test coverage for partial zero revisions.

This test case validates the scenario where some nodes have valid revisions while others have zero, ensuring the function correctly rejects this inconsistent state.

48-58: LGTM! Test coverage for all zero revisions.

This test case covers the scenario where all nodes have invalid (zero) revisions, confirming the function properly rejects this degenerate state.
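The validations and test cases discussed in this review can be sketched as a small, self-contained Go program. This is only an illustration under stated assumptions: `nodeStatus`, `checkRevisions`, and the sentinel errors are hypothetical stand-ins rather than the operator's actual types or messages, and a plain map replaces the `sets.Set[int32]` used in the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// nodeStatus is a hypothetical stand-in for one entry of kas.Status.NodeStatuses.
type nodeStatus struct {
	CurrentRevision int32
}

// Hypothetical sentinel errors, one per validation described in the review.
var (
	errNoNodeStatuses      = errors.New("no node statuses found")
	errEmptyRevision       = errors.New("some nodes do not have a valid CurrentRevision")
	errNoObservedRevisions = errors.New("no observed revisions found")
)

// checkRevisions applies the three checks in the order the review lists them.
func checkRevisions(statuses []nodeStatus) (map[int32]struct{}, error) {
	observed := map[int32]struct{}{}
	numNodesWithEmptyRevision := 0
	for _, s := range statuses {
		if s.CurrentRevision > 0 {
			observed[s.CurrentRevision] = struct{}{}
		} else {
			numNodesWithEmptyRevision++
		}
	}

	if len(statuses) == 0 {
		return nil, errNoNodeStatuses
	}
	if numNodesWithEmptyRevision > 0 {
		return nil, errEmptyRevision
	}
	// Defensive only: with at least one node and no empty revisions,
	// observed always holds at least one entry.
	if len(observed) == 0 {
		return nil, errNoObservedRevisions
	}
	return observed, nil
}

func main() {
	cases := []struct {
		name     string
		statuses []nodeStatus
	}{
		{"no node statuses", nil},
		{"some revisions zero", []nodeStatus{{3}, {0}, {3}}},
		{"all revisions zero", []nodeStatus{{0}, {0}}},
		{"all revisions valid", []nodeStatus{{3}, {4}, {3}}},
	}
	for _, c := range cases {
		observed, err := checkRevisions(c.statuses)
		fmt.Printf("%s: observed=%d err=%v\n", c.name, len(observed), err)
	}
}
```

The three failing cases mirror the test cases reviewed above; none of them can reach the third check, which is why the review calls it a safety net.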

liouk · 2025-10-21T08:37:22Z

/test e2e-oidc-techpreview

everettraven · 2025-10-21T13:12:25Z

pkg/controllers/common/external_oidc.go

+	if len(kas.Status.NodeStatuses) == 0 {
+		return false, fmt.Errorf("determining observed revisions in kubeapiservers.operator.openshift.io/cluster; no node statuses found")
+	}


Could we move this before the for loop that iterates through the node statuses?

everettraven · 2025-10-21T13:14:36Z

pkg/controllers/common/external_oidc.go

 	}

 	observedRevisions := sets.New[int32]()
+	numNodesWithEmptyRevision := 0


Do we need to track this with a counter-like variable?

Presumably this is equivalent to len(kas.Status.NodeStatuses) - observedRevisions.Len() if we are only tracking > 0 current revisions in observedRevisions?

liouk · 2025-10-23

> Do we need to track this with a counter-like variable?

We can also use a bool; the only reason for the counter was to add it to the log message, but I guess this doesn't add any really useful information. I'll drop this then 👍

> Presumably this is equivalent to len(kas.Status.NodeStatuses) - observedRevisions.Len() if we are only tracking > 0 current revisions in observedRevisions?

It's not, because observedRevisions tracks unique revisions (it's a set), and this condition would fail if there are nodes on the same revision.
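The difference between the two quantities can be checked with a short sketch. The helper names `countEmpty` and `uniqueValid` are hypothetical, and a plain map stands in for the set used in the PR.

```go
package main

import "fmt"

// countEmpty returns how many nodes report a revision that is not yet set (<= 0).
func countEmpty(revisions []int32) int {
	n := 0
	for _, r := range revisions {
		if r <= 0 {
			n++
		}
	}
	return n
}

// uniqueValid returns the number of distinct revisions greater than zero,
// which is what a set of observed revisions would hold.
func uniqueValid(revisions []int32) int {
	seen := map[int32]struct{}{}
	for _, r := range revisions {
		if r > 0 {
			seen[r] = struct{}{}
		}
	}
	return len(seen)
}

func main() {
	// Three nodes, all on revision 7: no node has an empty revision.
	revisions := []int32{7, 7, 7}
	fmt.Println("nodes with empty revision:", countEmpty(revisions))
	fmt.Println("len(nodes) - len(set):", len(revisions)-uniqueValid(revisions))
}
```

With three nodes on the same revision, the first line reports 0 and the second reports 2, so the subtraction would wrongly flag a healthy cluster.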


everettraven · 2025-10-24T11:28:55Z

pkg/controllers/common/external_oidc.go

+	nodesWithEmptyRevision := false
 	for _, nodeStatus := range kas.Status.NodeStatuses {
-		observedRevisions.Insert(nodeStatus.CurrentRevision)
+		if nodeStatus.CurrentRevision > 0 {
+			observedRevisions.Insert(nodeStatus.CurrentRevision)
+		} else {
+			nodesWithEmptyRevision = true
+		}
+	}
+
+	if nodesWithEmptyRevision {
+		return false, fmt.Errorf("determining observed revisions in kubeapiservers.operator.openshift.io/cluster; some nodes do not have a valid CurrentRevision")


If we find one with an invalid revision, should we just return the error from within the loop, terminating it early?

As-is, I don't really see us gaining any benefit of continuing to loop once we've found at least one node with an invalid current revision.

Suggested change

-	nodesWithEmptyRevision := false
-	for _, nodeStatus := range kas.Status.NodeStatuses {
-		observedRevisions.Insert(nodeStatus.CurrentRevision)
-		if nodeStatus.CurrentRevision > 0 {
-			observedRevisions.Insert(nodeStatus.CurrentRevision)
-		} else {
-			nodesWithEmptyRevision = true
-		}
-	}
-	if nodesWithEmptyRevision {
-		return false, fmt.Errorf("determining observed revisions in kubeapiservers.operator.openshift.io/cluster; some nodes do not have a valid CurrentRevision")
+	for _, nodeStatus := range kas.Status.NodeStatuses {
+		if nodeStatus.CurrentRevision <= 0 {
+			return false, fmt.Errorf("determining observed revisions in kubeapiservers.operator.openshift.io/cluster; some nodes do not have a valid CurrentRevision")
+		}
+		observedRevisions.Insert(nodeStatus.CurrentRevision)
+	}

liouk · 2025-10-27

Of course -- now that we don't use the count this is much better 👍

xingxingxia · 2025-10-25T12:12:19Z

This PR is meant to solve the separate issue I saw in another test: #798 (comment).

I pre-merge tested PR #800 and this PR (#801) together using cluster-bot. #800 is already /verified, as I commented in that PR.
For this PR (#801), I pre-merge tested as follows:

# Cluster-Bot payload 1
build 4.21.0-0.nightly-2025-10-24-233040,openshift/cluster-authentication-operator#800,openshift/cluster-authentication-operator#801

# Cluster-Bot payload 2
build 4.21.0-0.nightly-2025-10-25-063101,openshift/cluster-authentication-operator#800,openshift/cluster-authentication-operator#801

Step 1
Launched a cluster with payload 1 and configured external OIDC auth on it. The rollout completed after about 20 minutes. Checked oc and console logins, among other things; all worked.
Step 2
At 09:47:45, started the upgrade to payload 2:

[xxia@2025-10-25 09:47:45 GMT my]$ oc adm upgrade --force --allow-explicit-upgrade --to-image registry.xxxxxxxxxxxxxxxxxxxx.org/ci-ln-3kdbf5b/release:latest # payload 2
...
Requested update to release image registry.xxxxxxxxxxxxxxxxxxxx.org/ci-ln-3kdbf5b/release:latest
[xxia@2025-10-25 09:47:49 GMT my]$

At 10:51:14, the upgrade completed:

[xxia@2025-10-25 10:51:14 GMT my]$ oc get clusterversion
NAME      VERSION                                                AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.21.0-0-2025-10-25-073555-test-ci-ln-3kdbf5b-latest   True        False         39s     Cluster version is 4.21.0-0-2025-10-25-073555-test-ci-ln-3kdbf5b-latest
[xxia@2025-10-25 10:51:16 GMT my]$

Step 3
Checked the CAO logs. The issue still occurred twice during the upgrade, at 10:14:00 and at 10:29:20:

[xxia@2025-10-25 10:52:59 GMT my]$ oc get event -n openshift-authentication-operator -o json > events-openshift-authentication-operator.json
[xxia@2025-10-25 10:53:04 GMT my]$ cat events-openshift-authentication-operator.json | jq -r '.items[] | select(.message | test ("Available changed from")) | "\(.firstTimestamp) \(.count) \(.message)"'
...
2025-10-25T10:14:00Z 1 Status for clusteroperator/authentication changed: Available changed from True to False ("OAuthServerServiceEndpointAccessibleControllerAvailable: service \"oauth-openshift\" not found"),status.relatedObjects changed from [{"route.openshift.io" "routes" "openshift-authentication" "oauth-openshift"} {"" "services" "openshift-authentication" "oauth-openshift"} {"operator.openshift.io" "authentications" "" "cluster"} {"config.openshift.io" "authentications" "" "cluster"} {"config.openshift.io" "infrastructures" "" "cluster"} {"config.openshift.io" "oauths" "" "cluster"} {"" "namespaces" "" "openshift-config"} {"" "namespaces" "" "openshift-config-managed"} {"" "namespaces" "" "openshift-authentication"} {"" "namespaces" "" "openshift-authentication-operator"} {"" "namespaces" "" "openshift-ingress"} {"" "namespaces" "" "openshift-oauth-apiserver"}] to [{"operator.openshift.io" "authentications" "" "cluster"} {"config.openshift.io" "authentications" "" "cluster"} {"config.openshift.io" "infrastructures" "" "cluster"} {"config.openshift.io" "oauths" "" "cluster"} {"" "namespaces" "" "openshift-config"} {"" "namespaces" "" "openshift-config-managed"} {"" "namespaces" "" "openshift-authentication"} {"" "namespaces" "" "openshift-authentication-operator"} {"" "namespaces" "" "openshift-ingress"} {"" "namespaces" "" "openshift-oauth-apiserver"}]
2025-10-25T10:14:01Z 1 Status for clusteroperator/authentication changed: Available changed from False to True ("All is well")
2025-10-25T10:29:20Z 1 Status for clusteroperator/authentication changed: Available changed from True to False ("OAuthServerServiceEndpointAccessibleControllerAvailable: service \"oauth-openshift\" not found")
2025-10-25T10:29:23Z 1 Status for clusteroperator/authentication changed: Available changed from False to True ("All is well")

So the verification fails. @liouk

liouk · 2025-11-03T09:02:38Z

Added debug logging to investigate the issue found by @xingxingxia.

/hold

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

pkg/controllers/common/external_oidc.go (1)
79-120: Use a verbose log level for the new debug statements.

These [debug-801] messages now fire on every sync for each node and missing configmap at the default INFO verbosity, which will spam controller logs. Please gate them behind a higher verbosity level (e.g. klog.V(4)) or add an explicit verbosity check.
-			klog.Infof("[debug-801] node '%s' is on revision %d", nodeStatus.NodeName, nodeStatus.CurrentRevision)
+			klog.V(4).Infof("[debug-801] node '%s' is on revision %d", nodeStatus.NodeName, nodeStatus.CurrentRevision)
@@
-			klog.Infof("[debug-801] configmap auth-config-%d not found; informer HasSynced=%v", revision, c.kasNamespaceConfigMapsInformer.HasSynced())
+			klog.V(4).Infof("[debug-801] configmap auth-config-%d not found; informer HasSynced=%v", revision, c.kasNamespaceConfigMapsInformer.HasSynced())
@@
-			klog.Infof("[debug-801] configmap config-%d does not contain expected OIDC config", revision)
+			klog.V(4).Infof("[debug-801] configmap config-%d does not contain expected OIDC config", revision)
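The gating idea behind the `klog.V(4)` suggestion can be illustrated without importing klog. The sketch below is a stdlib-only stand-in: `gatedLogger` and `leveled` are hypothetical types written for this example and do not reproduce klog's implementation; they only show how a message at a level above the configured threshold becomes a no-op.

```go
package main

import (
	"fmt"
	"io"
	"os"
)

// gatedLogger is a hypothetical stand-in for a logger with a configured
// verbosity threshold; it is not klog's implementation.
type gatedLogger struct {
	threshold int
	out       io.Writer
}

// leveled is what V returns: a logger that is either enabled or a no-op.
type leveled struct {
	enabled bool
	out     io.Writer
}

// V enables logging only when the requested level is within the threshold.
func (g gatedLogger) V(level int) leveled {
	return leveled{enabled: level <= g.threshold, out: g.out}
}

// Infof writes the message only if this level is enabled.
func (l leveled) Infof(format string, args ...interface{}) {
	if !l.enabled {
		return
	}
	fmt.Fprintln(l.out, fmt.Sprintf(format, args...))
}

func main() {
	logger := gatedLogger{threshold: 2, out: os.Stdout}
	// Written: level 0 is within the threshold of 2.
	logger.V(0).Infof("node %q is on revision %d", "master-0", 7)
	// Suppressed: level 4 exceeds the threshold, so per-sync debug output stays quiet.
	logger.V(4).Infof("[debug-801] node %q is on revision %d", "master-0", 7)
}
```

The node name and revision here are made-up sample values. Raising the threshold to 4 or higher would make the second message appear, which is the behavior the review asks for: debug output on demand rather than on every sync.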

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 90f2f82 and 702bf57.

📒 Files selected for processing (2)

pkg/controllers/common/external_oidc.go (3 hunks)
pkg/libs/endpointaccessible/endpoint_accessible_controller.go (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

pkg/libs/endpointaccessible/endpoint_accessible_controller.go

openshift-ci · 2025-11-06T13:01:32Z

@liouk: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/unit | `702bf57` | link | true | `/test unit` |
| ci/prow/okd-scos-e2e-aws-ovn | `702bf57` | link | false | `/test okd-scos-e2e-aws-ovn` |
| ci/prow/test-operator-integration | `702bf57` | link | false | `/test test-operator-integration` |
| ci/prow/e2e-agnostic-upgrade | `702bf57` | link | true | `/test e2e-agnostic-upgrade` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci-robot added the jira/valid-reference label (indicates that this PR references a valid Jira ticket of any type) Oct 16, 2025

openshift-ci bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) Oct 16, 2025

openshift-ci bot requested a review from ibihim October 16, 2025 09:44

openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) Oct 16, 2025

coderabbitai bot reviewed Oct 16, 2025


everettraven reviewed Oct 21, 2025


liouk force-pushed the fix-oidc-available-condition branch from fc58d2d to 71dfa10 Compare October 23, 2025 09:14

externaloidc: return errors when node statuses cannot be used to dete…

4d280bd

…rmine oidc state

liouk force-pushed the fix-oidc-available-condition branch from 71dfa10 to 4d280bd Compare October 23, 2025 09:14

liouk changed the title ~~WIP: NO-JIRA: externaloidc: return errors when node statuses cannot be used to determine oidc state~~ NO-JIRA: externaloidc: return errors when node statuses cannot be used to determine oidc state Oct 23, 2025

openshift-ci bot removed the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) Oct 23, 2025

everettraven reviewed Oct 24, 2025


openshift-ci bot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) Nov 3, 2025

liouk force-pushed the fix-oidc-available-condition branch from fb40473 to 90f2f82 Compare November 4, 2025 14:11

DO-NOT-MERGE: add debug logging

702bf57

liouk force-pushed the fix-oidc-available-condition branch from 90f2f82 to 702bf57 Compare November 6, 2025 09:11

coderabbitai bot reviewed Nov 6, 2025

