Conversation

@liouk
Member

@liouk liouk commented Oct 16, 2025

No description provided.

@openshift-ci-robot added the jira/valid-reference label ("Indicates that this PR references a valid Jira ticket of any type.") Oct 16, 2025
@openshift-ci-robot
Contributor

@liouk: This pull request explicitly references no jira issue.

@openshift-ci bot added the do-not-merge/work-in-progress label ("Indicates that a PR should not merge because it is a work in progress.") Oct 16, 2025
@coderabbitai

coderabbitai bot commented Oct 16, 2025

Walkthrough

Added stricter OIDC availability validations and debug logging: require node statuses and non-zero CurrentRevision across nodes, require observed revisions, and add debug logs around auth-config/config-%d ConfigMaps. Also added debug logging around endpointCheckDisabledFunc in the endpoint accessibility controller.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| OIDC availability validation: pkg/controllers/common/external_oidc.go, pkg/controllers/common/external_oidc_test.go | Add klog/v2 logging and stricter validations: error if kas.Status.NodeStatuses is empty; track empty per-node revisions and error if any CurrentRevision is 0; error if no observed revisions remain; require kas configmaps informer synced; log debug when expected auth-config or config-%d ConfigMaps are missing or lack OIDC config. Tests updated: expect an error for "no node statuses observed" and add cases for some/all node revisions being zero. |
| Endpoint accessibility logging: pkg/libs/endpointaccessible/endpoint_accessible_controller.go | Import klog/v2 and add debug logs around endpointCheckDisabledFunc: log presence, errors returned by the func, and whether checks are skipped. No control-flow or API changes. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Review pkg/controllers/common/external_oidc.go for correctness of new error conditions and exact error messages.
  • Verify informer-synced checks and debug logging do not change observable behavior except added logs.
  • Run and inspect updated tests in pkg/controllers/common/external_oidc_test.go to ensure determinism and that expectations match new error returns.
  • Confirm added klog statements in pkg/libs/endpointaccessible/endpoint_accessible_controller.go use appropriate levels and avoid logging sensitive data.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description Check | ⚠️ Warning | The pull request contains no description provided by the author. The pass criterion for this check requires that "the description is related in some way to the changeset," but an empty or absent description cannot satisfy this requirement as it provides no information to relate to the changes. Although the check is lenient regarding level of detail, it still requires some description to exist and be connected to the changeset, which is not present in this case. | Add a pull request description that explains the motivation and context for returning errors when node statuses cannot be used to determine OIDC state. Even a brief description that relates to the changeset would satisfy this check; for example, explaining the issue being addressed or the benefit of these error handling improvements would help reviewers understand the PR's purpose and context. |
✅ Passed checks (1 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title Check | ✅ Passed | The title "NO-JIRA: externaloidc: return errors when node statuses cannot be used to determine oidc state" directly describes the primary changes in the pull request. The main modifications are to the external OIDC component to add validation that returns errors when node statuses are empty, contain zero revisions, or cannot be reliably used to determine OIDC state. The title is specific, clear, and accurately captures the core objective of the changeset. While secondary logging additions exist in the endpoint_accessible_controller component, they are minor compared to the main focus on OIDC error handling improvements. |

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci bot requested a review from ibihim October 16, 2025 09:44
@openshift-ci
Contributor

openshift-ci bot commented Oct 16, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: liouk

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci bot added the approved label ("Indicates a PR has been approved by an approver from all required OWNERS files.") Oct 16, 2025
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
pkg/controllers/common/external_oidc.go (1)

71-78: LGTM! Logic correctly filters invalid revisions.

The conditional insertion ensures only valid (non-zero) revisions are tracked, while counting nodes with empty revisions for error reporting. This approach properly separates valid and invalid data.

One minor style nitpick:

-			numNodesWithEmptyRevision += 1
+			numNodesWithEmptyRevision++

The ++ operator is more idiomatic in Go for simple increments.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 215805c and fc58d2d.

📒 Files selected for processing (2)
  • pkg/controllers/common/external_oidc.go (1 hunks)
  • pkg/controllers/common/external_oidc_test.go (1 hunks)
🔇 Additional comments (6)
pkg/controllers/common/external_oidc.go (3)

80-82: Good validation: catch missing node status data early.

Checking for empty node statuses before proceeding prevents downstream logic from operating on incomplete data. The error message clearly identifies the root cause.


84-86: Excellent validation: reject partial/invalid node data.

Including the count of nodes with empty revisions in the error message helps operators diagnose the issue. This check ensures the function fails fast when node data is incomplete.


88-90: Approve defensive check, though technically unreachable.

This check is good defensive programming and guards against future logic changes. However, given the previous validations (lines 80-86), this condition cannot be reached in practice:

  • If len(kas.Status.NodeStatuses) == 0, lines 80-82 return early
  • If all nodes have CurrentRevision <= 0, lines 84-86 return early
  • If any nodes have CurrentRevision > 0, observedRevisions will have entries

The check serves as a safety net and is acceptable to keep, especially in a WIP PR.

pkg/controllers/common/external_oidc_test.go (3)

35-36: LGTM! Test correctly expects error for missing node statuses.

The updated expectation aligns with the new validation in OIDCAvailable() that returns an error when no node statuses are found.


37-47: LGTM! Test coverage for partial zero revisions.

This test case validates the scenario where some nodes have valid revisions while others have zero, ensuring the function correctly rejects this inconsistent state.


48-58: LGTM! Test coverage for all zero revisions.

This test case covers the scenario where all nodes have invalid (zero) revisions, confirming the function properly rejects this degenerate state.

@liouk
Member Author

liouk commented Oct 21, 2025

/test e2e-oidc-techpreview

Comment on lines 80 to 82
if len(kas.Status.NodeStatuses) == 0 {
	return false, fmt.Errorf("determining observed revisions in kubeapiservers.operator.openshift.io/cluster; no node statuses found")
}
Contributor

Could we move this before the for loop that iterates through the node statuses?

}

observedRevisions := sets.New[int32]()
numNodesWithEmptyRevision := 0
Contributor

Do we need to track this with a counter-like variable?

Presumably this is equivalent to len(kas.Status.NodeStatuses) - observedRevisions.Len() if we are only tracking > 0 current revisions in observedRevisions?

Member Author

@liouk liouk Oct 23, 2025

> Do we need to track this with a counter-like variable?

We can also use a bool; the only reason for the counter was to add it to the log message, but I guess this doesn't add any really useful information. I'll drop this then 👍

> Presumably this is equivalent to len(kas.Status.NodeStatuses) - observedRevisions.Len() if we are only tracking > 0 current revisions in observedRevisions?

It's not, because observedRevisions tracks unique revisions (it's a set), so the equivalence breaks whenever several nodes are on the same revision.
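
To make the point about sets concrete, here is a minimal sketch (not code from the PR). It uses a plain map as the set instead of sets.New[int32](), and the helper name emptyRevisionCounts is hypothetical. It compares an explicit count of nodes without a valid revision against the value derived from the set size:

```go
package main

import "fmt"

// emptyRevisionCounts is a hypothetical helper for illustration only.
// counted: nodes whose revision is <= 0, counted explicitly.
// derived: len(nodes) minus the number of unique positive revisions,
// i.e. the expression proposed in the review comment.
func emptyRevisionCounts(revisions []int32) (counted, derived int) {
	unique := make(map[int32]struct{}) // stand-in for a set of revisions
	for _, r := range revisions {
		if r > 0 {
			unique[r] = struct{}{}
		} else {
			counted++
		}
	}
	derived = len(revisions) - len(unique)
	return counted, derived
}

func main() {
	// Three healthy nodes that all run revision 7.
	counted, derived := emptyRevisionCounts([]int32{7, 7, 7})
	fmt.Printf("counted=%d derived=%d\n", counted, derived) // counted=0 derived=2
}
```

With three nodes on the same revision the set holds a single entry, so the derived value reports two "empty" nodes even though none exist.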

@liouk force-pushed the fix-oidc-available-condition branch from fc58d2d to 71dfa10 on October 23, 2025 09:14
@liouk force-pushed the fix-oidc-available-condition branch from 71dfa10 to 4d280bd on October 23, 2025 09:14
@liouk changed the title from "WIP: NO-JIRA: externaloidc: return errors when node statuses cannot be used to determine oidc state" to "NO-JIRA: externaloidc: return errors when node statuses cannot be used to determine oidc state" Oct 23, 2025
@openshift-ci bot removed the do-not-merge/work-in-progress label ("Indicates that a PR should not merge because it is a work in progress.") Oct 23, 2025
Comment on lines 75 to 85
+nodesWithEmptyRevision := false
 for _, nodeStatus := range kas.Status.NodeStatuses {
-	observedRevisions.Insert(nodeStatus.CurrentRevision)
+	if nodeStatus.CurrentRevision > 0 {
+		observedRevisions.Insert(nodeStatus.CurrentRevision)
+	} else {
+		nodesWithEmptyRevision = true
+	}
 }

+if nodesWithEmptyRevision {
+	return false, fmt.Errorf("determining observed revisions in kubeapiservers.operator.openshift.io/cluster; some nodes do not have a valid CurrentRevision")
Contributor

If we find one with an invalid revision, should we just return the error from within the loop, terminating it early?

As-is, I don't really see us gaining any benefit of continuing to loop once we've found at least one node with an invalid current revision.

Suggested change
-nodesWithEmptyRevision := false
-for _, nodeStatus := range kas.Status.NodeStatuses {
-	observedRevisions.Insert(nodeStatus.CurrentRevision)
-	if nodeStatus.CurrentRevision > 0 {
-		observedRevisions.Insert(nodeStatus.CurrentRevision)
-	} else {
-		nodesWithEmptyRevision = true
-	}
-}
-if nodesWithEmptyRevision {
-	return false, fmt.Errorf("determining observed revisions in kubeapiservers.operator.openshift.io/cluster; some nodes do not have a valid CurrentRevision")
+for _, nodeStatus := range kas.Status.NodeStatuses {
+	if nodeStatus.CurrentRevision <= 0 {
+		return false, fmt.Errorf("determining observed revisions in kubeapiservers.operator.openshift.io/cluster; some nodes do not have a valid CurrentRevision")
+	}
+	observedRevisions.Insert(nodeStatus.CurrentRevision)
+}

Member Author

Of course -- now that we don't use the count this is much better 👍
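
For readers following the thread, the shape the review converges on (empty-status check first, then an early return inside the loop) might look roughly like the sketch below. This is an illustration under assumptions, not the PR's code: nodeStatus and observedRevisions are hypothetical stand-ins for the operator API type and for the relevant part of OIDCAvailable(), a plain map replaces sets.New[int32](), and the error strings are shortened.

```go
package main

import (
	"errors"
	"fmt"
)

// nodeStatus is a hypothetical stand-in for the per-node status in the
// kubeapiservers.operator.openshift.io/cluster resource; only the fields
// discussed in the review are modeled.
type nodeStatus struct {
	NodeName        string
	CurrentRevision int32
}

// observedRevisions returns the distinct revisions seen across all nodes,
// or an error when the statuses cannot be used to decide anything.
func observedRevisions(statuses []nodeStatus) (map[int32]struct{}, error) {
	// Check for missing data before iterating.
	if len(statuses) == 0 {
		return nil, errors.New("no node statuses found")
	}

	revisions := make(map[int32]struct{})
	for _, s := range statuses {
		// Stop at the first node without a valid revision.
		if s.CurrentRevision <= 0 {
			return nil, fmt.Errorf("node %q does not have a valid CurrentRevision", s.NodeName)
		}
		revisions[s.CurrentRevision] = struct{}{}
	}
	return revisions, nil
}

func main() {
	_, err := observedRevisions(nil)
	fmt.Println(err) // no node statuses found

	revs, err := observedRevisions([]nodeStatus{{"node-a", 7}, {"node-b", 7}, {"node-c", 8}})
	fmt.Println(len(revs), err) // 2 <nil>
}
```

Once both checks pass, the returned set has at least one entry, which is consistent with the earlier review note that a separate "no observed revisions" check works only as a safety net.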

@xingxingxia
Contributor

This PR is meant to solve the separate issue I saw in another test: #798 (comment).

I pre-merge tested #800 and this PR (#801) together using cluster-bot. #800 is already /verified, as I commented in that PR.
For #801, I pre-merge tested as follows:

# Cluster-Bot payload 1
build 4.21.0-0.nightly-2025-10-24-233040,openshift/cluster-authentication-operator#800,openshift/cluster-authentication-operator#801

# Cluster-Bot payload 2
build 4.21.0-0.nightly-2025-10-25-063101,openshift/cluster-authentication-operator#800,openshift/cluster-authentication-operator#801

Step 1
Launched a cluster with payload 1. Configured external OIDC auth on the cluster. The rollout completed after waiting ~20m. Checked oc and console logins, among others; all worked.
Step 2
At 09:47:45, started the upgrade to payload 2:

[xxia@2025-10-25 09:47:45 GMT my]$ oc adm upgrade --force --allow-explicit-upgrade --to-image registry.xxxxxxxxxxxxxxxxxxxx.org/ci-ln-3kdbf5b/release:latest # payload 2
...
Requested update to release image registry.xxxxxxxxxxxxxxxxxxxx.org/ci-ln-3kdbf5b/release:latest
[xxia@2025-10-25 09:47:49 GMT my]$

At 10:51:14, the upgrade completed:

[xxia@2025-10-25 10:51:14 GMT my]$ oc get clusterversion
NAME      VERSION                                                AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.21.0-0-2025-10-25-073555-test-ci-ln-3kdbf5b-latest   True        False         39s     Cluster version is 4.21.0-0-2025-10-25-073555-test-ci-ln-3kdbf5b-latest
[xxia@2025-10-25 10:51:16 GMT my]$

Step 3
Checked CAO logs. The issue still occurred twice during the upgrade, at 10:14:00 and 10:29:20:

[xxia@2025-10-25 10:52:59 GMT my]$ oc get event -n openshift-authentication-operator -o json > events-openshift-authentication-operator.json
[xxia@2025-10-25 10:53:04 GMT my]$ cat events-openshift-authentication-operator.json | jq -r '.items[] | select(.message | test ("Available changed from")) | "\(.firstTimestamp) \(.count) \(.message)"'
...
2025-10-25T10:14:00Z 1 Status for clusteroperator/authentication changed: Available changed from True to False ("OAuthServerServiceEndpointAccessibleControllerAvailable: service \"oauth-openshift\" not found"),status.relatedObjects changed from [{"route.openshift.io" "routes" "openshift-authentication" "oauth-openshift"} {"" "services" "openshift-authentication" "oauth-openshift"} {"operator.openshift.io" "authentications" "" "cluster"} {"config.openshift.io" "authentications" "" "cluster"} {"config.openshift.io" "infrastructures" "" "cluster"} {"config.openshift.io" "oauths" "" "cluster"} {"" "namespaces" "" "openshift-config"} {"" "namespaces" "" "openshift-config-managed"} {"" "namespaces" "" "openshift-authentication"} {"" "namespaces" "" "openshift-authentication-operator"} {"" "namespaces" "" "openshift-ingress"} {"" "namespaces" "" "openshift-oauth-apiserver"}] to [{"operator.openshift.io" "authentications" "" "cluster"} {"config.openshift.io" "authentications" "" "cluster"} {"config.openshift.io" "infrastructures" "" "cluster"} {"config.openshift.io" "oauths" "" "cluster"} {"" "namespaces" "" "openshift-config"} {"" "namespaces" "" "openshift-config-managed"} {"" "namespaces" "" "openshift-authentication"} {"" "namespaces" "" "openshift-authentication-operator"} {"" "namespaces" "" "openshift-ingress"} {"" "namespaces" "" "openshift-oauth-apiserver"}]
2025-10-25T10:14:01Z 1 Status for clusteroperator/authentication changed: Available changed from False to True ("All is well")
2025-10-25T10:29:20Z 1 Status for clusteroperator/authentication changed: Available changed from True to False ("OAuthServerServiceEndpointAccessibleControllerAvailable: service \"oauth-openshift\" not found")
2025-10-25T10:29:23Z 1 Status for clusteroperator/authentication changed: Available changed from False to True ("All is well")

So the verification fails. @liouk

@liouk
Member Author

liouk commented Nov 3, 2025

Added debug logging to investigate the issue found by @xingxingxia.

/hold

@openshift-ci bot added the do-not-merge/hold label ("Indicates that a PR should not merge because someone has issued a /hold command.") Nov 3, 2025
@liouk force-pushed the fix-oidc-available-condition branch from fb40473 to 90f2f82 on November 4, 2025 14:11
@liouk force-pushed the fix-oidc-available-condition branch from 90f2f82 to 702bf57 on November 6, 2025 09:11
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
pkg/controllers/common/external_oidc.go (1)

79-120: Use a verbose log level for the new debug statements. These [debug-801] messages now fire on every sync for each node and missing configmap at the default INFO verbosity, which will spam controller logs. Please gate them behind a higher verbosity level (e.g. klog.V(4)) or add an explicit verbosity check.

-			klog.Infof("[debug-801] node '%s' is on revision %d", nodeStatus.NodeName, nodeStatus.CurrentRevision)
+			klog.V(4).Infof("[debug-801] node '%s' is on revision %d", nodeStatus.NodeName, nodeStatus.CurrentRevision)
@@
-			klog.Infof("[debug-801] configmap auth-config-%d not found; informer HasSynced=%v", revision, c.kasNamespaceConfigMapsInformer.HasSynced())
+			klog.V(4).Infof("[debug-801] configmap auth-config-%d not found; informer HasSynced=%v", revision, c.kasNamespaceConfigMapsInformer.HasSynced())
@@
-			klog.Infof("[debug-801] configmap config-%d does not contain expected OIDC config", revision)
+			klog.V(4).Infof("[debug-801] configmap config-%d does not contain expected OIDC config", revision)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 90f2f82 and 702bf57.

📒 Files selected for processing (2)
  • pkg/controllers/common/external_oidc.go (3 hunks)
  • pkg/libs/endpointaccessible/endpoint_accessible_controller.go (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/libs/endpointaccessible/endpoint_accessible_controller.go

@openshift-ci
Contributor

openshift-ci bot commented Nov 6, 2025

@liouk: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/unit | 702bf57 | link | true | /test unit |
| ci/prow/okd-scos-e2e-aws-ovn | 702bf57 | link | false | /test okd-scos-e2e-aws-ovn |
| ci/prow/test-operator-integration | 702bf57 | link | false | /test test-operator-integration |
| ci/prow/e2e-agnostic-upgrade | 702bf57 | link | true | /test e2e-agnostic-upgrade |

Labels

approved: Indicates a PR has been approved by an approver from all required OWNERS files.
do-not-merge/hold: Indicates that a PR should not merge because someone has issued a /hold command.
jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.

4 participants