NO-JIRA: externaloidc: return errors when node statuses cannot be used to determine oidc state #801
base: master
Conversation
@liouk: This pull request explicitly references no jira issue.

In response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Walkthrough

Added stricter OIDC availability validations and debug logging: require node statuses and non-zero CurrentRevision across nodes, require observed revisions, and add debug logs around the auth-config/config-%d ConfigMaps. Also added debug logging around endpointCheckDisabledFunc in the endpoint accessibility controller.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (1 passed)
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: liouk

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel.
Actionable comments posted: 0
🧹 Nitpick comments (1)
pkg/controllers/common/external_oidc.go (1)
71-78: LGTM! Logic correctly filters invalid revisions.

The conditional insertion ensures only valid (non-zero) revisions are tracked, while counting nodes with empty revisions for error reporting. This approach properly separates valid and invalid data.

One minor style nitpick:

```diff
- numNodesWithEmptyRevision += 1
+ numNodesWithEmptyRevision++
```

The `++` operator is more idiomatic in Go for simple increments.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (2)
- pkg/controllers/common/external_oidc.go (1 hunks)
- pkg/controllers/common/external_oidc_test.go (1 hunks)
🔇 Additional comments (6)
pkg/controllers/common/external_oidc.go (3)
80-82: Good validation: catch missing node status data early.

Checking for empty node statuses before proceeding prevents downstream logic from operating on incomplete data. The error message clearly identifies the root cause.
84-86: Excellent validation: reject partial/invalid node data.

Including the count of nodes with empty revisions in the error message helps operators diagnose the issue. This check ensures the function fails fast when node data is incomplete.
88-90: Approve defensive check, though technically unreachable.

This check is good defensive programming and guards against future logic changes. However, given the previous validations (lines 80-86), this condition cannot be reached in practice:

- If `len(kas.Status.NodeStatuses) == 0`, lines 80-82 return early
- If all nodes have `CurrentRevision <= 0`, lines 84-86 return early
- If any nodes have `CurrentRevision > 0`, `observedRevisions` will have entries

The check serves as a safety net and is acceptable to keep, especially in a WIP PR.
pkg/controllers/common/external_oidc_test.go (3)
35-36: LGTM! Test correctly expects error for missing node statuses.

The updated expectation aligns with the new validation in `OIDCAvailable()` that returns an error when no node statuses are found.
37-47: LGTM! Test coverage for partial zero revisions.

This test case validates the scenario where some nodes have valid revisions while others have zero, ensuring the function correctly rejects this inconsistent state.
48-58: LGTM! Test coverage for all zero revisions.

This test case covers the scenario where all nodes have invalid (zero) revisions, confirming the function properly rejects this degenerate state.
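For illustration, here is the "all zero revisions" scenario as a standalone test against the observedKASRevisions sketch above; the actual table-driven tests in external_oidc_test.go exercise OIDCAvailable through their own fixtures:

```go
package common

import (
	"testing"

	operatorv1 "github.com/openshift/api/operator/v1"
)

// TestAllZeroRevisions is illustrative only: every node reports
// CurrentRevision 0, so the validation must return an error.
func TestAllZeroRevisions(t *testing.T) {
	statuses := []operatorv1.NodeStatus{
		{NodeName: "master-0", CurrentRevision: 0},
		{NodeName: "master-1", CurrentRevision: 0},
		{NodeName: "master-2", CurrentRevision: 0},
	}

	if _, err := observedKASRevisions(statuses); err == nil {
		t.Fatal("expected an error when all nodes have zero CurrentRevision, got nil")
	}
}
```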
/test e2e-oidc-techpreview
```go
if len(kas.Status.NodeStatuses) == 0 {
	return false, fmt.Errorf("determining observed revisions in kubeapiservers.operator.openshift.io/cluster; no node statuses found")
}
```
Could we move this before the for loop that iterates through the node statuses?
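In other words, the check would run before any per-node work. A sketch of the suggested ordering, assuming the names from the quoted hunk:

```go
// Suggested ordering: bail out before iterating node statuses at all.
if len(kas.Status.NodeStatuses) == 0 {
	return false, fmt.Errorf("determining observed revisions in kubeapiservers.operator.openshift.io/cluster; no node statuses found")
}

observedRevisions := sets.New[int32]()
for _, nodeStatus := range kas.Status.NodeStatuses {
	// per-node revision handling continues as before
}
```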
```go
}

observedRevisions := sets.New[int32]()
numNodesWithEmptyRevision := 0
```
Do we need to track this with a counter-like variable?
Presumably this is equivalent to `len(kas.Status.NodeStatuses) - observedRevisions.Len()` if we are only tracking > 0 current revisions in `observedRevisions`?
> Do we need to track this with a counter-like variable?

We can also use a bool; the only reason for the counter was to add it to the log message, but I guess this doesn't add any really useful information. I'll drop it then 👍

> Presumably this is equivalent to `len(kas.Status.NodeStatuses) - observedRevisions.Len()` if we are only tracking > 0 current revisions in `observedRevisions`?

It's not, because `observedRevisions` tracks unique revisions (it's a set), and this condition would fail if there are nodes on the same revision.
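A small runnable illustration of that point, using hypothetical node data: three nodes sharing one valid revision make the subtraction report two "missing" revisions even though no CurrentRevision is empty.

```go
package main

import (
	"fmt"

	operatorv1 "github.com/openshift/api/operator/v1"
	"k8s.io/apimachinery/pkg/util/sets"
)

func main() {
	// Hypothetical: all three nodes are on the same (valid) revision 5.
	statuses := []operatorv1.NodeStatus{
		{NodeName: "master-0", CurrentRevision: 5},
		{NodeName: "master-1", CurrentRevision: 5},
		{NodeName: "master-2", CurrentRevision: 5},
	}

	observedRevisions := sets.New[int32]()
	for _, ns := range statuses {
		if ns.CurrentRevision > 0 {
			observedRevisions.Insert(ns.CurrentRevision)
		}
	}

	// The set collapses duplicates, so this prints 2, not 0.
	fmt.Println(len(statuses) - observedRevisions.Len())
}
```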
Force-pushed fc58d2d to 71dfa10 (Compare)
Force-pushed 71dfa10 to 4d280bd (Compare)
```diff
 	nodesWithEmptyRevision := false
 	for _, nodeStatus := range kas.Status.NodeStatuses {
-		observedRevisions.Insert(nodeStatus.CurrentRevision)
+		if nodeStatus.CurrentRevision > 0 {
+			observedRevisions.Insert(nodeStatus.CurrentRevision)
+		} else {
+			nodesWithEmptyRevision = true
+		}
 	}

 	if nodesWithEmptyRevision {
 		return false, fmt.Errorf("determining observed revisions in kubeapiservers.operator.openshift.io/cluster; some nodes do not have a valid CurrentRevision")
```
If we find one with an invalid revision, should we just return the error from within the loop, terminating it early?
As-is, I don't really see us gaining any benefit of continuing to loop once we've found at least one node with an invalid current revision.
Suggested change:

```diff
-	nodesWithEmptyRevision := false
-	for _, nodeStatus := range kas.Status.NodeStatuses {
-		if nodeStatus.CurrentRevision > 0 {
-			observedRevisions.Insert(nodeStatus.CurrentRevision)
-		} else {
-			nodesWithEmptyRevision = true
-		}
-	}
-
-	if nodesWithEmptyRevision {
-		return false, fmt.Errorf("determining observed revisions in kubeapiservers.operator.openshift.io/cluster; some nodes do not have a valid CurrentRevision")
+	for _, nodeStatus := range kas.Status.NodeStatuses {
+		if nodeStatus.CurrentRevision <= 0 {
+			return false, fmt.Errorf("determining observed revisions in kubeapiservers.operator.openshift.io/cluster; some nodes do not have a valid CurrentRevision")
+		}
+		observedRevisions.Insert(nodeStatus.CurrentRevision)
+	}
```
Of course -- now that we don't use the count this is much better 👍
This PR is to solve the separate issue I saw in another test: #798 (comment). Pre-merge tested this and PR #801 together within cluster-bot. #800 is already Step 1. At 10:51:14, the upgrade completed. Step 3: so the verification fails. @liouk
Added debug logging to investigate the issue found by @xingxingxia. /hold |
Force-pushed fb40473 to 90f2f82 (Compare)
Force-pushed 90f2f82 to 702bf57 (Compare)
Actionable comments posted: 0
🧹 Nitpick comments (1)
pkg/controllers/common/external_oidc.go (1)
79-120: Use a verbose log level for the new debug statements.

These `[debug-801]` messages now fire on every sync for each node and missing configmap at the default INFO verbosity, which will spam controller logs. Please gate them behind a higher verbosity level (e.g. `klog.V(4)`) or add an explicit verbosity check.

```diff
- klog.Infof("[debug-801] node '%s' is on revision %d", nodeStatus.NodeName, nodeStatus.CurrentRevision)
+ klog.V(4).Infof("[debug-801] node '%s' is on revision %d", nodeStatus.NodeName, nodeStatus.CurrentRevision)
@@
- klog.Infof("[debug-801] configmap auth-config-%d not found; informer HasSynced=%v", revision, c.kasNamespaceConfigMapsInformer.HasSynced())
+ klog.V(4).Infof("[debug-801] configmap auth-config-%d not found; informer HasSynced=%v", revision, c.kasNamespaceConfigMapsInformer.HasSynced())
@@
- klog.Infof("[debug-801] configmap config-%d does not contain expected OIDC config", revision)
+ klog.V(4).Infof("[debug-801] configmap config-%d does not contain expected OIDC config", revision)
```
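As a side note, if any of these messages ever needs expensive argument construction, klog/v2 also allows hoisting the verbosity check; a sketch, not part of this PR:

```go
// Evaluate the guard once; the body runs only at -v=4 or higher.
if klog.V(4).Enabled() {
	klog.Infof("[debug-801] node '%s' is on revision %d", nodeStatus.NodeName, nodeStatus.CurrentRevision)
}
```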
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (2)
- pkg/controllers/common/external_oidc.go (3 hunks)
- pkg/libs/endpointaccessible/endpoint_accessible_controller.go (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/libs/endpointaccessible/endpoint_accessible_controller.go
@liouk: The following tests failed, say /retest to rerun all failed tests:

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
No description provided.