operator: fix scale-up deadlock when cluster is unhealthy #1492
Open
…2896)

The v1 operator had a deadlock where an unhealthy cluster could not be recovered by adding more broker pods. When status.Restarting=true (set during any rolling update), shouldUpdate() returns nodePoolRestarting=true on every reconcile, causing runUpdate() to run unconditionally. runUpdate() gates on isClusterHealthy() before performing a rolling update, returning a RequeueAfterError when unhealthy. This skipped handleScaling() entirely, preventing CurrentReplicas from being updated and blocking the scale-up.

Fix: let handleScaling() run even when runUpdate() returns a RequeueAfterError. Scale-up (updating CurrentReplicas in status) has no dependency on cluster health; only decommission/scale-down, which is already gated inside handleScaling() itself, requires health checks. Rolling pod template updates remain gated on cluster health as before.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Author
@claude review plz
…NFRA-2896) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Summary
- When `status.restarting=true` (set during any rolling update), `shouldUpdate()` forces `runUpdate()` to run on every reconcile; `runUpdate()` calls `isClusterHealthy()` before performing a rolling update and returns a `RequeueAfterError` if unhealthy. This skipped `handleScaling()` entirely, preventing `CurrentReplicas` from being updated and blocking the scale-up.
- Fix: let `handleScaling()` run even when `runUpdate()` returns a `RequeueAfterError`; scale-up has no dependency on cluster health. Only decommission/scale-down (already gated inside `handleScaling()` itself) requires health checks.

Jira: CIAINFRA-2896
Test plan
- `"Should scale up a node pool even when cluster is unhealthy and in a restarting state"`: confirmed it fails before the fix (times out after 30s) and passes after (7s)
- […] (`task test:unit`)
- […] (`slord769/redpanda-operator:ciainfra-2896` built from this branch for post-fix validation)

🤖 Generated with Claude Code