operator: fix scale-up deadlock when cluster is unhealthy #1492

Open
simonlord wants to merge 2 commits into main from sl/ciainfra-2896-fix-scaleup-deadlock

simonlord commented Apr 29, 2026

Summary

  • The v1 operator could deadlock: an unhealthy cluster could not recover by adding more broker pods
  • Cause: when status.restarting=true (set during any rolling update), shouldUpdate() forces runUpdate() to run on every reconcile. runUpdate() calls isClusterHealthy() before performing a rolling update and returns a RequeueAfterError if the cluster is unhealthy. That early return skipped handleScaling() entirely, so CurrentReplicas was never updated and the scale-up was blocked
  • Fix: let handleScaling() run even when runUpdate() returns a RequeueAfterError. Scale-up has no dependency on cluster health; only decommission/scale-down, which is already gated inside handleScaling() itself, requires health checks
  • Rolling pod template updates remain gated on cluster health as before

Jira: CIAINFRA-2896

Test plan

  • Added a regression test, "Should scale up a node pool even when cluster is unhealthy and in a restarting state": confirmed it fails before the fix (times out after 30s) and passes after it (in 7s)
  • Full vectorized controller test suite passes (34/34 specs)
  • Full unit test suite passes (task test:unit)
  • Manually reproduced the deadlock on a running cluster using the pre-fix image; the image slord769/redpanda-operator:ciainfra-2896, built from this branch, is for post-fix validation

🤖 Generated with Claude Code

…2896)

The v1 operator had a deadlock where an unhealthy cluster could not be
recovered by adding more broker pods. When status.Restarting=true (set
during any rolling update), shouldUpdate() returns nodePoolRestarting=true
on every reconcile, causing runUpdate() to run unconditionally. runUpdate()
gates on isClusterHealthy() before performing a rolling update, returning a
RequeueAfterError when unhealthy. This skipped handleScaling() entirely,
preventing CurrentReplicas from being updated and blocking the scale-up.

Fix: let handleScaling() run even when runUpdate() returns a
RequeueAfterError. Scale-up (updating CurrentReplicas in status) has no
dependency on cluster health — only decommission/scale-down, which is
already gated inside handleScaling() itself, requires health checks.

Rolling pod template updates remain gated on cluster health as before.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

CLAassistant commented Apr 29, 2026

CLA assistant check
All committers have signed the CLA.

simonlord (Author) commented

@claude review plz

…NFRA-2896)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>