operator: fix scale-up deadlock when cluster is unhealthy by simonlord · Pull Request #1492 · redpanda-data/redpanda-operator

simonlord · 2026-04-29T10:43:36Z

Summary

The v1 operator had a deadlock where an unhealthy cluster could not recover by adding more broker pods
When status.restarting=true (set during any rolling update), shouldUpdate() forces runUpdate() to run on every reconcile. runUpdate() calls isClusterHealthy() before performing a rolling update and returns a RequeueAfterError if the cluster is unhealthy. That early return skipped handleScaling() entirely, so CurrentReplicas was never updated and the scale-up was blocked
Fix: let handleScaling() run even when runUpdate() returns a RequeueAfterError. Scale-up has no dependency on cluster health; only decommission/scale-down, which is already gated inside handleScaling() itself, requires health checks
Rolling pod template updates remain gated on cluster health as before

Test plan

Added failing test: "Should scale up a node pool even when cluster is unhealthy and in a restarting state" — confirmed it fails before the fix (times out after 30s) and passes after (7s)
Full vectorized controller test suite passes (34/34 specs)
Full unit test suite passes (task test:unit)
Manually reproduced the deadlock on a running cluster using the pre-fix image (the image slord769/redpanda-operator:ciainfra-2896 was built from this branch for post-fix validation)

🤖 Generated with Claude Code

…2896)

The v1 operator had a deadlock where an unhealthy cluster could not be recovered by adding more broker pods. When status.Restarting=true (set during any rolling update), shouldUpdate() returns nodePoolRestarting=true on every reconcile, causing runUpdate() to run unconditionally. runUpdate() gates on isClusterHealthy() before performing a rolling update, returning a RequeueAfterError when unhealthy. This skipped handleScaling() entirely, preventing CurrentReplicas from being updated and blocking the scale-up.

Fix: let handleScaling() run even when runUpdate() returns a RequeueAfterError. Scale-up (updating CurrentReplicas in status) has no dependency on cluster health; only decommission/scale-down, which is already gated inside handleScaling() itself, requires health checks. Rolling pod template updates remain gated on cluster health as before.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

CLAassistant · 2026-04-29T10:43:42Z

All committers have signed the CLA.

simonlord · 2026-04-29T10:43:52Z

@claude review plz

simonlord requested review from RafalKorepta, andrewstucki, chrisseto, gene-redpanda and hidalgopl as code owners April 29, 2026 10:43

operator: add changelog entry for scale-up deadlock fix (Part of CIAI…

5933779

…NFRA-2896) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

simonlord wants to merge 2 commits into main from sl/ciainfra-2896-fix-scaleup-deadlock
