
StretchCluster: add ghost node ejection acceptance test #1459

Merged
hidalgopl merged 1 commit into main from pb/ghost-node-ejection-acceptance-test on Apr 22, 2026

Conversation

@hidalgopl commented Apr 17, 2026

Summary

Adds a multicluster acceptance test that verifies Redpanda's ghost node ejection — the automatic decommission of a broker whose region becomes permanently unreachable. Deletes the k3d agent hosting one region and asserts that the surviving regions' rpk cluster health eventually reports 2 of 3 brokers.

What's in the change

  • New feature: stretch-cluster-ghost-node-ejection.feature — 3-region stretch cluster with aggressive autobalancing timeouts (avail=30s, auto-decom=60s) on Redpanda v26.1.5. Tagged @serial since it destroys infrastructure.
  • New steps (stretch_ghost_node_ejection.go): discover the controller region via rpk, pick a non-controller region to take offline via k3d node delete, and poll the remaining regions for the reduced node count (a sketch of the polling loop follows this list). Cleanup deletes the stale NotReady Node object.
  • Framework changes (stretch.go): pin each vcluster's workloads to a single k3d agent via sync.fromHost.nodes.selector.labels: kubernetes.io/hostname + virtual scheduler. Needed so k3d node delete atomically removes the broker — a partial outage doesn't trigger Redpanda's auto-decommission path in our vcluster setup.
  • Supporting: pre-pull v26.1.5, loft-sh/kubernetes:v1.33.4, and cert-manager v1.17.1 images in the Taskfile.
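For illustration, a minimal sketch of the health-polling logic, assuming a plain `rpk cluster health` invocation against a surviving region; the function name and the output parsing are illustrative, not the actual step implementation:

```go
package stretchtest

import (
	"context"
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// waitForBrokerCount polls `rpk cluster health` until the "All nodes"
// line lists exactly `want` brokers or the context expires. Parsing is
// deliberately loose; the exact output format varies by rpk version.
func waitForBrokerCount(ctx context.Context, want int) error {
	tick := time.NewTicker(10 * time.Second)
	defer tick.Stop()
	for {
		select {
		case <-ctx.Done():
			return fmt.Errorf("timed out waiting for %d brokers: %w", want, ctx.Err())
		case <-tick.C:
			out, err := exec.CommandContext(ctx, "rpk", "cluster", "health").CombinedOutput()
			if err != nil {
				continue // the surviving regions may still be converging
			}
			for _, line := range strings.Split(string(out), "\n") {
				if strings.HasPrefix(line, "All nodes:") {
					// e.g. "All nodes: [0 1]" -> two surviving brokers
					ids := strings.Fields(strings.Trim(strings.TrimPrefix(line, "All nodes:"), " []"))
					if len(ids) == want {
						return nil
					}
				}
			}
		}
	}
}
```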

🤖 Generated with Claude Code

@hidalgopl marked this pull request as draft on April 17, 2026 09:32
@hidalgopl force-pushed the pb/ghost-node-ejection-acceptance-test branch 3 times, most recently from 006f17c to ea82278 on April 17, 2026 12:13
@hidalgopl marked this pull request as ready for review on April 17, 2026 12:45

@andrewstucki left a comment


So the GitHub UI is having a hard time loading the new steps file for me, but I took a look directly in your branch. A few things:

  1. Would it be possible to just simulate a regional outage via disconnecting the vcluster instance and killing the broker pods like I did in the region-killing test? That would make it so that you could run this test in parallel with the other tests rather than having to physically tear down a worker node.
  2. If you still want to do the worker node teardown, I believe there are a bunch of helpers in the harpoon framework for doing so without issuing k3d commands directly in case we ever get around to implementing the providers for running the tests against real cloud infrastructure. Take a look at some of the pvc unbinder tests that leverage the same sort of behavior and re-use what you can from there if need be.

@andrewstucki

GitHub is having issues with me trying to edit my comments as well, but we could also potentially just merge this test with the regional outage test if we want -- just add the configuration for ejection and the verification step before bringing the region back online.

Review thread on the feature file's timeout description:

> node after the configured timeouts elapse.
>
> The cluster is configured with aggressive timeouts:
> - partition_autobalancing_node_availability_timeout_sec: 30
@joe-redpanda


Be careful scaling these parameters. The partition balancer is an old component that makes a lot of implicit assumptions about the relative scale of these parameters.

I crystallized the guidance in validators.cc, which is as follows:

Rules:

- node_status_interval must be less than partition_autobalancing_tick_interval_ms
- node_status_interval * 7 must be less than partition_autobalancing_node_availability_timeout_sec
- health_monitor_tick_interval must be less than partition_autobalancing_tick_interval_ms
- partition_autobalancing_node_availability_timeout_sec must be less than partition_autobalancing_node_autodecommission_timeout_sec

My recommendation is that if you're going to scale one of the parameters, try to scale all of them accordingly.

For your situation:

- have node status be 5 seconds or so
- have the partition balancer tick interval be 10s or so
- have node availability be 45 seconds or so
- have node autodecommission be 90s or so

Total ejection time is still 90s here, because the underlying logic doesn't check that the node is unavailable and then start the timer on auto ejection; it's just 90s after a quorum agrees that the node has been missing for too long.
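As a minimal sketch, the scaled values would look like this as Redpanda cluster properties; units are inferred from the property names (ms unless the name ends in _sec), and the feature file's exact mechanism for applying them may differ:

```yaml
# Scaled together per the validators.cc rules above; values from the review.
node_status_interval: 5000                                     # 5s, in ms
health_monitor_tick_interval: 5000                             # 5s, in ms; < tick interval
partition_autobalancing_tick_interval_ms: 10000                # 10s; > node_status_interval
partition_autobalancing_node_availability_timeout_sec: 45      # > 7 * node_status_interval (35s)
partition_autobalancing_node_autodecommission_timeout_sec: 90  # > availability timeout
```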

@hidalgopl (author)


Thanks! Adjusted the parameters as you suggested.

Adds a multicluster acceptance test that verifies Redpanda's ghost node
ejection — the automatic decommission of a broker whose region becomes
permanently unreachable.

Simulates a regional outage by shutting down (via harpoon's provider-
agnostic `t.ShutdownNode`) the host worker node that the vcluster's
workloads are pinned to. Harpoon cleanly re-adds the node with the same
name and reloads the imported images onto it during cleanup, so
subsequent scenarios see the cluster at its original size.

## Response to review feedback

On @andrewstucki's suggestions:

- Using the existing `takeRegionOffline` (scale vcluster StatefulSet to 0
  + delete synced pods) did NOT reliably trigger Redpanda's
  auto-decommission path in our vcluster-on-k3d setup. The partition
  balancer saw the broker as unavailable but its voting consistently
  produced empty decommission candidates — even with Joe's scaled
  timeouts. Tearing down the host worker node is the most faithful
  simulation of a regional outage and the one that reliably triggers
  auto-decom, so this test keeps the node-teardown approach.
- Switched the raw `k3d node delete` + `k3d node create` machinery to
  harpoon's `t.ShutdownNode`, which uses the Provider interface
  (DeleteNode / AddNode / LoadImages) so the test remains
  provider-agnostic for future cloud-infra runs; the contract is
  sketched after this list.
- Not merged into the regional-outage scenario: that scenario later
  brings the region back online and expects the operator in the returned
  region to resume reconciling. After an auto-decommission the returned
  broker's old identity is gone, so restoring would require a fresh
  operator deploy and re-registration — meaningfully different from the
  "transient outage" flow that scenario tests. Kept as a separate
  feature.
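A hypothetical sketch of that provider contract: only the method names (DeleteNode / AddNode / LoadImages) come from the bullet above; the package name and signatures are assumptions, not harpoon's actual API.

```go
package sketch

import "context"

// Provider is a guess at the harpoon abstraction the test leans on.
// It lets a step tear down and restore infrastructure without issuing
// k3d commands directly, so other providers (e.g. real cloud nodes)
// can back the same steps.
type Provider interface {
	DeleteNode(ctx context.Context, name string) error
	AddNode(ctx context.Context, name string) error
	LoadImages(ctx context.Context, nodeName string, images ...string) error
}
```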

On @joe-redpanda's suggestion:

- Scaled node_status_interval, health_monitor_tick_interval,
  partition_autobalancing_tick_interval_ms,
  partition_autobalancing_node_availability_timeout_sec and
  partition_autobalancing_node_autodecommission_timeout_sec together
  per validators.cc rules. Total expected ejection time is ~90s from
  the moment the node goes down.

## Supporting changes

- `stretch.go`: pin each vcluster's workloads (control plane + synced
  pods) to a single host worker node via
  `sync.fromHost.nodes.selector.labels: kubernetes.io/hostname` +
  virtual scheduler. Needed so shutting down the host node atomically
  kills the broker — a partial outage doesn't trigger the
  auto-decommission voting path. (A sketch of this pinning follows the
  list below.)
- `stretch.go`: skip offline regions in `ApplyAll`, `DeleteAll`,
  `DeleteNodepools` to avoid cleanup errors on unreachable vclusters.
- `Taskfile.yml`: pre-pull `redpandadata/redpanda:v26.1.5`,
  `ghcr.io/loft-sh/kubernetes:v1.33.4`, and cert-manager `v1.17.1`
  images so the test doesn't hit dockerhub rate limits on first run.
- `acceptance/main_test.go`: add `v26.1.5` to `WithImportedImages` so
  k3d loads it into the cluster.
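For illustration, a minimal sketch of that pinning in vcluster values form; the schema paths are assumptions based on recent vcluster releases, the hostname is illustrative, and `stretch.go` may render this differently:

```yaml
# Sync only this region's dedicated host node into the vcluster and let
# the virtual scheduler place all synced pods on it, so one host-node
# shutdown takes out the entire region at once.
sync:
  fromHost:
    nodes:
      enabled: true
      selector:
        labels:
          kubernetes.io/hostname: k3d-acceptance-agent-1
controlPlane:
  advanced:
    virtualScheduler:
      enabled: true
  statefulSet:
    scheduling:
      # Pin the vcluster control plane itself to the same host node.
      nodeSelector:
        kubernetes.io/hostname: k3d-acceptance-agent-1
```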

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hidalgopl force-pushed the pb/ghost-node-ejection-acceptance-test branch from ea82278 to 3fa9721 on April 20, 2026 09:16
@hidalgopl

Replying to @andrewstucki's review comment above:
  1. Yes, I tried that initially, but couldn't make the decommission work that way. I suspect it's because the broker's Kubernetes Service is still there with publishNotReadyAddresses: true, but I didn't dig deeper into it. I tried various ways to make it work without removing the k3s node, but none of them worked for me.

  2. Good point, I made changes to use harpoon's helpers. Also, I decided not to re-add the node for now: I marked the test as @serial on the assumption that it will run as the last one in the suite. Initially I thought about re-adding it, but that would also require loading all Docker images onto the node again, plus potentially re-deploying the operator / vcluster there. I believe the most pragmatic solution for now is to not re-add it and use @serial.

  3. I also tried to merge it with the regional outage test; however:
    a) auto-decommission doesn't work with the current takeRegionDown implementation;
    b) if we changed takeRegionDown to remove the k3s node there, we'd later need to re-add it and re-deploy everything, which adds a lot of time to the suite.

cc @andrewstucki

@hidalgopl requested a review from andrewstucki on April 20, 2026 12:50

@andrewstucki left a comment


LGTM for now -- eventually we may want to replace the "k3dNode" field naming, etc. with just "nodeName", since we're technically not guaranteed to be running this against k3d, but I'm fine if we change that later. If you want to, feel free to merge as-is, or just swap the field name and then merge.

@hidalgopl merged commit bfcb5e6 into main on Apr 22, 2026
12 checks passed