
StretchCluster: add ghost node ejection acceptance test #1459

Merged
hidalgopl merged 1 commit into main from pb/ghost-node-ejection-acceptance-test on Apr 22, 2026

Conversation

@hidalgopl commented Apr 17, 2026

Summary

Adds a multicluster acceptance test that verifies Redpanda's ghost node ejection — the automatic decommission of a broker whose region becomes permanently unreachable. Deletes the k3d agent hosting one region and asserts that the surviving regions' rpk cluster health eventually reports 2 of 3 brokers.

What's in the change

  • New feature: stretch-cluster-ghost-node-ejection.feature — 3-region stretch cluster with aggressive autobalancing timeouts (avail=30s, auto-decom=60s) on Redpanda v26.1.5. Tagged @serial since it destroys infrastructure.
  • New steps (stretch_ghost_node_ejection.go): discover the controller region via rpk, pick a non-controller region to take offline via k3d node delete, and poll the remaining regions for the reduced node count (a sketch of the polling loop follows this list). Cleanup deletes the stale NotReady Node object.
  • Framework changes (stretch.go): pin each vcluster's workloads to a single k3d agent via sync.fromHost.nodes.selector.labels: kubernetes.io/hostname + virtual scheduler. Needed so k3d node delete atomically removes the broker — a partial outage doesn't trigger Redpanda's auto-decommission path in our vcluster setup.
  • Supporting: pre-pull v26.1.5, loft-sh/kubernetes:v1.33.4, and cert-manager v1.17.1 images in the Taskfile.
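For illustration, a minimal sketch of the health-polling logic, assuming a plain `rpk cluster health` invocation against a surviving region; the function name and the output parsing are illustrative, not the actual step implementation:

```go
package stretchtest

import (
	"context"
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// waitForBrokerCount polls `rpk cluster health` until the "All nodes"
// line lists exactly `want` brokers or the context expires. Parsing is
// deliberately loose; the exact output format varies by rpk version.
func waitForBrokerCount(ctx context.Context, want int) error {
	tick := time.NewTicker(10 * time.Second)
	defer tick.Stop()
	for {
		select {
		case <-ctx.Done():
			return fmt.Errorf("timed out waiting for %d brokers: %w", want, ctx.Err())
		case <-tick.C:
			out, err := exec.CommandContext(ctx, "rpk", "cluster", "health").CombinedOutput()
			if err != nil {
				continue // the surviving regions may still be converging
			}
			for _, line := range strings.Split(string(out), "\n") {
				if strings.HasPrefix(line, "All nodes:") {
					// e.g. "All nodes: [0 1]" -> two surviving brokers
					ids := strings.Fields(strings.Trim(strings.TrimPrefix(line, "All nodes:"), " []"))
					if len(ids) == want {
						return nil
					}
				}
			}
		}
	}
}
```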

🤖 Generated with Claude Code

@hidalgopl marked this pull request as draft on April 17, 2026 09:32
@hidalgopl force-pushed the pb/ghost-node-ejection-acceptance-test branch 3 times, most recently from 006f17c to ea82278 on April 17, 2026 12:13
@hidalgopl marked this pull request as ready for review on April 17, 2026 12:45

@andrewstucki left a comment


So the GitHub UI is having a hard time loading the new steps file for me, but I took a look directly in your branch. A few things:

  1. Would it be possible to just simulate a regional outage via disconnecting the vcluster instance and killing the broker pods like I did in the region-killing test? That would make it so that you could run this test in parallel with the other tests rather than having to physically tear down a worker node.
  2. If you still want to do the worker node teardown, I believe there are a bunch of helpers in the harpoon framework for doing so without issuing k3d commands directly in case we ever get around to implementing the providers for running the tests against real cloud infrastructure. Take a look at some of the pvc unbinder tests that leverage the same sort of behavior and re-use what you can from there if need be.

@andrewstucki

GitHub is having issues with me trying to edit my comments as well, but we could also potentially just merge this test with the regional outage test if we want -- just add the configuration for ejection and the verification step before bringing the region back online.

Review thread on the feature file's timeout description:

> node after the configured timeouts elapse.
>
> The cluster is configured with aggressive timeouts:
> - partition_autobalancing_node_availability_timeout_sec: 30
@joe-redpanda


Be careful scaling these parameters. The partition balancer is an old component that makes a lot of implicit assumptions about the relative scale of these parameters.

I crystallized the guidance in validators.cc, which is as follows:

Rules:

- node_status_interval must be less than partition_autobalancing_tick_interval_ms
- node_status_interval * 7 must be less than partition_autobalancing_node_availability_timeout_sec
- health_monitor_tick_interval must be less than partition_autobalancing_tick_interval_ms
- partition_autobalancing_node_availability_timeout_sec must be less than partition_autobalancing_node_autodecommission_timeout_sec

My recommendation is that if you're going to scale one of the parameters, try to scale all of them accordingly.

For your situation:

- have node status be 5 seconds or so
- have the partition balancer tick interval be 10s or so
- have node availability be 45 seconds or so
- have node autodecommission be 90s or so

Total ejection time is still 90s here, because the underlying logic doesn't check that the node is unavailable and then start the timer on auto ejection; it's just 90s after a quorum agrees that the node has been missing for too long.
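As a minimal sketch, the scaled values would look like this as Redpanda cluster properties; units are inferred from the property names (ms unless the name ends in _sec), and the feature file's exact mechanism for applying them may differ:

```yaml
# Scaled together per the validators.cc rules above; values from the review.
node_status_interval: 5000                                     # 5s, in ms
health_monitor_tick_interval: 5000                             # 5s, in ms; < tick interval
partition_autobalancing_tick_interval_ms: 10000                # 10s; > node_status_interval
partition_autobalancing_node_availability_timeout_sec: 45      # > 7 * node_status_interval (35s)
partition_autobalancing_node_autodecommission_timeout_sec: 90  # > availability timeout
```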

@hidalgopl (author)


Thanks! Adjusted the parameters as you suggested.

Adds a multicluster acceptance test that verifies Redpanda's ghost node
ejection — the automatic decommission of a broker whose region becomes
permanently unreachable.

Simulates a regional outage by shutting down (via harpoon's provider-
agnostic `t.ShutdownNode`) the host worker node that the vcluster's
workloads are pinned to. Harpoon cleanly re-adds the node with the same
name and reloads the imported images onto it during cleanup, so
subsequent scenarios see the cluster at its original size.

## Response to review feedback

On @andrewstucki's suggestions:

- Using the existing `takeRegionOffline` (scale vcluster StatefulSet to 0
  + delete synced pods) did NOT reliably trigger Redpanda's
  auto-decommission path in our vcluster-on-k3d setup. The partition
  balancer saw the broker as unavailable but its voting consistently
  produced empty decommission candidates — even with Joe's scaled
  timeouts. Tearing down the host worker node is the most faithful
  simulation of a regional outage and the one that reliably triggers
  auto-decom, so this test keeps the node-teardown approach.
- Switched the raw `k3d node delete` + `k3d node create` machinery to
  harpoon's `t.ShutdownNode`, which uses the Provider interface
  (DeleteNode / AddNode / LoadImages) so the test remains
  provider-agnostic for future cloud-infra runs; the contract is
  sketched after this list.
- Not merged into the regional-outage scenario: that scenario later
  brings the region back online and expects the operator in the returned
  region to resume reconciling. After an auto-decommission the returned
  broker's old identity is gone, so restoring would require a fresh
  operator deploy and re-registration — meaningfully different from the
  "transient outage" flow that scenario tests. Kept as a separate
  feature.
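A hypothetical sketch of that provider contract: only the method names (DeleteNode / AddNode / LoadImages) come from the bullet above; the package name and signatures are assumptions, not harpoon's actual API.

```go
package sketch

import "context"

// Provider is a guess at the harpoon abstraction the test leans on.
// It lets a step tear down and restore infrastructure without issuing
// k3d commands directly, so other providers (e.g. real cloud nodes)
// can back the same steps.
type Provider interface {
	DeleteNode(ctx context.Context, name string) error
	AddNode(ctx context.Context, name string) error
	LoadImages(ctx context.Context, nodeName string, images ...string) error
}
```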

On @joe-redpanda's suggestion:

- Scaled node_status_interval, health_monitor_tick_interval,
  partition_autobalancing_tick_interval_ms,
  partition_autobalancing_node_availability_timeout_sec and
  partition_autobalancing_node_autodecommission_timeout_sec together
  per validators.cc rules. Total expected ejection time is ~90s from
  the moment the node goes down.

## Supporting changes

- `stretch.go`: pin each vcluster's workloads (control plane + synced
  pods) to a single host worker node via
  `sync.fromHost.nodes.selector.labels: kubernetes.io/hostname` +
  virtual scheduler. Needed so shutting down the host node atomically
  kills the broker — a partial outage doesn't trigger the
  auto-decommission voting path. (A sketch of this pinning follows the
  list below.)
- `stretch.go`: skip offline regions in `ApplyAll`, `DeleteAll`,
  `DeleteNodepools` to avoid cleanup errors on unreachable vclusters.
- `Taskfile.yml`: pre-pull `redpandadata/redpanda:v26.1.5`,
  `ghcr.io/loft-sh/kubernetes:v1.33.4`, and cert-manager `v1.17.1`
  images so the test doesn't hit dockerhub rate limits on first run.
- `acceptance/main_test.go`: add `v26.1.5` to `WithImportedImages` so
  k3d loads it into the cluster.
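For illustration, a minimal sketch of that pinning in vcluster values form; the schema paths are assumptions based on recent vcluster releases, the hostname is illustrative, and `stretch.go` may render this differently:

```yaml
# Sync only this region's dedicated host node into the vcluster and let
# the virtual scheduler place all synced pods on it, so one host-node
# shutdown takes out the entire region at once.
sync:
  fromHost:
    nodes:
      enabled: true
      selector:
        labels:
          kubernetes.io/hostname: k3d-acceptance-agent-1
controlPlane:
  advanced:
    virtualScheduler:
      enabled: true
  statefulSet:
    scheduling:
      # Pin the vcluster control plane itself to the same host node.
      nodeSelector:
        kubernetes.io/hostname: k3d-acceptance-agent-1
```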

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hidalgopl force-pushed the pb/ghost-node-ejection-acceptance-test branch from ea82278 to 3fa9721 on April 20, 2026 09:16
@hidalgopl

Replying to @andrewstucki's review comment above:
  1. Yes, I tried that initially, but couldn't make the decommission work that way. I suspect it's because the broker's Kubernetes Service is still there with publishNotReadyAddresses: true, but I didn't dig deeper into it. I tried various ways to make it work without removing the k3s node, but none of them worked for me.

  2. Good point, I made changes to use harpoon's helpers. Also, I decided not to re-add the node for now: I marked the test as @serial on the assumption that it will run as the last one in the suite. Initially I thought about re-adding it, but that would also require loading all Docker images onto the node again, plus potentially re-deploying the operator / vcluster there. I believe the most pragmatic solution for now is to not re-add it and use @serial.

  3. I also tried to merge it with the regional outage test; however:
    a) auto-decommission doesn't work with the current takeRegionDown implementation;
    b) if we changed takeRegionDown to remove the k3s node there, we'd later need to re-add it and re-deploy everything, which adds a lot of time to the suite.

cc @andrewstucki

@hidalgopl requested a review from andrewstucki on April 20, 2026 12:50

@andrewstucki left a comment


LGTM for now -- eventually we may want to replace the "k3dNode" field naming, etc. with just "nodeName", since we're technically not guaranteed to be running this against k3d, but I'm fine if we change that later. If you want to, feel free to merge as-is, or just swap the field name and then merge.

@hidalgopl merged commit bfcb5e6 into main on Apr 22, 2026
12 checks passed