StretchCluster: add ghost node ejection acceptance test #1459
(force-pushed from 006f17c to ea82278)
andrewstucki left a comment:
So the GitHub UI is having a hard time loading the new steps file for me, but I took a look directly in your branch. A few things:
- Would it be possible to just simulate a regional outage via disconnecting the vcluster instance and killing the broker pods like I did in the region-killing test? That would make it so that you could run this test in parallel with the other tests rather than having to physically tear down a worker node.
- If you still want to do the worker node teardown, I believe there are a bunch of helpers in the harpoon framework for doing so without issuing k3d commands directly in case we ever get around to implementing the providers for running the tests against real cloud infrastructure. Take a look at some of the pvc unbinder tests that leverage the same sort of behavior and re-use what you can from there if need be.
GitHub is having issues with me trying to edit my comments as well, but we could also potentially just merge this test with the regional outage test if we want -- just add the configuration for ejection and the verification step before bringing the region back online.
> node after the configured timeouts elapse.
>
> The cluster is configured with aggressive timeouts:
> - partition_autobalancing_node_availability_timeout_sec: 30
joe-redpanda left a comment:
Be careful scaling these parameters. The partition balancer is an old component that makes a lot of implicit assumptions about the relative scale of these parameters.
I crystallized the guidance in validators.cc, which is as follows:

Rules:
- `node_status_interval` must be less than `partition_autobalancing_tick_interval_ms`
- `node_status_interval * 7` must be less than `partition_autobalancing_node_availability_timeout_sec`
- `health_monitor_tick_interval` must be less than `partition_autobalancing_tick_interval_ms`
- `partition_autobalancing_node_availability_timeout_sec` must be less than `partition_autobalancing_node_autodecommission_timeout_sec`
My recommendation is that if you're going to scale one of the parameters, scale all of them accordingly. For your situation:
- node status interval: ~5 seconds
- partition balancer tick interval: ~10 seconds
- node availability timeout: ~45 seconds
- node autodecommission timeout: ~90 seconds
Total ejection time is still ~90s here, because the underlying logic doesn't wait out the node-availability timeout and then start a separate timer for auto ejection; it's just 90s after a quorum agrees that the node has been missing too long.
Thanks! Adjusted the parameters as you suggested.
Adds a multicluster acceptance test that verifies Redpanda's ghost node ejection: the automatic decommission of a broker whose region becomes permanently unreachable. Simulates a regional outage by shutting down (via harpoon's provider-agnostic `t.ShutdownNode`) the host worker node that the vcluster's workloads are pinned to. Harpoon cleanly re-adds the node with the same name and reloads the imported images onto it during cleanup, so subsequent scenarios see the cluster at its original size.

## Response to review feedback

On @andrewstucki's suggestions:

- Using the existing `takeRegionOffline` (scale the vcluster StatefulSet to 0 and delete synced pods) did NOT reliably trigger Redpanda's auto-decommission path in our vcluster-on-k3d setup. The partition balancer saw the broker as unavailable, but its voting consistently produced empty decommission candidates, even with Joe's scaled timeouts. Tearing down the host worker node is the most faithful simulation of a regional outage and the one that reliably triggers auto-decommission, so this test keeps the node-teardown approach.
- Switched the raw `k3d node delete` + `k3d node create` machinery to harpoon's `t.ShutdownNode`, which uses the Provider interface (DeleteNode / AddNode / LoadImages), so the test remains provider-agnostic for future cloud-infra runs.
- Not merged into the regional-outage scenario: that scenario later brings the region back online and expects the operator in the returned region to resume reconciling. After an auto-decommission the returned broker's old identity is gone, so restoring would require a fresh operator deploy and re-registration, which is meaningfully different from the "transient outage" flow that scenario tests. Kept as a separate feature.

On @joe-redpanda's suggestion:

- Scaled `node_status_interval`, `health_monitor_tick_interval`, `partition_autobalancing_tick_interval_ms`, `partition_autobalancing_node_availability_timeout_sec`, and `partition_autobalancing_node_autodecommission_timeout_sec` together per the validators.cc rules. Total expected ejection time is ~90s from the moment the node goes down.

## Supporting changes

- `stretch.go`: pin each vcluster's workloads (control plane + synced pods) to a single host worker node via `sync.fromHost.nodes.selector.labels: kubernetes.io/hostname` plus the virtual scheduler. Needed so shutting down the host node atomically kills the broker; a partial outage doesn't trigger the auto-decommission voting path.
- `stretch.go`: skip offline regions in `ApplyAll`, `DeleteAll`, and `DeleteNodepools` to avoid cleanup errors on unreachable vclusters.
- `Taskfile.yml`: pre-pull the `redpandadata/redpanda:v26.1.5`, `ghcr.io/loft-sh/kubernetes:v1.33.4`, and cert-manager `v1.17.1` images so the test doesn't hit Docker Hub rate limits on first run.
- `acceptance/main_test.go`: add `v26.1.5` to `WithImportedImages` so k3d loads it into the cluster.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
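For reference, the scaled values from the thread could be expressed as a cluster-config fragment like the following (a hedged sketch: the key names are the ones quoted in the review, and the millisecond units on the first two keys are an assumption):

```yaml
# Sketch of the scaled balancer timeouts from the review thread.
node_status_interval: 5000                                    # ~5s (assumed ms)
health_monitor_tick_interval: 5000                            # ~5s (assumed ms)
partition_autobalancing_tick_interval_ms: 10000               # 10s
partition_autobalancing_node_availability_timeout_sec: 45
partition_autobalancing_node_autodecommission_timeout_sec: 90
```

These values satisfy all four validators.cc rules: 5s < 10s, 7 × 5s = 35s < 45s, and 45s < 90s.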
(force-pushed from ea82278 to 3fa9721)
andrewstucki left a comment:
LGTM for now. Eventually we may want to rename the "k3dNode" field to just "nodeName", since we're technically not guaranteed to be running this against k3d, but I'm fine if we change that later. If you want to, feel free to merge, or just swap the field name and then merge.
## Summary
Adds a multicluster acceptance test that verifies Redpanda's ghost node ejection: the automatic decommission of a broker whose region becomes permanently unreachable. Deletes the k3d agent hosting one region and asserts that the surviving regions' `rpk cluster health` eventually reports 2 of 3 brokers.

## What's in the change

- `stretch-cluster-ghost-node-ejection.feature`: a 3-region stretch cluster with aggressive autobalancing timeouts (avail=30s, auto-decom=60s) on Redpanda v26.1.5. Tagged `@serial` since it destroys infrastructure.
- Steps (`stretch_ghost_node_ejection.go`): discover the controller region via `rpk`, pick a non-controller region to take offline via `k3d node delete`, and poll the remaining regions for the reduced node count. Cleanup deletes the stale `NotReady` `Node` object.
- Pinning (`stretch.go`): pin each vcluster's workloads to a single k3d agent via `sync.fromHost.nodes.selector.labels: kubernetes.io/hostname` plus the virtual scheduler. Needed so `k3d node delete` atomically removes the broker; a partial outage doesn't trigger Redpanda's auto-decommission path in our vcluster setup.
- Pre-pull the Redpanda `v26.1.5`, `loft-sh/kubernetes:v1.33.4`, and cert-manager `v1.17.1` images in the Taskfile.

🤖 Generated with Claude Code
stretch-cluster-ghost-node-ejection.feature— 3-region stretch cluster with aggressive autobalancing timeouts (avail=30s, auto-decom=60s) on Redpanda v26.1.5. Tagged@serialsince it destroys infrastructure.stretch_ghost_node_ejection.go): discover the controller region viarpk, pick a non-controller region to take offline viak3d node delete, and poll the remaining regions for the reduced node count. Cleanup deletes the staleNotReadyNode object.stretch.go): pin each vcluster's workloads to a single k3d agent viasync.fromHost.nodes.selector.labels: kubernetes.io/hostname+ virtual scheduler. Needed sok3d node deleteatomically removes the broker — a partial outage doesn't trigger Redpanda's auto-decommission path in our vcluster setup.v26.1.5,loft-sh/kubernetes:v1.33.4, and cert-managerv1.17.1images in the Taskfile.🤖 Generated with Claude Code