
[FR] multicluster operator: wire the decommissioning controller (auto-PVC cleanup after broker decom) #1494

@david-yu


Summary

The single-cluster operator binary (operator/cmd/run/run.go) ships two storage-cleanup controllers:

  • pvcunbinder.Controller (deletes PVCs of pods stuck Pending due to Node deletion)
  • decommissioning.Controller (handles broker decom + storage cleanup)

The multicluster operator binary (operator/cmd/multicluster/multicluster.go) only wires pvcunbinder.MulticlusterController. The decommissioning controller is not registered, so auto-decommission events leave behind orphaned PVCs (and, even with reclaimPolicy: Delete storage classes, orphaned managed disks once the StatefulSet eventually shrinks, since nothing ever deletes the PVCs).

This feature request asks for the decommissioning controller to be wired into the multicluster operator the same way it is in the single-cluster operator, so that StretchCluster + NodePool deployments can rely on the operator for the same broker-lifecycle storage cleanup that single-cluster Redpanda chart users get out of the box.

Why it matters

Concrete failure mode we hit running Demo B (regional failure + failover-region capacity injection) from https://github.com/david-yu/redpanda-operator-stretch-beta:

  1. Region down → cluster auto-decommissions broker IDs 0, 1 (their replicas drained onto the failover region's brokers).
  2. Region restored (via az aks stop / start, EC2 instance recovery, etc.). Node objects come back, the StatefulSet schedules pods on the new nodes, and the pods bind to the same datadir-redpanda-rp-east-{0,1} PVCs because Azure managed disks survive aks stop intact.
  3. The redpanda containers start, find their old node_uuid in /var/lib/redpanda/data, and try to rejoin as the previously-decommissioned IDs. The cluster rejects them:
    bad_rejoin: trying to rejoin with same ID and UUID as a decommissioned node
    
    Pods then loop indefinitely at 1/2 Running.

PVCUnbinder doesn't help here — it's keyed on pods stuck Pending (Node deletion + nodeAffinity break), but in aks stop/start the disks survive and the new pods immediately bind, so they never sit Pending. They just restart-loop on bad_rejoin.

The single-cluster decommissioning controller is the right component to handle this: when a broker is decommissioned (either by the cluster's own partition autobalancer, or by rpk redpanda admin brokers decommission), it should delete the broker's PVC. With reclaimPolicy: Delete on the StorageClass, the disk is reaped, and the next time the StatefulSet creates a pod with that ordinal, it gets a fresh PVC + disk + node UUID and joins as a new broker ID.
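For reference, a minimal StorageClass sketch with the reclaim policy this relies on; the name and the Azure CSI provisioner below are illustrative (matching the AKS repro environment), not taken from the repro repo:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: redpanda-datadir           # hypothetical name
provisioner: disk.csi.azure.com    # Azure managed disks, as in the AKS repro
reclaimPolicy: Delete              # the backing disk is reaped once the PVC/PV is deleted
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true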

Workaround today

Manual two-step recovery whenever a region restores after auto-decom:

kubectl --context rp-<lost-region> -n redpanda delete pvc datadir-redpanda-<pool>-<id>
kubectl --context rp-<lost-region> -n redpanda delete pod redpanda-<pool>-<id> --grace-period=0 --force

This is operationally fragile and not how the single-cluster path documents recovery.

Proposed change

Wire the existing operator/internal/controller/decommissioning/ controller into operator/cmd/multicluster/multicluster.go alongside the PVCUnbinder, with the same flag surface used by the single-cluster path (--additional-controllers=decommission style, or unconditionally on, matching whatever the rest of the multicluster controllers do). The chart's additionalCmdFlags already passes through, so users can opt in via helm values.
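If the controller ends up gated behind a flag, the chart-side opt-in could look roughly like the sketch below. additionalCmdFlags is the pass-through mentioned above; the exact flag name and value are assumptions until the wiring is decided:

# operator chart values.yaml (flag name/value assumed, not final)
additionalCmdFlags:
  - --additional-controllers=decommission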

Specifically:

  • cmd/multicluster/multicluster.go: add an import of operator/internal/controller/decommissioning and a SetupWithMultiClusterManager (or equivalent) call alongside the existing PVCUnbinder block.
  • Multicluster RBAC: extend the operator's ClusterRole/Role to include the same verbs the single-cluster decommissioning controller needs on PVCs and PVs (the chart already has pvcunbinder.ClusterRole.yaml; a sibling decommission.ClusterRole.yaml exists for the single-cluster path and can be mirrored, as sketched below).
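A rough sketch of that RBAC addition, assuming the multicluster decommissioning controller needs roughly the same access as the single-cluster one (the rule list below is an assumption to be checked against decommission.ClusterRole.yaml, not copied from it):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: redpanda-operator-decommission   # hypothetical name
rules:
  # assumed: read pods/StatefulSets to map decommissioned broker IDs to ordinals
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["statefulsets"]
    verbs: ["get", "list", "watch"]
  # assumed: delete the decommissioned broker's PVC; the PV/disk then follows via reclaimPolicy
  - apiGroups: [""]
    resources: ["persistentvolumeclaims"]
    verbs: ["get", "list", "watch", "delete"]
  - apiGroups: [""]
    resources: ["persistentvolumes"]
    verbs: ["get", "list", "watch"]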

If there's a reason the decommissioning controller isn't multicluster-safe today (e.g. it assumes a single-cluster Redpanda CR, not a StretchCluster + NodePool pair), happy to scope the work — would be useful to know what blockers exist.

Environment

Repo with full repro: https://github.com/david-yu/redpanda-operator-stretch-beta — Demo B exercises the auto-decommission + region restore flow.
