multicluster operator: NodePool / StretchCluster controllers don't reconcile on a newly-added peer until the operator Deployment is restarted #1493

Summary

When a 4th cluster is added to an existing 3-peer multicluster deployment, the new peer's operator pod comes up Running and joins the raft (often as StateLeader), but the NodePool and StretchCluster controllers never fire on that peer's local resources; they sit at conditions[*].status: Unknown with reason: NotReconciled / message: "Waiting for controller" indefinitely. A kubectl rollout restart deployment <operator> on the new peer immediately unblocks reconciliation. We've also observed a related variant of the same bug on the existing peers when the raft join/election stalls (see "Variant we hit on existing peers" below).

This looks like an initialization race between controller-runtime informer/cache setup and multicluster raft membership settling: the controllers register before raft is ready, then never recover from the empty-cache state without a process restart.

Operator version: v26.2.1-beta.1 (multicluster build).

Reproduction

The full multicluster setup is captured at https://github.com/david-yu/redpanda-operator-stretch-beta — Demo B's failover-region capacity-injection flow exercises this path. Minimal repro:

  1. Stand up the 3-peer stretch cluster per Steps 1–7 in the README:

    • 3 K8s clusters (rp-east, rp-west, rp-eu) with cross-region pod-IP routability
    • rpk k8s multicluster bootstrap --context rp-east --context rp-west --context rp-eu --namespace redpanda --loadbalancer
    • helm install <ctx> redpanda/operator --version 26.2.1-beta.1 --devel per cluster, with matching multicluster.peers and fullnameOverride: <ctx> values
    • Apply a StretchCluster and one NodePool per cluster; wait until the StretchCluster reports Ready=True / Healthy=True.
  2. Provision a 4th K8s cluster (rp-failover, separate region) with cross-region pod-IP + LB connectivity to the existing three.

  3. Re-run bootstrap with all 4 contexts (idempotent on the existing 3, generates fresh state for rp-failover):

    rpk k8s multicluster bootstrap \
      --context rp-east --context rp-west --context rp-eu --context rp-failover \
      --namespace redpanda --loadbalancer

  4. Render an rp-failover helm values file with multicluster.name: rp-failover and a 4-entry multicluster.peers block including all four LB addresses (an illustrative shape is sketched just after this list). Render matching 4-peer values for the existing three.

  5. helm install rp-failover redpanda/operator --version 26.2.1-beta.1 --devel -f values-rp-failover.yaml -n redpanda. The operator pod becomes 1/1 Running.

  6. helm upgrade each of the existing 3 with the new 4-peer values.

  7. Apply a StretchCluster and a NodePool (replicas: 2) on rp-failover:

    apiVersion: cluster.redpanda.com/v1alpha2
    kind: StretchCluster
    metadata: { name: redpanda, namespace: redpanda }
    spec: ...   # same shape as the existing 3 clusters
    ---
    apiVersion: cluster.redpanda.com/v1alpha2
    kind: NodePool
    metadata: { name: rp-failover, namespace: redpanda }
    spec:
      clusterRef: { group: cluster.redpanda.com, kind: StretchCluster, name: redpanda }
      replicas: 2
      image: { repository: redpandadata/redpanda, tag: v26.1.6 }
      services: { perPod: { remote: { enabled: true } } }
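
For reference on step 4, the values file we render for rp-failover has roughly the shape below. Only multicluster.name, multicluster.peers, and fullnameOverride are keys taken from the setup above; the field names inside each peer entry (name / address) and the placeholder addresses are illustrative assumptions, not the chart's documented schema:

    cat > values-rp-failover.yaml <<'EOF'
    # Illustrative shape only; the per-peer field names here are assumptions.
    fullnameOverride: rp-failover
    multicluster:
      name: rp-failover
      peers:
        - name: rp-east
          address: <rp-east LB address>
        - name: rp-west
          address: <rp-west LB address>
        - name: rp-eu
          address: <rp-eu LB address>
        - name: rp-failover
          address: <rp-failover LB address>
    EOF

The matching 4-peer values for the existing three presumably differ only in multicluster.name and fullnameOverride.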

Observed (10+ minutes after step 7)

`rpk k8s multicluster status` reports the 4-peer mesh as fully healthy:

CLUSTER      OPERATOR  RAFT-STATE     LEADER       PEERS  UNHEALTHY  TLS  SECRETS
rp-east      Running   StateFollower  rp-failover  4      0          ok   ok
rp-west      Running   StateFollower  rp-failover  4      0          ok   ok
rp-eu        Running   StateFollower  rp-failover  4      0          ok   ok
rp-failover  Running   StateLeader    rp-failover  4      0          ok   ok

CROSS-CLUSTER:
  ✓ [unique-names] all node names are unique
  ✓ [peer-agreement] peer lists agree across all clusters
  ✓ [leader-agreement] leader agreement: rp-failover (term 4)
  ✓ [ca-consistency] all clusters share the same CA

…but the NodePool and StretchCluster on rp-failover are stuck:

$ kubectl --context rp-failover -n redpanda get nodepool
NAME          BOUND     DEPLOYED
rp-failover   Unknown   Unknown
$ kubectl --context rp-failover -n redpanda get nodepool rp-failover -o yaml
status:
  conditions:
  - lastTransitionTime: "1970-01-01T00:00:00Z"
    message: Waiting for controller
    reason: NotReconciled
    status: Unknown
    type: Bound
  - lastTransitionTime: "1970-01-01T00:00:00Z"
    message: Waiting for controller
    reason: NotReconciled
    status: Unknown
    type: Deployed
  - lastTransitionTime: "1970-01-01T00:00:00Z"
    message: Waiting for controller
    reason: NotReconciled
    status: Unknown
    type: Quiesced
  - lastTransitionTime: "1970-01-01T00:00:00Z"
    message: Waiting for controller
    reason: NotReconciled
    status: Unknown
    type: Stable
$ kubectl --context rp-failover -n redpanda get stretchcluster redpanda \
  -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
Ready=Unknown
Healthy=Unknown
LicenseValid=Unknown
ResourcesSynced=Unknown
ConfigurationApplied=Unknown
SpecSynced=Unknown
BootstrapUserSynced=Unknown
Quiesced=Unknown
Stable=Unknown

No StatefulSet is created and no broker pods exist. The operator's logs only show raft activity (vote requests, term advances, leader election); no Reconciler/NodePool or Reconciler/StretchCluster log lines.
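
A quick way to confirm the absence of reconciler activity (the deployment name follows fullnameOverride, and the grep pattern just matches the Reconciler/NodePool and Reconciler/StretchCluster logger prefixes named above); it prints 0 the whole time the bug is present:

    # Count reconciler log lines emitted by the rp-failover operator in the last 15 minutes.
    kubectl --context rp-failover -n redpanda logs deploy/rp-failover --since=15m \
      | grep -Ec 'Reconciler/(NodePool|StretchCluster)'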

Workaround

kubectl --context rp-failover -n redpanda rollout restart deployment rp-failover

Within ~30 s of the new operator pod coming up, the NodePool flips to BOUND=True / DEPLOYED=True, the StretchCluster transitions through Ready=False → True, the StatefulSet is created, broker pods come up, and they join the cluster with new node IDs (5 and 6).
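
As a stopgap (not a fix), the workaround can be automated using nothing beyond the commands and condition fields already shown above, e.g.:

    # If the NodePool is still waiting on the controller, bounce the local operator.
    reason=$(kubectl --context rp-failover -n redpanda get nodepool rp-failover \
      -o jsonpath='{.status.conditions[?(@.type=="Bound")].reason}')
    if [ "$reason" = "NotReconciled" ]; then
      kubectl --context rp-failover -n redpanda rollout restart deployment rp-failover
    fi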

Variant we hit on existing peers

The same kind of stall, but at the raft-join layer: when the new peer is added, it sometimes stays StatePreCandidate indefinitely, with the existing 3 reporting unhealthy peers: <new-peer>, even though all 4 LB IPs are reachable in both directions and TLS verifies. A kubectl rollout restart deployment on the existing 3 operators (so they reload the 4-peer config and re-handshake) makes the new peer's election succeed within a few seconds; the sketch below is what we run. We hit this on a previous run of the same flow.
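
The restart in question (deployment names follow fullnameOverride: <ctx> from step 1 above):

    # Bounce the three pre-existing operators so they re-handshake with the 4-peer config.
    for ctx in rp-east rp-west rp-eu; do
      kubectl --context "$ctx" -n redpanda rollout restart deployment "$ctx"
    done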

Expected

After the helm install on the new peer (and the helm upgrade of the existing peers), the new peer's operator should reconcile its own local NodePool/StretchCluster resources without needing a manual restart. The same goes for the raft join: when the existing peers' multicluster.peers is updated via helm upgrade, the new peer's election should converge without bouncing the existing operators.

Environment

  • Operator: redpanda/operator @ v26.2.1-beta.1 (chart) / multicluster build
  • Redpanda: v26.1.6
  • Kubernetes: AKS 1.34 (validated end-to-end on Azure: eastus / westus2 / centralus / eastus2 for failover) and GKE 1.35 RAPID
  • Both reproductions were bootstrapped fresh from Terraform; no carried-over state.

What might help

  • A startup ordering check: don't mark the operator pod Ready until raft membership has settled and the local controllers' caches have synced for at least one tick.
  • A defensive re-list of CRs after raft membership changes (or after the operator transitions between candidate/leader/follower for the first time).

Happy to provide additional logs / state dumps if useful.
