multicluster operator: NodePool / StretchCluster controllers don't reconcile on a newly-added peer until the operator Deployment is restarted #1493

Summary

When a 4th cluster is added to an existing 3-peer multicluster deployment, the new peer's operator pod comes up Running and joins the raft (often as StateLeader), but the NodePool and StretchCluster controllers never fire on that peer's local resources; they sit at conditions[*].status: Unknown with reason: NotReconciled / message: "Waiting for controller" indefinitely. A kubectl rollout restart deployment <operator> on the new peer immediately unblocks reconciliation. We've also observed a related variant of the same bug on the existing peers when the raft join/election stalls (see "Variant we hit on existing peers" below).

This looks like an initialization race between controller-runtime informer/cache setup and multicluster raft membership settling: the controllers register before raft is ready, then never recover from the empty-cache state without a process restart.

Operator version: v26.2.1-beta.1 (multicluster build).

Reproduction

The full multicluster setup is captured at https://github.com/david-yu/redpanda-operator-stretch-beta — Demo B's failover-region capacity-injection flow exercises this path. Minimal repro:

  1. Stand up the 3-peer stretch cluster per Steps 1–7 in the README:

    • 3 K8s clusters (rp-east, rp-west, rp-eu) with cross-region pod-IP routability
    • rpk k8s multicluster bootstrap --context rp-east --context rp-west --context rp-eu --namespace redpanda --loadbalancer
    • helm install <ctx> redpanda/operator --version 26.2.1-beta.1 --devel per cluster, with matching multicluster.peers and fullnameOverride: <ctx> values
    • Apply a StretchCluster and one NodePool per cluster; wait until the StretchCluster reports Ready=True / Healthy=True.
  2. Provision a 4th K8s cluster (rp-failover, separate region) with cross-region pod-IP + LB connectivity to the existing three.

  3. Re-run bootstrap with all 4 contexts (idempotent on the existing 3, generates fresh state for rp-failover):

    rpk k8s multicluster bootstrap \
      --context rp-east --context rp-west --context rp-eu --context rp-failover \
      --namespace redpanda --loadbalancer

  4. Render an rp-failover helm values file with multicluster.name: rp-failover and a 4-entry multicluster.peers block including all four LB addresses (an illustrative shape is sketched just after this list). Render matching 4-peer values for the existing three.

  5. helm install rp-failover redpanda/operator --version 26.2.1-beta.1 --devel -f values-rp-failover.yaml -n redpanda. The operator pod becomes 1/1 Running.

  6. helm upgrade each of the existing 3 with the new 4-peer values.

  7. Apply a StretchCluster and a NodePool (replicas: 2) on rp-failover:

    apiVersion: cluster.redpanda.com/v1alpha2
    kind: StretchCluster
    metadata: { name: redpanda, namespace: redpanda }
    spec: ...   # same shape as the existing 3 clusters
    ---
    apiVersion: cluster.redpanda.com/v1alpha2
    kind: NodePool
    metadata: { name: rp-failover, namespace: redpanda }
    spec:
      clusterRef: { group: cluster.redpanda.com, kind: StretchCluster, name: redpanda }
      replicas: 2
      image: { repository: redpandadata/redpanda, tag: v26.1.6 }
      services: { perPod: { remote: { enabled: true } } }
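
For reference on step 4, the values file we render for rp-failover has roughly the shape below. Only multicluster.name, multicluster.peers, and fullnameOverride are keys taken from the setup above; the field names inside each peer entry (name / address) and the placeholder addresses are illustrative assumptions, not the chart's documented schema:

    cat > values-rp-failover.yaml <<'EOF'
    # Illustrative shape only; the per-peer field names here are assumptions.
    fullnameOverride: rp-failover
    multicluster:
      name: rp-failover
      peers:
        - name: rp-east
          address: <rp-east LB address>
        - name: rp-west
          address: <rp-west LB address>
        - name: rp-eu
          address: <rp-eu LB address>
        - name: rp-failover
          address: <rp-failover LB address>
    EOF

The matching 4-peer values for the existing three presumably differ only in multicluster.name and fullnameOverride.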

Observed (10+ minutes after step 7)

`rpk k8s multicluster status` reports the 4-peer mesh as fully healthy:

CLUSTER      OPERATOR  RAFT-STATE     LEADER       PEERS  UNHEALTHY  TLS  SECRETS
rp-east      Running   StateFollower  rp-failover  4      0          ok   ok
rp-west      Running   StateFollower  rp-failover  4      0          ok   ok
rp-eu        Running   StateFollower  rp-failover  4      0          ok   ok
rp-failover  Running   StateLeader    rp-failover  4      0          ok   ok

CROSS-CLUSTER:
  ✓ [unique-names] all node names are unique
  ✓ [peer-agreement] peer lists agree across all clusters
  ✓ [leader-agreement] leader agreement: rp-failover (term 4)
  ✓ [ca-consistency] all clusters share the same CA

…but the NodePool and StretchCluster on rp-failover are stuck:

$ kubectl --context rp-failover -n redpanda get nodepool
NAME          BOUND     DEPLOYED
rp-failover   Unknown   Unknown
$ kubectl --context rp-failover -n redpanda get nodepool rp-failover -o yaml
status:
  conditions:
  - lastTransitionTime: "1970-01-01T00:00:00Z"
    message: Waiting for controller
    reason: NotReconciled
    status: Unknown
    type: Bound
  - lastTransitionTime: "1970-01-01T00:00:00Z"
    message: Waiting for controller
    reason: NotReconciled
    status: Unknown
    type: Deployed
  - lastTransitionTime: "1970-01-01T00:00:00Z"
    message: Waiting for controller
    reason: NotReconciled
    status: Unknown
    type: Quiesced
  - lastTransitionTime: "1970-01-01T00:00:00Z"
    message: Waiting for controller
    reason: NotReconciled
    status: Unknown
    type: Stable
$ kubectl --context rp-failover -n redpanda get stretchcluster redpanda \
  -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
Ready=Unknown
Healthy=Unknown
LicenseValid=Unknown
ResourcesSynced=Unknown
ConfigurationApplied=Unknown
SpecSynced=Unknown
BootstrapUserSynced=Unknown
Quiesced=Unknown
Stable=Unknown

No StatefulSet is created and no broker pods exist. The operator's logs only show raft activity (vote requests, term advances, leader election); no Reconciler/NodePool or Reconciler/StretchCluster log lines.
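
A quick way to confirm the absence of reconciler activity (the deployment name follows fullnameOverride, and the grep pattern just matches the Reconciler/NodePool and Reconciler/StretchCluster logger prefixes named above); it prints 0 the whole time the bug is present:

    # Count reconciler log lines emitted by the rp-failover operator in the last 15 minutes.
    kubectl --context rp-failover -n redpanda logs deploy/rp-failover --since=15m \
      | grep -Ec 'Reconciler/(NodePool|StretchCluster)'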

Workaround

kubectl --context rp-failover -n redpanda rollout restart deployment rp-failover

Within ~30 s of the new operator pod coming up, the NodePool flips to BOUND=True / DEPLOYED=True, the StretchCluster transitions through Ready=False → True, the StatefulSet is created, broker pods come up, and they join the cluster with new node IDs (5 and 6).
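
As a stopgap (not a fix), the workaround can be automated using nothing beyond the commands and condition fields already shown above, e.g.:

    # If the NodePool is still waiting on the controller, bounce the local operator.
    reason=$(kubectl --context rp-failover -n redpanda get nodepool rp-failover \
      -o jsonpath='{.status.conditions[?(@.type=="Bound")].reason}')
    if [ "$reason" = "NotReconciled" ]; then
      kubectl --context rp-failover -n redpanda rollout restart deployment rp-failover
    fi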

Variant we hit on existing peers

The same kind of stall, but at the raft-join layer: when the new peer is added, it sometimes stays StatePreCandidate indefinitely, with the existing 3 reporting unhealthy peers: <new-peer>, even though all 4 LB IPs are reachable in both directions and TLS verifies. A kubectl rollout restart deployment on the existing 3 operators (so they reload the 4-peer config and re-handshake) makes the new peer's election succeed within a few seconds; the sketch below is what we run. We hit this on a previous run of the same flow.
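
The restart in question (deployment names follow fullnameOverride: <ctx> from step 1 above):

    # Bounce the three pre-existing operators so they re-handshake with the 4-peer config.
    for ctx in rp-east rp-west rp-eu; do
      kubectl --context "$ctx" -n redpanda rollout restart deployment "$ctx"
    done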

Expected

After the helm install on the new peer (and the helm upgrade of the existing peers), the new peer's operator should reconcile its own local NodePool/StretchCluster resources without needing a manual restart. The same goes for the raft join: when the existing peers' multicluster.peers is updated via helm upgrade, the new peer's election should converge without bouncing the existing operators.

Environment

  • Operator: redpanda/operator @ v26.2.1-beta.1 (chart) / multicluster build
  • Redpanda: v26.1.6
  • Kubernetes: AKS 1.34 (validated end-to-end on Azure: eastus / westus2 / centralus / eastus2 for failover) and GKE 1.35 RAPID
  • Both reproductions were bootstrapped fresh from Terraform; no carried-over state.

What might help

  • A startup ordering check: don't mark the operator pod Ready until raft membership has settled and the local controllers' caches have synced for at least one tick.
  • A defensive re-list of CRs after raft membership changes (or after the operator transitions between candidate/leader/follower for the first time).

Happy to provide additional logs / state dumps if useful.
