
Fix various multicluster issues #1480

Merged
andrewstucki merged 4 commits into main from as/multicluster-resiliancy-and-bootstrap-fixes
Apr 24, 2026

@andrewstucki
Contributor

A batch of fixes surfaced during end-to-end stretch-cluster partition testing (a 3-cluster demo across EKS/AKS/GKE over Tailscale + Cilium ClusterMesh), plus tooling improvements to rpk k8s multicluster that remove the chicken-and-egg bootstrap cycle when using publicly accessible load balancers and that fix the TLS-SAN health check.

Bug fixes

Hot reconcile loop on healthy stretch clusters

Two root causes produced a steady stream of .status writes at the condition-heartbeat cadence:

NOTE: this also exists in the current Redpanda controller and will need to be partially backported.

PodEndpoints dropped during cross-cluster owner rebuild

In flat-network mode, StretchClusterOwnershipResolver.ResolveOwnerReference rebuilds the owner wrapper when the owner UID comes from a different k8s cluster than the one being reconciled. The rebuild used NewStretchClusterWithPools, which carries forward NodePools but not PodEndpoints. Result: each SyncAll iteration that targeted a peer cluster would render flat-mode per-pod Endpoints with an empty IP list, and the Syncer would GC existing cross-cluster Endpoints/EndpointSlices every reconcile cycle.

Fixed by preserving PodEndpoints on the new owner in operator/internal/lifecycle/stretch_cluster_ownership.go.
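
To illustrate the failure mode, here is a minimal sketch. The types and helpers (StretchCluster, newWithPools, rebuildOwner, endpointIPs) are hypothetical stand-ins and not the operator's real definitions; the sketch only shows how a constructor that carries NodePools but not PodEndpoints leaves the flat-mode renderer with an empty IP list, and how copying the endpoints across avoids that.

```go
package main

import "fmt"

// Hypothetical stand-ins for the operator's types; field names are assumptions.
type NodePool struct{ Name string }

type PodEndpoint struct {
	PodName string
	IP      string
}

type StretchCluster struct {
	Name         string
	NodePools    []NodePool
	PodEndpoints []PodEndpoint
}

// newWithPools mimics a constructor that carries forward NodePools only.
func newWithPools(name string, pools []NodePool) *StretchCluster {
	return &StretchCluster{Name: name, NodePools: pools}
}

// rebuildOwner rebuilds the owner wrapper for a peer cluster.
// With preserveEndpoints=false it reproduces the bug; with true, the fix.
func rebuildOwner(old *StretchCluster, preserveEndpoints bool) *StretchCluster {
	rebuilt := newWithPools(old.Name, old.NodePools)
	if preserveEndpoints {
		// Copy so the rebuilt owner does not alias the original slice.
		rebuilt.PodEndpoints = append([]PodEndpoint(nil), old.PodEndpoints...)
	}
	return rebuilt
}

// endpointIPs is what a flat-mode renderer would place into per-pod Endpoints.
func endpointIPs(sc *StretchCluster) []string {
	ips := make([]string, 0, len(sc.PodEndpoints))
	for _, ep := range sc.PodEndpoints {
		ips = append(ips, ep.IP)
	}
	return ips
}

func main() {
	owner := &StretchCluster{
		Name:         "demo",
		NodePools:    []NodePool{{Name: "pool-a"}},
		PodEndpoints: []PodEndpoint{{PodName: "demo-0", IP: "10.0.0.5"}},
	}
	fmt.Println("without fix:", endpointIPs(rebuildOwner(owner, false)))
	fmt.Println("with fix:   ", endpointIPs(rebuildOwner(owner, true)))
}
```

An empty rendered IP list is what would lead a syncer to treat the existing cross-cluster Endpoints as stale on every reconcile.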

Observability

Trace logging added along the flat-mode per-pod Endpoints path so the next time this regresses we can see it in the logs rather than correlating timestamps with kubectl get endpoints:

rpk k8s multicluster status — TLS SAN check correctness

operator/cmd/rpk-k8s/k8s/multicluster/checks/cluster_tls_san.go has been rewritten. The previous check substring-matched the cluster's logical name (e.g., "two") against cert.DNSNames, so:

  • Clusters with DNS-only peer addresses could fail spuriously if the address did not contain the cluster name as a substring.
  • Clusters with IP-only peer addresses always failed because cert.IPAddresses wasn't consulted.

Now resolves the expected peer address from the Deployment's --peer=<self>://<addr>:9443 flag and calls x509.Certificate.VerifyHostname(addr) — the same routine Go's TLS stack uses during a real peer dial. Handles DNS wildcards, IP literals, and case normalization uniformly. Error messages include both DNSNames and IPAddresses so the cause is obvious without pulling the cert.

rpk k8s multicluster bootstrap --loadbalancer

New flag resolves the deploy/redeploy chicken-and-egg: previously, to get peer addresses into each cluster's cert SANs and --peer flags when leveraging an external load balancer, you'd install the operator, wait for the chart's LoadBalancer Service to provision, read the addresses, and redeploy with those values baked into helm values.

Now bootstrap --loadbalancer provisions a standalone peer-only LoadBalancer Service per cluster (name <fullname>-multicluster-peer, distinct from the chart's Service, labelled operator.redpanda.com/bootstrap-managed=true for discovery), waits for the provider to publish an address, and signs each cert with that address in the SANs. On completion it prints a ready-to-paste multicluster.peers block in helm-values shape.

New files:

Precedence when both --loadbalancer and --dns-override are specified: --dns-override wins for the cert SAN, LB provisioning is skipped for that cluster. If someone needs "always provision, but cert SAN from override" it's a follow-up that splits ServiceAddress into advertise-as vs cert-SAN fields.
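
The precedence rule, written out as a small decision function. sanPlan and planSAN are hypothetical names for illustration; the real bootstrap code may be structured quite differently, and the case where neither flag is set is not described in this PR, so the sketch leaves it empty.

```go
package main

import "fmt"

// sanPlan is a hypothetical summary of what bootstrap would do for one cluster.
type sanPlan struct {
	CertSAN     string // SAN known up front; empty if it depends on the provisioned LB
	ProvisionLB bool   // whether a peer LoadBalancer Service is provisioned
}

// planSAN applies the precedence described above. dnsOverride is the
// --dns-override value for this cluster ("" if unset); loadBalancer is
// whether --loadbalancer was passed.
func planSAN(loadBalancer bool, dnsOverride string) sanPlan {
	switch {
	case dnsOverride != "":
		// --dns-override wins for the cert SAN; LB provisioning is skipped.
		return sanPlan{CertSAN: dnsOverride, ProvisionLB: false}
	case loadBalancer:
		// The SAN is only known once the provider publishes an address.
		return sanPlan{ProvisionLB: true}
	default:
		// Behaviour without either flag is outside this sketch.
		return sanPlan{}
	}
}

func main() {
	fmt.Printf("%+v\n", planSAN(true, "peer.example.com"))
	fmt.Printf("%+v\n", planSAN(true, ""))
}
```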

Chart: multicluster.service addition.

The chart can now render a configurable Service alongside each operator Deployment, so that service meshes and MCS implementations can quickly discover the operator. The Service and MCS primitives are disabled by default. If you want the operators to communicate over a public network, the bootstrap --loadbalancer flag is likely the better choice, though you can also combine a LoadBalancer Service provisioned here with something like external DNS annotations to give it a predetermined, well-known hostname.

@andrewstucki
Contributor Author

Regarding the NOTE in the description ("this also exists in the current Redpanda controller and will need to be partially backported"): I'll open individual "backport" PRs for this fix against our current target branches.

@andrewstucki andrewstucki changed the title from "Fix multicluster issues" to "Fix various multicluster issues" Apr 24, 2026
Comment thread operator/chart/chart.go
Comment thread operator/chart/service.go
@andrewstucki andrewstucki enabled auto-merge (squash) April 24, 2026 12:52
@andrewstucki andrewstucki merged commit 4432ce5 into main Apr 24, 2026
12 checks passed

3 participants