Summary
When multicluster.enabled: true (or any setup where the operator emits broker hostnames in the form <podname>.<headless-svc> — i.e. only two DNS labels), the operator-generated Certificate resource is issued with a SAN list whose wildcard entries have a single-label parent (*.redpanda, *.redpanda.svc). RFC 6125 §6.4.3 disallows wildcards on a single-label parent (analogous to *.com), and OpenSSL — which is what the Redpanda broker links against — enforces this. So even though brokers complete the TLS handshake, the RPC client drops the connection immediately afterwards because the server-cert hostname check fails:
verify error:num=62: hostname mismatch
In broker logs this surfaces as:
rpc - server.cc:175 - Error[applying protocol] remote address: <ip>:<port> -
std::__1::system_error (error OpenSSL:167772358,
Failed to establish SSL handshake:
[error:0A0000C6:SSL routines::packet length too long,
error:0A000139:SSL routines::record layer failure])
…on the listening side, and on the initiating side as a stream of cluster_bootstrap_info failed … rpc::errc:4 and Error dispatching socket write … Broken pipe. The brokers never form quorum and the StretchCluster stays Ready: False forever.
How to reproduce (minimal — no stretch / cross-cloud setup needed)
Single-cluster reproduction. The bug is in the operator's Certificate generation, not in the cross-cluster machinery — you only need multicluster mode toggled on, even if there's a single peer that points back at itself.
# 1. Install cert-manager (any recent version).
helm install cert-manager jetstack/cert-manager \
  -n cert-manager --create-namespace \
  --version v1.16.2 --set installCRDs=true
# 2. Install the operator in multicluster mode with a self-peer.
helm repo add redpanda-data https://charts.redpanda.com
helm install rp-self redpanda-data/operator -n redpanda --create-namespace \
  --version 26.2.1-beta.1 \
  --set fullnameOverride=rp-self \
  --set crds.enabled=true \
  --set multicluster.enabled=true \
  --set multicluster.name=rp-self \
  --set multicluster.apiServerExternalAddress=https://kubernetes.default.svc \
  --set multicluster.peers[0].name=rp-self \
  --set multicluster.peers[0].address=rp-self-multicluster-peer.redpanda.svc.cluster.local
# 3. Apply a StretchCluster with TLS on + cert-manager-issued CA.
cat <<EOF | kubectl apply -f -
apiVersion: cluster.redpanda.com/v1alpha2
kind: StretchCluster
metadata:
  name: redpanda
  namespace: redpanda
spec:
  rbac: { enabled: true }
  external: { enabled: false }
  networking: { crossClusterMode: flat }
  tls:
    enabled: true
    certs:
      default:
        caEnabled: true
        applyInternalDNSNames: true
EOF
cat <<EOF | kubectl apply -f -
apiVersion: cluster.redpanda.com/v1alpha2
kind: NodePool
metadata:
  name: rp-self
  namespace: redpanda
spec:
  clusterRef:
    group: cluster.redpanda.com
    kind: StretchCluster
    name: redpanda
  replicas: 3
  image:
    repository: redpandadata/redpanda
    tag: v26.1.6
EOF
# 4. Inspect the generated Certificate's SANs.
kubectl -n redpanda get secret redpanda-default-cert -o jsonpath='{.data.tls\.crt}' \
  | base64 -d \
  | openssl x509 -noout -ext subjectAltName
# Expected output (this is the bug):
# X509v3 Subject Alternative Name: critical
# DNS:redpanda.redpanda.svc.cluster.local,
# DNS:redpanda.redpanda.svc,
# DNS:redpanda.redpanda,
# DNS:*.redpanda.redpanda.svc.cluster.local,
# DNS:*.redpanda.redpanda.svc,
# DNS:*.redpanda.redpanda,
# DNS:*.redpanda.svc.cluster.local,
# DNS:*.redpanda.svc,
# DNS:*.redpanda ← single-label parent, RFC-6125 violation
#
# 5. Try strict hostname verification against the hostname brokers actually advertise.
ADVERTISED=redpanda-rp-self-0.redpanda
echo Q | openssl s_client \
  -connect ${ADVERTISED}:33145 \
  -servername ${ADVERTISED} \
  -CAfile <(kubectl -n redpanda get secret redpanda-default-cert -o jsonpath='{.data.ca\.crt}' | base64 -d) \
  -verify_hostname ${ADVERTISED} \
  2>&1 | grep -i 'verify error\|hostname'
# Expected:
# verify error:num=62: hostname mismatch
# Verify return code: 62 (hostname mismatch)
In stretch / multicluster mode, the brokers never reach Ready because every cluster_bootstrap_info RPC handshake hits this. In single-cluster non-multicluster mode the operator emits broker FQDNs with more labels (redpanda-0.redpanda.<ns>.svc.cluster.local) where the wildcard SAN's parent is at least 4 labels deep, so OpenSSL accepts it and the bug is hidden.
Root cause
Two things conspire:
- The operator's flat cross-cluster networking mode writes both seed_servers[].host.address and advertised_rpc_api.address as <podname>.<headless-svc> (two labels) so that the same hostname resolves identically on every cluster. There is no namespace/svc/cluster-local suffix.
- The operator's applyInternalDNSNames: true cert template emits wildcard SANs that mirror the headless-Service hierarchy (*.redpanda, *.redpanda.svc, *.redpanda.svc.cluster.local, etc.). The shortest of those, *.redpanda, is the only one whose parent matches the 2-label advertised hostname, but that parent is a single label and OpenSSL refuses the match.
So the SANs the operator emits don't actually cover the hostname the operator's own RPC clients connect to.
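The single-label-parent rule is easy to demonstrate in isolation with a throwaway self-signed cert carrying the same two wildcard shapes. All file paths and hostnames below are illustrative, not operator output:

```shell
# Throwaway self-signed cert with a single-label-parent wildcard (*.redpanda)
# and a deep wildcard (*.redpanda.svc.cluster.local). Illustrative names only.
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout /tmp/wild.key -out /tmp/wild.crt -subj "/CN=redpanda" \
  -addext "subjectAltName=DNS:*.redpanda,DNS:*.redpanda.svc.cluster.local" \
  2>/dev/null

# 2-label hostname against *.redpanda: single-label parent, OpenSSL refuses.
bad=$(openssl x509 -in /tmp/wild.crt -noout -checkhost redpanda-rp-self-0.redpanda)
echo "$bad"

# 5-label hostname against *.redpanda.svc.cluster.local: 4-label parent, accepted.
good=$(openssl x509 -in /tmp/wild.crt -noout -checkhost redpanda-0.redpanda.svc.cluster.local)
echo "$good"
```

The first check reports the hostname does NOT match; the second reports a match, with no operator or cluster involved at all.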
Proposed fix: add explicit per-broker SANs to the Certificate template
When the operator generates the broker Certificate, in addition to the headless-Service wildcards it should add one explicit DNS SAN per broker matching what's written into advertised_rpc_api.address / seed_servers[].host.address. Concretely:
# Today (the bug):
dnsNames:
  - redpanda.redpanda.svc.cluster.local
  - redpanda.redpanda.svc
  - redpanda.redpanda
  - '*.redpanda.svc.cluster.local'
  - '*.redpanda.svc'
  - '*.redpanda'   # ← unusable
  # Proposed: also include the explicit per-broker hostnames (one per broker
  # pod in this NodePool, plus one per peer broker in cross-cluster mode):
  - redpanda-rp-self-0.redpanda
  - redpanda-rp-self-1.redpanda
  - redpanda-rp-self-2.redpanda
Because the operator already knows every broker's pod name (<cluster>-<nodepool>-<ordinal>) at reconcile time, it has all the information needed to enumerate these SANs, in the same place it already builds the seed_servers list.
This costs one extra DNS SAN per broker on a Cert that's regenerated infrequently — bounded and small.
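Since pod names are deterministic, the enumeration is a few lines. A sketch in shell using the names from the repro above; the variable names are illustrative, not operator code:

```shell
# Sketch of the per-broker SAN enumeration using the repro's names.
# The operator would derive these values from the StretchCluster/NodePool
# objects it is already reconciling.
CLUSTER=redpanda   # StretchCluster name
POOL=rp-self       # NodePool name
REPLICAS=3         # NodePool replicas
SVC=redpanda       # headless Service / advertised-hostname suffix

sans=$(for i in $(seq 0 $((REPLICAS - 1))); do
  printf '%s-%s-%d.%s\n' "$CLUSTER" "$POOL" "$i" "$SVC"
done)
echo "$sans"
# redpanda-rp-self-0.redpanda
# redpanda-rp-self-1.redpanda
# redpanda-rp-self-2.redpanda
```

The output is exactly the three explicit SANs proposed above.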
Backwards compatibility (non-stretch / pre-26.2 broker)
The change is purely additive on the Certificate's dnsNames list and doesn't require any new field on the StretchCluster / Redpanda CR or any new behavior in the broker. Concretely:
- Existing wildcards keep working. A non-stretch single-cluster deployment uses brokers that advertise their full FQDN (redpanda-0.redpanda.<ns>.svc.cluster.local, six labels). Wildcard SANs like *.redpanda.<ns>.svc.cluster.local (5-label parent) match those fine; that path is unaffected. We're only adding more SANs alongside the existing ones, not replacing them.
- No broker-side config change required. A 26.1 (non-stretch) broker reading a cert with three extra DNS SANs verifies it the same way it does today — OpenSSL just sees a longer SAN list and walks it linearly until something matches the connection hostname. There's no new TLS config knob being introduced.
- Reconcile site is the same one that already enumerates brokers. The operator code that builds seed_servers / advertised_rpc_api runs on every Redpanda and StretchCluster reconcile, including non-multicluster Redpanda CRs. Adding a parallel loop that emits one explicit SAN per pod doesn't open any new code path; it consumes an enumeration that's already produced.
- Explicitly handles the multicluster.enabled + non-stretch combination. The minimal repro at the top of this issue is a single-cluster deployment with multicluster.enabled: true and a self-peer: non-stretch, but it still hits the bug. The fix lands them in the working state too, without any flag flip.
Net effect: existing non-stretch deployments see a cert with a few additional well-formed SANs; nothing changes for verification of the hostnames they were already verifying. The bug is fixed for stretch / multicluster / crossClusterMode: flat setups where the broker advertises a 2-label hostname today.
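The additive claim can be sanity-checked the same way, with a throwaway cert that carries one existing-style wildcard plus one proposed explicit SAN (illustrative names only):

```shell
# Throwaway cert: one existing-style wildcard SAN plus one proposed explicit
# per-broker SAN. Illustrative names, not operator output.
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout /tmp/add.key -out /tmp/add.crt -subj "/CN=redpanda" \
  -addext "subjectAltName=DNS:*.redpanda.redpanda.svc.cluster.local,DNS:redpanda-rp-self-0.redpanda" \
  2>/dev/null

# Old path: a full FQDN still matches the wildcard (unchanged behavior).
old=$(openssl x509 -in /tmp/add.crt -noout -checkhost redpanda-0.redpanda.redpanda.svc.cluster.local)
echo "$old"

# New path: the 2-label advertised hostname now matches its explicit SAN.
new=$(openssl x509 -in /tmp/add.crt -noout -checkhost redpanda-rp-self-0.redpanda)
echo "$new"
```

Both checks report a match: the wildcard path is untouched and the short advertised hostname is now covered.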
Workaround in the meantime
We're running with spec.tls.enabled: false at the StretchCluster layer plus explicit spec.listeners.{kafka,admin,http,schemaRegistry,rpc}.tls.enabled: false on every listener (the chart still emits kafka_api_tls etc. against nonexistent cert paths if listener TLS isn't disabled per-listener). Confidentiality is provided by the underlay — in our cross-cloud scaffold, IPsec VPN tunnels between clouds — but a single-cluster non-stretch deployment doesn't have that fallback.
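For reference, the workaround's shape on the StretchCluster spec. The field names are assumed from the CR layout in the repro above and should be checked against your CRD version:

```yaml
# Workaround sketch (assumed field names): TLS off at the cluster layer AND
# per listener, so the chart stops emitting kafka_api_tls et al. against
# nonexistent cert paths.
spec:
  tls:
    enabled: false
  listeners:
    kafka: { tls: { enabled: false } }
    admin: { tls: { enabled: false } }
    http: { tls: { enabled: false } }
    schemaRegistry: { tls: { enabled: false } }
    rpc: { tls: { enabled: false } }
```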
Environment
- Operator chart: redpanda-data/operator @ 26.2.1-beta.1
- Redpanda image: redpandadata/redpanda:v26.1.6
- cert-manager: v1.16.2
- Tested on EKS 1.34, GKE 1.35.3, AKS 1.34 — same behavior on all three, doesn't depend on the K8s flavor.
- Kubernetes and OpenSSL versions in the broker image follow upstream defaults. OpenSSL's hostname check (X509_check_host) requires at least two labels after the wildcard, so a *.redpanda SAN is rejected on any recent OpenSSL, not just the 3.x line.
cc: scaffold for full reproduction at https://github.com/david-yu/redpanda-operator-stretch-cross-cloud-beta (cross-cloud) — the same-cloud variant https://github.com/david-yu/redpanda-operator-stretch-beta appears to work because the broker pods land on the same Kubernetes context's local DNS where the chart uses longer FQDNs; the cross-cluster flat mode short-form is the trigger.