
Operator-issued broker TLS certs reject strict hostname verification on the advertised RPC hostname (*.redpanda violates RFC-6125) #1499

@david-yu

Description


Summary

When multicluster.enabled: true (or any setup where the operator emits broker hostnames in the form <podname>.<headless-svc> — i.e. only two DNS labels), the operator-generated Certificate resource is issued with a SAN list whose wildcard entries have a single-label parent (*.redpanda, *.redpanda.svc). RFC 6125 §6.4.3 disallows wildcards on a single-label parent (analogous to *.com), and OpenSSL — which is what the Redpanda broker links against — enforces this. So even though brokers complete the TLS handshake, the RPC client drops the connection immediately afterwards because the server-cert hostname check fails:

verify error:num=62: hostname mismatch

In broker logs this surfaces as:

rpc - server.cc:175 - Error[applying protocol] remote address: <ip>:<port> -
  std::__1::system_error (error OpenSSL:167772358,
    Failed to establish SSL handshake:
    [error:0A0000C6:SSL routines::packet length too long,
     error:0A000139:SSL routines::record layer failure])

…on the listening side, and on the initiating side as a stream of cluster_bootstrap_info failed … rpc::errc:4 and Error dispatching socket write … Broken pipe. The brokers never form quorum and the StretchCluster stays Ready: False forever.

How to reproduce (minimal — no stretch / cross-cloud setup needed)

Single-cluster reproduction. The bug is in the operator's Certificate generation, not in the cross-cluster machinery — you only need multicluster mode toggled on, even if there's a single peer that points back at itself.

# 1. Install cert-manager (any recent version).
helm install cert-manager jetstack/cert-manager \
  -n cert-manager --create-namespace \
  --version v1.16.2 --set installCRDs=true

# 2. Install the operator in multicluster mode with a self-peer.
helm repo add redpanda-data https://charts.redpanda.com
helm install rp-self redpanda-data/operator -n redpanda --create-namespace \
  --version 26.2.1-beta.1 \
  --set fullnameOverride=rp-self \
  --set crds.enabled=true \
  --set multicluster.enabled=true \
  --set multicluster.name=rp-self \
  --set multicluster.apiServerExternalAddress=https://kubernetes.default.svc \
  --set multicluster.peers[0].name=rp-self \
  --set multicluster.peers[0].address=rp-self-multicluster-peer.redpanda.svc.cluster.local

# 3. Apply a StretchCluster with TLS on + cert-manager-issued CA.
cat <<EOF | kubectl apply -f -
apiVersion: cluster.redpanda.com/v1alpha2
kind: StretchCluster
metadata:
  name: redpanda
  namespace: redpanda
spec:
  rbac: { enabled: true }
  external: { enabled: false }
  networking: { crossClusterMode: flat }
  tls:
    enabled: true
    certs:
      default:
        caEnabled: true
        applyInternalDNSNames: true
EOF

cat <<EOF | kubectl apply -f -
apiVersion: cluster.redpanda.com/v1alpha2
kind: NodePool
metadata:
  name: rp-self
  namespace: redpanda
spec:
  clusterRef:
    group: cluster.redpanda.com
    kind: StretchCluster
    name: redpanda
  replicas: 3
  image:
    repository: redpandadata/redpanda
    tag: v26.1.6
EOF

# 4. Inspect the generated Certificate's SANs.
kubectl -n redpanda get secret redpanda-default-cert -o jsonpath='{.data.tls\.crt}' \
  | base64 -d \
  | openssl x509 -noout -ext subjectAltName

# Expected output (this is the bug):
#   X509v3 Subject Alternative Name: critical
#       DNS:redpanda.redpanda.svc.cluster.local,
#       DNS:redpanda.redpanda.svc,
#       DNS:redpanda.redpanda,
#       DNS:*.redpanda.redpanda.svc.cluster.local,
#       DNS:*.redpanda.redpanda.svc,
#       DNS:*.redpanda.redpanda,
#       DNS:*.redpanda.svc.cluster.local,
#       DNS:*.redpanda.svc,
#       DNS:*.redpanda                       ← single-label parent, RFC-6125 violation
#
# 5. Try strict hostname verification against the hostname brokers actually advertise.
ADVERTISED=redpanda-rp-self-0.redpanda
echo Q | openssl s_client \
  -connect ${ADVERTISED}:33145 \
  -servername ${ADVERTISED} \
  -CAfile <(kubectl -n redpanda get secret redpanda-default-cert -o jsonpath='{.data.ca\.crt}' | base64 -d) \
  -verify_hostname ${ADVERTISED} \
  2>&1 | grep -i 'verify error\|hostname'

# Expected:
#   verify error:num=62: hostname mismatch
#   Verify return code: 62 (hostname mismatch)

In stretch / multicluster mode, the brokers never reach Ready because every cluster_bootstrap_info RPC handshake hits this. In single-cluster non-multicluster mode the operator emits broker FQDNs with more labels (redpanda-0.redpanda.<ns>.svc.cluster.local), so the matching wildcard SAN (*.redpanda.<ns>.svc.cluster.local) has a multi-label parent that OpenSSL accepts, and the bug stays hidden.
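
The difference is just label depth, which a short sketch makes concrete (parent_labels is a hypothetical helper introduced here for illustration, counting the DNS labels to the right of the wildcard):

```python
def parent_labels(wildcard_san: str) -> int:
    # Count the DNS labels in the wildcard SAN's parent domain.
    return wildcard_san.removeprefix("*.").count(".") + 1

print(parent_labels("*.redpanda"))                             # 1 -- refused, like '*.com'
print(parent_labels("*.redpanda.redpanda.svc.cluster.local"))  # 5 -- deep enough, accepted
```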

Root cause

Two things conspire:

  1. The operator's flat cross-cluster networking mode writes both seed_servers[].host.address and advertised_rpc_api.address as <podname>.<headless-svc> (2 labels) so the same hostname resolves identically on every cluster. There is no namespace/svc/cluster-local suffix.
  2. The operator's applyInternalDNSNames: true cert-template emits wildcard SANs that mirror the headless-Service hierarchy (*.redpanda, *.redpanda.svc, *.redpanda.svc.cluster.local, etc.). The shortest of those — *.redpanda — is the only one with a parent that matches the 2-label advertised hostname, but its parent is a single label and OpenSSL refuses the match.

So the SANs the operator emits don't actually cover the hostname the operator's own RPC clients connect to.
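
The refused match can be sketched in Python; wildcard_matches is a hypothetical helper implementing the strict RFC 6125 §6.4.3 rule described above, not the broker's actual code (which delegates to OpenSSL's X509_check_host):

```python
def wildcard_matches(pattern: str, hostname: str) -> bool:
    """Strict RFC 6125 §6.4.3 wildcard check (illustrative sketch).

    A wildcard may only be the entire left-most label, and its parent
    domain must itself contain at least two labels -- so '*.redpanda'
    is refused the same way '*.com' would be.
    """
    if not pattern.startswith("*."):
        return pattern.lower() == hostname.lower()   # exact match only
    parent = pattern[2:].lower()
    if "." not in parent:            # single-label parent: refuse outright
        return False
    labels = hostname.lower().split(".", 1)
    return len(labels) == 2 and labels[1] == parent

# The hostname the operator's RPC clients actually dial:
advertised = "redpanda-rp-self-0.redpanda"
print(wildcard_matches("*.redpanda", advertised))               # False -- the bug
print(wildcard_matches("*.redpanda.svc", advertised + ".svc"))  # True
```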

Proposed fix: add explicit per-broker SANs to the Certificate template

When the operator generates the broker Certificate, in addition to the headless-Service wildcards it should add one explicit DNS SAN per broker matching what's written into advertised_rpc_api.address / seed_servers[].host.address. Concretely:

# Today (the bug):
dnsNames:
  - redpanda.redpanda.svc.cluster.local
  - redpanda.redpanda.svc
  - redpanda.redpanda
  - '*.redpanda.svc.cluster.local'
  - '*.redpanda.svc'
  - '*.redpanda'                         # ← unusable

# Proposed: also include the explicit per-broker hostnames (one per broker
# pod in this NodePool, plus one per peer broker in cross-cluster mode):
  - redpanda-rp-self-0.redpanda
  - redpanda-rp-self-1.redpanda
  - redpanda-rp-self-2.redpanda

Because the operator already knows every broker's pod name (the StatefulSet ordinal combined with the NodePool and Cluster names) at reconcile time, it has all the information needed to enumerate these SANs — the same place it already builds the seed_servers list.
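
A minimal sketch of that enumeration, assuming the <cluster>-<nodepool>-<ordinal>.<headless-svc> naming seen in this issue (broker_sans is a hypothetical helper, not the operator's actual code):

```python
def broker_sans(cluster: str, nodepool: str, headless_svc: str, replicas: int) -> list[str]:
    # One explicit DNS SAN per broker pod, mirroring advertised_rpc_api.address.
    # The naming convention is an assumption based on the hostnames in this issue.
    return [f"{cluster}-{nodepool}-{i}.{headless_svc}" for i in range(replicas)]

print(broker_sans("redpanda", "rp-self", "redpanda", 3))
# ['redpanda-rp-self-0.redpanda', 'redpanda-rp-self-1.redpanda', 'redpanda-rp-self-2.redpanda']
```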

This costs one extra DNS SAN per broker on a Cert that's regenerated infrequently — bounded and small.

Backwards compatibility (non-stretch / pre-26.2 broker)

The change is purely additive on the Certificate's dnsNames list and doesn't require any new field on the StretchCluster / Redpanda CR or any new behavior in the broker. Concretely:

  • Existing wildcards keep working. A non-stretch single-cluster deployment uses brokers that advertise their full FQDN (redpanda-0.redpanda.<ns>.svc.cluster.local — 6 labels). Wildcard SANs like *.redpanda.<ns>.svc.cluster.local (5-label parent) match those fine; that path is unaffected. We're only adding more SANs alongside the existing ones, not replacing them.
  • No broker-side config change required. A 26.1 (non-stretch) broker reading a cert with three extra DNS SANs verifies it the same way it does today — OpenSSL just sees a longer SAN list and walks it linearly until something matches the connection hostname. There's no new TLS config knob being introduced.
  • Reconcile site is the same one that already enumerates brokers. The operator code that builds seed_servers / advertised_rpc_api runs on every Redpanda and StretchCluster reconcile, including non-multicluster Redpanda CRs. Adding a parallel loop that emits one explicit SAN per pod doesn't open any new code path — it consumes an enumeration that's already produced.
  • Explicitly handles the multicluster.enabled + non-stretch combination. The minimal repro at the top of this issue is a single-cluster deployment with multicluster.enabled: true and a self-peer — non-stretch, but still hits the bug. The fix lands them in the working state too without any flag flip.

Net effect: existing non-stretch deployments see a cert with a few additional well-formed SANs; nothing changes for verification of the hostnames they were already verifying. The bug is fixed for stretch / multicluster / crossClusterMode: flat setups where the broker advertises a 2-label hostname today.

Workaround in the meantime

We're running with spec.tls.enabled: false at the StretchCluster layer, plus explicit spec.listeners.{kafka,admin,http,schemaRegistry,rpc}.tls.enabled: false on every listener (if listener TLS isn't disabled per-listener, the chart still emits kafka_api_tls etc. pointing at nonexistent cert paths). Confidentiality is provided by the underlay — in our cross-cloud scaffold, IPsec VPN tunnels between clouds — but a single-cluster non-stretch deployment doesn't have that fallback.
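
For reference, a sketch of the workaround spec. The field paths are the ones quoted above; the exact listener schema is assumed and may differ:

```yaml
spec:
  tls:
    enabled: false                               # StretchCluster-layer TLS off
  listeners:                                     # each listener ALSO disabled explicitly,
    kafka:          { tls: { enabled: false } }  # or the chart still emits kafka_api_tls
    admin:          { tls: { enabled: false } }  # etc. against nonexistent cert paths
    http:           { tls: { enabled: false } }
    schemaRegistry: { tls: { enabled: false } }
    rpc:            { tls: { enabled: false } }
```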

Environment

  • Operator chart: redpanda-data/operator @ 26.2.1-beta.1
  • Redpanda image: redpandadata/redpanda:v26.1.6
  • cert-manager: v1.16.2
  • Tested on EKS 1.34, GKE 1.35.3, AKS 1.34 — same behavior on all three, doesn't depend on the K8s flavor.
  • OpenSSL in the broker image follows upstream defaults; the 3.x series is strict about RFC 6125 single-label wildcard parents and rejects them by default.

Full reproduction scaffold: https://github.com/david-yu/redpanda-operator-stretch-cross-cloud-beta (cross-cloud). The same-cloud variant, https://github.com/david-yu/redpanda-operator-stretch-beta, appears to work because the broker pods land on the same Kubernetes context's local DNS, where the chart uses longer FQDNs; the cross-cluster flat mode's short-form hostnames are the trigger.
