operator: emit v2 Redpanda Prometheus metrics #1491

Open

david-yu wants to merge 3 commits into main from operator-v2-metrics

david-yu (Contributor) commented Apr 28, 2026

Summary

Adds a v2 metrics reconciler that mirrors the existing v1 ClusterMetricController for the cluster.redpanda.com/v1alpha2 Redpanda CRD. The v2 controllers in operator/internal/controller/redpanda/ previously had no custom Prometheus metrics, leaving v2 deployments with only the kafka-client and build-info gauges that v1 deployments also have.

Metrics added

| Metric | Type | Labels | Description |
|---|---|---|---|
| `redpandas_total` | Gauge | — | Number of Redpanda clusters managed by the operator |
| `redpanda_desired_nodes` | GaugeVec | `namespace`, `name` | Desired broker count per Redpanda (sum across pools) |
| `redpanda_ready_nodes` | GaugeVec | `namespace`, `name` | Ready broker count per Redpanda (sum across pools) |
| `redpanda_misconfigured_clusters` | GaugeVec | `reason` | Count of Redpandas whose ConfigurationApplied condition is not True, by reason |
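In scrape output, the four gauges above would look roughly like the following. This is a hand-built sketch in Go; the label values (`rp-test`/`redpanda`) and sample numbers are illustrative, not captured operator output:

```go
package main

import "fmt"

// expositionLines hand-builds what the four new gauges could look like in
// Prometheus text exposition format on /metrics. Values are illustrative.
func expositionLines() []string {
	return []string{
		"# TYPE redpandas_total gauge",
		"redpandas_total 1",
		"# TYPE redpanda_desired_nodes gauge",
		`redpanda_desired_nodes{name="redpanda",namespace="rp-test"} 3`,
		"# TYPE redpanda_ready_nodes gauge",
		`redpanda_ready_nodes{name="redpanda",namespace="rp-test"} 3`,
		"# TYPE redpanda_misconfigured_clusters gauge",
		`redpanda_misconfigured_clusters{reason="TerminalError"} 1`,
	}
}

func main() {
	for _, line := range expositionLines() {
		fmt.Println(line)
	}
}
```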

Design notes

  • Watches only the Redpanda CRD. StretchCluster metrics are out of scope for this PR — see "Why not StretchCluster?" below.
  • The reconciler ignores the incoming request and recomputes everything from a fresh List against every cluster known to the multicluster manager — matching the v1 pattern.
  • Unreachable provider clusters are skipped via Manager.IsClusterReachable, and per-cluster List failures are logged and skipped instead of aborting the reconcile. Partial metrics from healthy clusters are more useful than no metrics at all.
  • The reconciler is wrapped with controller.FilterNamespaceReconciler so it honors --namespace, matching every other v2 controller in this package.
  • Engages with both local and provider clusters via the multicluster manager.
  • Naming: gauges are not suffixed with _total (the v1 names like redpanda_clusters_total violate the Prometheus convention that _total is reserved for counters; the v2 names follow the convention).
  • The closest v2 equivalent of the v1 Configured condition is ConfigurationApplied; the misconfigured gauge tracks Redpandas where that condition is not True.
  • A namespace label is included on per-Redpanda gauges (the v1 metric did not), since v2 Redpanda names can collide across namespaces.
  • Existing RBAC on the v2 controller (+kubebuilder:rbac:groups=cluster.redpanda.com,resources=redpandas,verbs=get;list;watch;...) already covers what the metrics reconciler needs, so no RBAC or CRD regeneration is required.
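The per-CR math described in these notes (sum desired/ready across pools, bucket non-True ConfigurationApplied conditions by reason) can be sketched in plain Go. The type names below are illustrative stand-ins, not the operator's actual v1alpha2 types:

```go
package main

import "fmt"

// Illustrative shapes only, not the real CRD status types.
type NodePoolStatus struct {
	DesiredReplicas int32
	ReadyReplicas   int32
}

type Condition struct {
	Type   string
	Status string // "True", "False", "Unknown"
	Reason string
}

type RedpandaStatus struct {
	NodePools  []NodePoolStatus
	Conditions []Condition
}

// aggregate sums desired/ready brokers across pools and reports the
// ConfigurationApplied reason when that condition is not True.
func aggregate(st RedpandaStatus) (desired, ready int32, misconfigReason string) {
	for _, p := range st.NodePools {
		desired += p.DesiredReplicas
		ready += p.ReadyReplicas
	}
	for _, c := range st.Conditions {
		if c.Type == "ConfigurationApplied" && c.Status != "True" {
			misconfigReason = c.Reason
		}
	}
	return desired, ready, misconfigReason
}

func main() {
	d, r, reason := aggregate(RedpandaStatus{
		NodePools:  []NodePoolStatus{{3, 1}, {2, 2}},
		Conditions: []Condition{{Type: "ConfigurationApplied", Status: "False", Reason: "TerminalError"}},
	})
	fmt.Println(d, r, reason) // 5 3 TerminalError
}
```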

Why not StretchCluster?

An earlier revision of this PR also watched redpandav1alpha2.StretchCluster to give parity to the multicluster operator mode. That broke CI (build #13347): the StretchCluster CRD is installed only by the operator helm chart's multicluster path (operator/cmd/crd/crd.go's multiclusterCRDs). Watching the type from cmd/run (the regular single-cluster operator) makes the controller-runtime cache informer fail to sync because the API resource does not exist there, the operator pod stays Not Ready, and every downstream test times out.

operator/cmd/multicluster/multicluster.go has its own setup flow and does not call this reconciler today, so the StretchCluster code path was never actually reachable in production anyway. The right way to add StretchCluster coverage is a follow-up that wires a separate registration alongside SetupMulticlusterController in cmd/multicluster, not in this PR.

Configuring metrics

The operator's metrics endpoint is exposed by controller-runtime. Under the helm chart it is on by default (the chart unconditionally passes --metrics-bind-address=:8443 regardless of config.metrics.bindAddress in values.yaml). The raw operator binary defaults to disabled.

Operator flags

| Flag | Default (binary) | Effect |
|---|---|---|
| `--metrics-bind-address` | `0` (disabled) | Bind address for the metrics server. Set `:8443` for HTTPS, `:8080` for HTTP, or leave `0` to disable. The chart hard-codes this to `:8443`. |
| `--metrics-secure` | `true` | When the endpoint is enabled, serve over HTTPS with the controller-runtime auth/authz filter (requires a bearer token plus RBAC `get` on `/metrics`). Pass `--metrics-secure=false` for plain HTTP; only safe for local dev. |
| `--enable-redpanda-controllers` | `true` | v2 mode. Required for this PR's metrics: when v2 is on, the RedpandaMetricsReconciler is registered. |
| `--enable-vectorized-controllers` | `false` | v1 mode. When on, the v1 ClusterMetricController is also registered, alongside v2 if both are enabled. |
| `--namespace` | `""` (cluster-wide) | If set, the metrics reconciler only counts CRs in that namespace, matching every other v2 controller. |

Helm chart (operator/chart/values.yaml)

```yaml
monitoring:
  enabled: false   # set true to render a ServiceMonitor that scrapes /metrics over HTTPS via the SA bearer token

crds:
  enabled: false   # MUST be true on a fresh install — the Console controller crashes without its CRD
```

When monitoring.enabled=true the chart emits a ServiceMonitor (operator/chart/templates/_servicemonitor.go.tpl) and a Service named <release>-metrics-service (port https, 8443). The ServiceMonitor uses bearerTokenFile=/var/run/secrets/kubernetes.io/serviceaccount/token and insecureSkipVerify=true for the self-signed cert. Prometheus-Operator or VictoriaMetrics-Operator will pick it up automatically.

Enabling Operator metrics with a ServiceMonitor — step-by-step

The following walks through what was used to verify this PR end-to-end on a fresh local k3d cluster. The same steps work on any cluster that has the Prometheus-Operator CRDs installed.

1. Spin up a Kubernetes cluster

```shell
k3d cluster create rp-metrics-test \
  --image rancher/k3s:v1.32.13-k3s1 \
  --servers 1 --agents 0 --no-lb \
  --k3s-arg "--disable=traefik@server:0"
```

2. Install Prometheus + the ServiceMonitor CRD

kube-prometheus-stack ships the monitoring.coreos.com CRDs (including ServiceMonitor) and a Prometheus instance configured to discover them. The *SelectorNilUsesHelmValues=false flags below are important — without them Prometheus only picks up ServiceMonitors that carry the chart's release label, and the operator chart's ServiceMonitor does not.

```shell
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

kubectl create namespace monitoring

helm install kube-prom prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set grafana.enabled=false \
  --set alertmanager.enabled=false \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false \
  --set prometheus.prometheusSpec.scrapeInterval=15s \
  --wait --timeout 5m
```

3. Install cert-manager (operator dependency)

```shell
helm repo add jetstack https://charts.jetstack.io
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager --create-namespace \
  --version v1.17.2 \
  --set crds.enabled=true \
  --wait --timeout 5m
```

4. Build the operator image from this branch and load it into the cluster

```shell
git checkout operator-v2-metrics
nix develop -c bash -c 'BUILD_GOOS=linux BUILD_GOARCH=arm64 task build:operator build:alias'

docker buildx build \
  --platform linux/arm64 \
  --provenance false --sbom false \
  --file operator/Dockerfile \
  --target=manager \
  --tag localhost/redpanda-operator:metrics-test \
  --load \
  .build

k3d image import localhost/redpanda-operator:metrics-test -c rp-metrics-test
```

(Substitute linux/amd64 for linux/arm64 on x86 hosts.)

5. Install the operator helm chart with monitoring.enabled=true

```shell
helm install rp-operator operator/chart \
  --namespace redpanda-operator --create-namespace \
  --set image.repository=localhost/redpanda-operator \
  --set image.tag=metrics-test \
  --set image.pullPolicy=Never \
  --set monitoring.enabled=true \
  --set crds.enabled=true \
  --wait --timeout 5m
```

This creates:

  • Deployment/rp-operator with --metrics-bind-address=:8443 --metrics-secure=true
  • Service/rp-operator-metrics-service exposing port https (8443) → container port https
  • ServiceMonitor/rp-operator-metrics-monitor (HTTPS scheme, bearer token, insecureSkipVerify=true)

6. Confirm Prometheus is scraping the operator

```shell
kubectl -n monitoring port-forward svc/kube-prom-kube-prometheus-prometheus 9090:9090 &

# Target should report health=up, no lastError
curl -s 'http://127.0.0.1:9090/api/v1/targets?state=active' \
  | jq -r '.data.activeTargets[] | select(.labels.service=="rp-operator-metrics-service") | "\(.health)  \(.scrapeUrl)  err=\(.lastError)"'
# → up  https://10.42.0.16:8443/metrics  err=

# Query each new metric
for q in redpandas_total redpanda_desired_nodes redpanda_ready_nodes redpanda_misconfigured_clusters; do
  echo "=== $q ==="
  curl -sG 'http://127.0.0.1:9090/api/v1/query' --data-urlencode "query=$q" \
    | jq -c '.data.result[] | {labels: .metric, value: .value[1]}'
done
```

You can also pull /metrics directly off the operator using the Prometheus SA bearer token:

```shell
TOKEN=$(kubectl -n monitoring create token kube-prom-kube-prometheus-prometheus)
kubectl -n redpanda-operator port-forward svc/rp-operator-metrics-service 18443:8443 &
curl -sk -H "Authorization: Bearer $TOKEN" https://127.0.0.1:18443/metrics | grep '^redpanda'
```
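Once these series are in Prometheus, alerting on them is straightforward. Below is a sketch of a PrometheusRule that Prometheus-Operator would pick up; the alert names, durations, and thresholds are illustrative and not part of this PR:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: redpanda-operator-alerts   # illustrative name
  namespace: monitoring
spec:
  groups:
    - name: redpanda-operator
      rules:
        - alert: RedpandaMisconfigured
          expr: redpanda_misconfigured_clusters > 0
          for: 10m
          annotations:
            summary: "Redpanda has ConfigurationApplied != True (reason {{ $labels.reason }})"
        - alert: RedpandaBrokersNotReady
          expr: redpanda_ready_nodes < redpanda_desired_nodes
          for: 15m
          annotations:
            summary: "{{ $labels.namespace }}/{{ $labels.name }} has fewer ready brokers than desired"
```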

End-to-end test results

Cluster: k3d on rancher/k3s:v1.32.13-k3s1. Operator built from this branch (commit 72250721, redpanda_operator_build_info confirmed). The Prometheus target stayed health=up throughout every phase below. Metric values shown are the response from http://prometheus:9090/api/v1/query?query=....

| Phase | `redpandas_total` | `redpanda_desired_nodes` | `redpanda_ready_nodes` | `redpanda_misconfigured_clusters` |
|---|---|---|---|---|
| No Redpanda CR | 0 | — (no series) | — (no series) | — (no series) |
| 1-replica CR, broker OOM-crashing | 1 | 1 (`rp-test/redpanda`) | 0 | `{reason="NotReconciled"}`=1 |
| 1-replica CR, broker Ready=True | 1 | 1 | 1 | — (cleared) |
| Scaled to 3, mid-rollout (1 of 3 ready) | 1 | 3 | 1 | — |
| Injected a bogus `cluster.config.*` setting | 1 | 1 | 1 | `{reason="TerminalError"}`=1 |
| Removed the bogus setting | 1 | 1 | 1 | — (cleared) |

What this verifies:

  • The chart correctly emits the ServiceMonitor and metrics Service, with selector labels that match.
  • Prometheus discovers the ServiceMonitor and successfully scrapes the operator's :8443/metrics over HTTPS using its SA bearer token (insecureSkipVerify covers the self-signed cert).
  • The new redpanda-metrics controller registers and reconciles (controller=redpanda-metrics … Starting workers in the operator log).
  • All four PR metrics emit and update correctly. Per-cluster gauges carry name=<cr>, namespace=<ns> (Prometheus relabels the metric's namespace to exported_namespace to avoid colliding with the kubelet-injected pod namespace — Prometheus default behavior, not something this PR controls).
  • redpanda_misconfigured_clusters correctly tracks the ConfigurationApplied condition's reason, including transitions out of misconfig (zero series → no # HELP / # TYPE lines emitted, which is correct GaugeVec semantics).
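The "(cleared)" transitions above fall out of the delete-then-set pattern a GaugeVec-based reconciler can use: wipe every previously exported series, then set only what was observed this pass. A stdlib-only simulation follows, where a map stands in for `prometheus.GaugeVec` (whose real `Reset()` and `WithLabelValues().Set()` give the same behavior):

```go
package main

import "fmt"

// gaugeVec is a stdlib stand-in for a labeled gauge family: one entry
// per exported label series.
type gaugeVec map[string]float64

// reconcile mimics Reset-then-Set: drop every previously exported
// series, then re-Set only the series observed this pass. A series that
// vanished (e.g. a cleared misconfig reason) simply stops being exported.
func reconcile(g gaugeVec, observed map[string]float64) {
	for k := range g {
		delete(g, k)
	}
	for reason, v := range observed {
		g[reason] = v
	}
}

func main() {
	g := gaugeVec{}
	reconcile(g, map[string]float64{"NotReconciled": 1})
	fmt.Println(len(g)) // 1: series exported while misconfigured
	reconcile(g, map[string]float64{})
	fmt.Println(len(g)) // 0: misconfig fixed, series cleared
}
```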

What's emitted from where, when metrics are enabled

/metrics always exposes the controller-runtime, client-go, Go, and process baselines. Operator-specific metrics depend on which controller flag is set:

| Source | Metrics | Enabled when |
|---|---|---|
| `cmd/version` | `redpanda_operator_build_info{version,commit,go_version}` | Always — registered at `init()` time. |
| `pkg/client` kgo hooks | `redpanda_operator_kafka_requests_sent_total`, `redpanda_operator_kafka_sent_bytes`, `redpanda_operator_kafka_requests_received_total`, `redpanda_operator_kafka_received_bytes` | First time any controller (v1 or v2) opens a Kafka connection — Topic / User / ACL / Schema / Group / Role controllers, plus the v2 RedpandaReconciler. |
| v1 ClusterMetricController (`internal/controller/vectorized/metric_controller.go`) | `redpanda_clusters_total`, `desired_redpanda_nodes_total`, `actual_redpanda_nodes_total`, `redpanda_misconfigured_clusters_total` (note: pre-existing `_total`-suffixed gauges) | `--enable-vectorized-controllers=true` (v1 mode). |
| v2 RedpandaMetricsReconciler, this PR (`internal/controller/redpanda/metric_controller.go`) | `redpandas_total`, `redpanda_desired_nodes{namespace,name}`, `redpanda_ready_nodes{namespace,name}`, `redpanda_misconfigured_clusters{reason}` | `--enable-redpanda-controllers=true` (v2 mode, the default). |
| controller-runtime baseline (from imports) | `controller_runtime_reconcile_total`, `controller_runtime_reconcile_errors_total`, `controller_runtime_reconcile_time_seconds`, `controller_runtime_active_workers`, `controller_runtime_max_concurrent_reconciles`, `workqueue_*`, `rest_client_*` | Always, contributed per registered controller (Redpanda / NodePool / Topic / User / Role / Group / Schema / Console / RedpandaMetrics / etc.). |
| Go / process | `go_*`, `process_*` | Always — promhttp default collectors. |

Running with the chart defaults (--enable-redpanda-controllers=true, --enable-vectorized-controllers=false) you get the v2 metrics in this PR plus the always-on baselines, but not the v1 _total-suffixed gauges. Running both v1 and v2 (the side-by-side migration mode) gives you both metric families simultaneously — they don't share names, so no collision.

Files

  • operator/internal/controller/redpanda/metric_controller.go — new
  • operator/cmd/run/run.go — wires the reconciler in beside the v2 NodePool/Console controllers
  • .changes/unreleased/operator-Added-20260428-120000.yaml — changie entry

Test plan

Static checks run locally inside nix develop (Go 1.26.1, golangci-lint v2):

  • nix develop -c bash -c 'cd operator && go build ./...' — passed
  • nix develop -c bash -c 'cd operator && go vet ./...' — passed
  • nix develop -c bash -c 'cd operator && golangci-lint run ./internal/controller/redpanda/... ./cmd/run/...' — 0 issues
  • git diff --exit-code — clean, no generated-file drift

Buildkite CI (build #13347) initially failed on integration / acceptance / kuttl-v1-nodepools because of an unconditional StretchCluster watcher in cmd/run (the StretchCluster CRD is only installed in multicluster mode, so the cache informer never synced and the operator pod stayed Not Ready). Fixed in 72250721 by removing StretchCluster from this PR's scope; awaiting a fresh CI run.

End-to-end on a local k3d cluster (steps and results above):

  • Run the operator with --metrics-bind-address=:8443 --metrics-secure=true --enable-redpanda-controllers=true and verify /metrics exposes the four new gauges
  • Install kube-prometheus-stack + the operator chart with monitoring.enabled=true; confirm the rendered ServiceMonitor is discovered and the target reports health=up
  • Scale a Redpanda CR up/down and confirm redpanda_desired_nodes / redpanda_ready_nodes track correctly (1/1 → 3/1 mid-rollout)
  • Force a misconfiguration and confirm redpanda_misconfigured_clusters{reason="TerminalError"} appears, then clears when the misconfig is removed

david-yu and others added 3 commits April 28, 2026 14:07
Adds a v2 metrics reconciler that mirrors the existing v1
ClusterMetricController for the cluster.redpanda.com/v1alpha2 Redpanda
CRD. The operator currently emits cluster/node/misconfigured metrics
only for legacy Cluster CRs (v1), leaving v2 deployments with no custom
observability beyond the kafka client and build-info gauges.

Registered metrics:
  - redpandas_total
  - redpanda_desired_nodes{namespace,name}
  - redpanda_ready_nodes{namespace,name}
  - redpanda_misconfigured_clusters{reason}

The reconciler engages with both local and provider clusters via the
multicluster manager, so metrics reflect every Redpanda CR the operator
manages. Existing RBAC already grants list/watch on redpandas, so no
RBAC or CRD regeneration is required.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ehavior

Staff-review pass on the v2 metrics reconciler. Three correctness fixes:

- Cover the StretchCluster CRD: the v2 mode for multi-k8s-cluster
  Redpanda has the same Status.NodePools and ConfigurationApplied
  shape as the regular Redpanda CRD, so the same gauges apply. A
  single reconciler now backs two controllers (one per CRD) and
  serializes its two trigger paths with a mutex. StretchClusters are
  deduped by (namespace, name) since the same CR is mirrored across
  every participating k8s cluster.
- Add a `kind` label (values "Redpanda" / "StretchCluster") to all
  four gauges so the two CRDs can share metric names without colliding
  on (namespace, name). `redpandas_total` becomes a GaugeVec.
- Skip unreachable provider clusters and log+continue on per-cluster
  list failures instead of aborting the whole reconcile — partial
  metrics are more useful than none. Wrap with FilterNamespaceReconciler
  to honor `--namespace` like every other v2 controller.
CI uncovered the bug: my previous commit unconditionally registered a
controller-runtime watcher for StretchCluster from cmd/run, but the
StretchCluster CRD is only installed by the operator-helm chart in
multicluster mode (operator/cmd/crd/crd.go's multiclusterCRDs list).
In single-cluster mode the API resource does not exist, the cache
informer fails to sync, and manager.Start() returns an error — the
operator pod stays Not Ready, webhook calls get 'connection refused',
and every downstream integration/acceptance/kuttl test times out.

cmd/multicluster has its own setup flow and does not register this
metrics reconciler today, so the StretchCluster code path was never
actually reachable in production. Drop it from this PR — bring it
back as a follow-up that wires a separate registration into
cmd/multicluster, alongside SetupMulticlusterController.

Keep the safety hardening from the same commit: skip unreachable
clusters via Manager.IsClusterReachable, log+continue on per-cluster
list errors instead of aborting the reconcile, and wrap with
controller.FilterNamespaceReconciler so --namespace is honored.
The kind label and the dedup mutex are no longer needed and are
removed; the gauge labels match the original PR shape.
david-yu (Contributor, Author) commented:

Proposing a backport to 25.3.x only (to limit the number of backports), since a customer is looking to adopt this in 25.3.x.

