operator: emit v2 Redpanda Prometheus metrics #1491

Open

david-yu wants to merge 3 commits into main from operator-v2-metrics

david-yu (Contributor) commented Apr 28, 2026

Summary

Adds a v2 metrics reconciler that mirrors the existing v1 ClusterMetricController for the cluster.redpanda.com/v1alpha2 Redpanda CRD. The v2 controllers in operator/internal/controller/redpanda/ previously had no custom Prometheus metrics, leaving v2 deployments with only the kafka-client and build-info gauges that v1 deployments also have.

Metrics added

| Metric | Type | Labels | Description |
|---|---|---|---|
| `redpandas_total` | Gauge | — | Number of Redpanda clusters managed by the operator |
| `redpanda_desired_nodes` | GaugeVec | `namespace`, `name` | Desired broker count per Redpanda (sum across pools) |
| `redpanda_ready_nodes` | GaugeVec | `namespace`, `name` | Ready broker count per Redpanda (sum across pools) |
| `redpanda_misconfigured_clusters` | GaugeVec | `reason` | Count of Redpandas whose ConfigurationApplied condition is not True, by reason |
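In scrape output, the four gauges above would look roughly like the following. This is a hand-built sketch in Go; the label values (`rp-test`/`redpanda`) and sample numbers are illustrative, not captured operator output:

```go
package main

import "fmt"

// expositionLines hand-builds what the four new gauges could look like in
// Prometheus text exposition format on /metrics. Values are illustrative.
func expositionLines() []string {
	return []string{
		"# TYPE redpandas_total gauge",
		"redpandas_total 1",
		"# TYPE redpanda_desired_nodes gauge",
		`redpanda_desired_nodes{name="redpanda",namespace="rp-test"} 3`,
		"# TYPE redpanda_ready_nodes gauge",
		`redpanda_ready_nodes{name="redpanda",namespace="rp-test"} 3`,
		"# TYPE redpanda_misconfigured_clusters gauge",
		`redpanda_misconfigured_clusters{reason="TerminalError"} 1`,
	}
}

func main() {
	for _, line := range expositionLines() {
		fmt.Println(line)
	}
}
```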

Design notes

  • Watches only the Redpanda CRD. StretchCluster metrics are out of scope for this PR — see "Why not StretchCluster?" below.
  • The reconciler ignores the incoming request and recomputes everything from a fresh List against every cluster known to the multicluster manager — matching the v1 pattern.
  • Unreachable provider clusters are skipped via Manager.IsClusterReachable, and per-cluster List failures are logged and skipped instead of aborting the reconcile. Partial metrics from healthy clusters are more useful than no metrics at all.
  • The reconciler is wrapped with controller.FilterNamespaceReconciler so it honors --namespace, matching every other v2 controller in this package.
  • Engages with both local and provider clusters via the multicluster manager.
  • Naming: gauges are not suffixed with _total (the v1 names like redpanda_clusters_total violate the Prometheus convention that _total is reserved for counters; the v2 names follow the convention).
  • The closest v2 equivalent of the v1 Configured condition is ConfigurationApplied; the misconfigured gauge tracks Redpandas where that condition is not True.
  • A namespace label is included on per-Redpanda gauges (the v1 metric did not), since v2 Redpanda names can collide across namespaces.
  • Existing RBAC on the v2 controller (+kubebuilder:rbac:groups=cluster.redpanda.com,resources=redpandas,verbs=get;list;watch;...) already covers what the metrics reconciler needs, so no RBAC or CRD regeneration is required.
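The per-CR math described in these notes (sum desired/ready across pools, bucket non-True ConfigurationApplied conditions by reason) can be sketched in plain Go. The type names below are illustrative stand-ins, not the operator's actual v1alpha2 types:

```go
package main

import "fmt"

// Illustrative shapes only, not the real CRD status types.
type NodePoolStatus struct {
	DesiredReplicas int32
	ReadyReplicas   int32
}

type Condition struct {
	Type   string
	Status string // "True", "False", "Unknown"
	Reason string
}

type RedpandaStatus struct {
	NodePools  []NodePoolStatus
	Conditions []Condition
}

// aggregate sums desired/ready brokers across pools and reports the
// ConfigurationApplied reason when that condition is not True.
func aggregate(st RedpandaStatus) (desired, ready int32, misconfigReason string) {
	for _, p := range st.NodePools {
		desired += p.DesiredReplicas
		ready += p.ReadyReplicas
	}
	for _, c := range st.Conditions {
		if c.Type == "ConfigurationApplied" && c.Status != "True" {
			misconfigReason = c.Reason
		}
	}
	return desired, ready, misconfigReason
}

func main() {
	d, r, reason := aggregate(RedpandaStatus{
		NodePools:  []NodePoolStatus{{3, 1}, {2, 2}},
		Conditions: []Condition{{Type: "ConfigurationApplied", Status: "False", Reason: "TerminalError"}},
	})
	fmt.Println(d, r, reason) // 5 3 TerminalError
}
```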

Why not StretchCluster?

An earlier revision of this PR also watched redpandav1alpha2.StretchCluster to give parity to the multicluster operator mode. That broke CI (build #13347): the StretchCluster CRD is installed only by the operator helm chart's multicluster path (operator/cmd/crd/crd.go's multiclusterCRDs). Watching the type from cmd/run (the regular single-cluster operator) makes the controller-runtime cache informer fail to sync because the API resource does not exist there, the operator pod stays Not Ready, and every downstream test times out.

operator/cmd/multicluster/multicluster.go has its own setup flow and does not call this reconciler today, so the StretchCluster code path was never actually reachable in production anyway. The right way to add StretchCluster coverage is a follow-up that wires a separate registration alongside SetupMulticlusterController in cmd/multicluster, not in this PR.

Configuring metrics

The operator's metrics endpoint is exposed by controller-runtime. Under the helm chart it is on by default (the chart unconditionally passes --metrics-bind-address=:8443 regardless of config.metrics.bindAddress in values.yaml). The raw operator binary defaults to disabled.

Operator flags

| Flag | Default (binary) | Effect |
|---|---|---|
| `--metrics-bind-address` | `0` (disabled) | Bind address for the metrics server. Set `:8443` for HTTPS, `:8080` for HTTP, or leave `0` to disable. The chart hard-codes this to `:8443`. |
| `--metrics-secure` | `true` | When the endpoint is enabled, serve over HTTPS with the controller-runtime auth/authz filter (requires a bearer token plus RBAC `get` on `/metrics`). Pass `--metrics-secure=false` for plain HTTP; only safe for local dev. |
| `--enable-redpanda-controllers` | `true` | v2 mode. Required for this PR's metrics: when v2 is on, the RedpandaMetricsReconciler is registered. |
| `--enable-vectorized-controllers` | `false` | v1 mode. When on, the v1 ClusterMetricController is also registered, alongside v2 if both are enabled. |
| `--namespace` | `""` (cluster-wide) | If set, the metrics reconciler only counts CRs in that namespace, matching every other v2 controller. |

Helm chart (operator/chart/values.yaml)

```yaml
monitoring:
  enabled: false   # set true to render a ServiceMonitor that scrapes /metrics over HTTPS via the SA bearer token

crds:
  enabled: false   # MUST be true on a fresh install — the Console controller crashes without its CRD
```

When monitoring.enabled=true the chart emits a ServiceMonitor (operator/chart/templates/_servicemonitor.go.tpl) and a Service named <release>-metrics-service (port https, 8443). The ServiceMonitor uses bearerTokenFile=/var/run/secrets/kubernetes.io/serviceaccount/token and insecureSkipVerify=true for the self-signed cert. Prometheus-Operator or VictoriaMetrics-Operator will pick it up automatically.

Enabling Operator metrics with a ServiceMonitor — step-by-step

The following walks through what was used to verify this PR end-to-end on a fresh local k3d cluster. The same steps work on any cluster that has the Prometheus-Operator CRDs installed.

1. Spin up a Kubernetes cluster

```shell
k3d cluster create rp-metrics-test \
  --image rancher/k3s:v1.32.13-k3s1 \
  --servers 1 --agents 0 --no-lb \
  --k3s-arg "--disable=traefik@server:0"
```

2. Install Prometheus + the ServiceMonitor CRD

kube-prometheus-stack ships the monitoring.coreos.com CRDs (including ServiceMonitor) and a Prometheus instance configured to discover them. The *SelectorNilUsesHelmValues=false flags below are important — without them Prometheus only picks up ServiceMonitors that carry the chart's release label, and the operator chart's ServiceMonitor does not.

```shell
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

kubectl create namespace monitoring

helm install kube-prom prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set grafana.enabled=false \
  --set alertmanager.enabled=false \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false \
  --set prometheus.prometheusSpec.scrapeInterval=15s \
  --wait --timeout 5m
```

3. Install cert-manager (operator dependency)

```shell
helm repo add jetstack https://charts.jetstack.io
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager --create-namespace \
  --version v1.17.2 \
  --set crds.enabled=true \
  --wait --timeout 5m
```

4. Build the operator image from this branch and load it into the cluster

```shell
git checkout operator-v2-metrics
nix develop -c bash -c 'BUILD_GOOS=linux BUILD_GOARCH=arm64 task build:operator build:alias'

docker buildx build \
  --platform linux/arm64 \
  --provenance false --sbom false \
  --file operator/Dockerfile \
  --target=manager \
  --tag localhost/redpanda-operator:metrics-test \
  --load \
  .build

k3d image import localhost/redpanda-operator:metrics-test -c rp-metrics-test
```

(Substitute linux/amd64 for linux/arm64 on x86 hosts.)

5. Install the operator helm chart with monitoring.enabled=true

```shell
helm install rp-operator operator/chart \
  --namespace redpanda-operator --create-namespace \
  --set image.repository=localhost/redpanda-operator \
  --set image.tag=metrics-test \
  --set image.pullPolicy=Never \
  --set monitoring.enabled=true \
  --set crds.enabled=true \
  --wait --timeout 5m
```

This creates:

  • Deployment/rp-operator with --metrics-bind-address=:8443 --metrics-secure=true
  • Service/rp-operator-metrics-service exposing port https (8443) → container port https
  • ServiceMonitor/rp-operator-metrics-monitor (HTTPS scheme, bearer token, insecureSkipVerify=true)

6. Confirm Prometheus is scraping the operator

```shell
kubectl -n monitoring port-forward svc/kube-prom-kube-prometheus-prometheus 9090:9090 &

# Target should report health=up, no lastError
curl -s 'http://127.0.0.1:9090/api/v1/targets?state=active' \
  | jq -r '.data.activeTargets[] | select(.labels.service=="rp-operator-metrics-service") | "\(.health)  \(.scrapeUrl)  err=\(.lastError)"'
# → up  https://10.42.0.16:8443/metrics  err=

# Query each new metric
for q in redpandas_total redpanda_desired_nodes redpanda_ready_nodes redpanda_misconfigured_clusters; do
  echo "=== $q ==="
  curl -sG 'http://127.0.0.1:9090/api/v1/query' --data-urlencode "query=$q" \
    | jq -c '.data.result[] | {labels: .metric, value: .value[1]}'
done
```

You can also pull /metrics directly off the operator using the Prometheus SA bearer token:

```shell
TOKEN=$(kubectl -n monitoring create token kube-prom-kube-prometheus-prometheus)
kubectl -n redpanda-operator port-forward svc/rp-operator-metrics-service 18443:8443 &
curl -sk -H "Authorization: Bearer $TOKEN" https://127.0.0.1:18443/metrics | grep '^redpanda'
```
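Once these series are in Prometheus, alerting on them is straightforward. Below is a sketch of a PrometheusRule that Prometheus-Operator would pick up; the alert names, durations, and thresholds are illustrative and not part of this PR:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: redpanda-operator-alerts   # illustrative name
  namespace: monitoring
spec:
  groups:
    - name: redpanda-operator
      rules:
        - alert: RedpandaMisconfigured
          expr: redpanda_misconfigured_clusters > 0
          for: 10m
          annotations:
            summary: "Redpanda has ConfigurationApplied != True (reason {{ $labels.reason }})"
        - alert: RedpandaBrokersNotReady
          expr: redpanda_ready_nodes < redpanda_desired_nodes
          for: 15m
          annotations:
            summary: "{{ $labels.namespace }}/{{ $labels.name }} has fewer ready brokers than desired"
```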

End-to-end test results

Cluster: k3d on rancher/k3s:v1.32.13-k3s1. Operator built from this branch (commit 72250721, redpanda_operator_build_info confirmed). The Prometheus target stayed health=up throughout every phase below. Metric values shown are the response from http://prometheus:9090/api/v1/query?query=....

| Phase | `redpandas_total` | `redpanda_desired_nodes` | `redpanda_ready_nodes` | `redpanda_misconfigured_clusters` |
|---|---|---|---|---|
| No Redpanda CR | 0 | — (no series) | — (no series) | — (no series) |
| 1-replica CR, broker OOM-crashing | 1 | 1 (`rp-test/redpanda`) | 0 | `{reason="NotReconciled"}`=1 |
| 1-replica CR, broker Ready=True | 1 | 1 | 1 | — (cleared) |
| Scaled to 3, mid-rollout (1 of 3 ready) | 1 | 3 | 1 | — |
| Injected a bogus `cluster.config.*` setting | 1 | 1 | 1 | `{reason="TerminalError"}`=1 |
| Removed the bogus setting | 1 | 1 | 1 | — (cleared) |

What this verifies:

  • The chart correctly emits the ServiceMonitor and metrics Service, with selector labels that match.
  • Prometheus discovers the ServiceMonitor and successfully scrapes the operator's :8443/metrics over HTTPS using its SA bearer token (insecureSkipVerify covers the self-signed cert).
  • The new redpanda-metrics controller registers and reconciles (controller=redpanda-metrics … Starting workers in the operator log).
  • All four PR metrics emit and update correctly. Per-cluster gauges carry name=<cr>, namespace=<ns> (Prometheus relabels the metric's namespace to exported_namespace to avoid colliding with the kubelet-injected pod namespace — Prometheus default behavior, not something this PR controls).
  • redpanda_misconfigured_clusters correctly tracks the ConfigurationApplied condition's reason, including transitions out of misconfig (zero series → no # HELP / # TYPE lines emitted, which is correct GaugeVec semantics).
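The "(cleared)" transitions above fall out of the delete-then-set pattern a GaugeVec-based reconciler can use: wipe every previously exported series, then set only what was observed this pass. A stdlib-only simulation follows, where a map stands in for `prometheus.GaugeVec` (whose real `Reset()` and `WithLabelValues().Set()` give the same behavior):

```go
package main

import "fmt"

// gaugeVec is a stdlib stand-in for a labeled gauge family: one entry
// per exported label series.
type gaugeVec map[string]float64

// reconcile mimics Reset-then-Set: drop every previously exported
// series, then re-Set only the series observed this pass. A series that
// vanished (e.g. a cleared misconfig reason) simply stops being exported.
func reconcile(g gaugeVec, observed map[string]float64) {
	for k := range g {
		delete(g, k)
	}
	for reason, v := range observed {
		g[reason] = v
	}
}

func main() {
	g := gaugeVec{}
	reconcile(g, map[string]float64{"NotReconciled": 1})
	fmt.Println(len(g)) // 1: series exported while misconfigured
	reconcile(g, map[string]float64{})
	fmt.Println(len(g)) // 0: misconfig fixed, series cleared
}
```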

What's emitted from where, when metrics are enabled

/metrics always exposes the controller-runtime, client-go, Go, and process baselines. Operator-specific metrics depend on which controller flag is set:

| Source | Metrics | Enabled when |
|---|---|---|
| `cmd/version` | `redpanda_operator_build_info{version,commit,go_version}` | Always — registered at `init()` time. |
| `pkg/client` kgo hooks | `redpanda_operator_kafka_requests_sent_total`, `redpanda_operator_kafka_sent_bytes`, `redpanda_operator_kafka_requests_received_total`, `redpanda_operator_kafka_received_bytes` | First time any controller (v1 or v2) opens a Kafka connection — Topic / User / ACL / Schema / Group / Role controllers, plus the v2 RedpandaReconciler. |
| v1 ClusterMetricController (`internal/controller/vectorized/metric_controller.go`) | `redpanda_clusters_total`, `desired_redpanda_nodes_total`, `actual_redpanda_nodes_total`, `redpanda_misconfigured_clusters_total` (note: pre-existing `_total`-suffixed gauges) | `--enable-vectorized-controllers=true` (v1 mode). |
| v2 RedpandaMetricsReconciler, this PR (`internal/controller/redpanda/metric_controller.go`) | `redpandas_total`, `redpanda_desired_nodes{namespace,name}`, `redpanda_ready_nodes{namespace,name}`, `redpanda_misconfigured_clusters{reason}` | `--enable-redpanda-controllers=true` (v2 mode, the default). |
| controller-runtime baseline (from imports) | `controller_runtime_reconcile_total`, `controller_runtime_reconcile_errors_total`, `controller_runtime_reconcile_time_seconds`, `controller_runtime_active_workers`, `controller_runtime_max_concurrent_reconciles`, `workqueue_*`, `rest_client_*` | Always, contributed per registered controller (Redpanda / NodePool / Topic / User / Role / Group / Schema / Console / RedpandaMetrics / etc.). |
| Go / process | `go_*`, `process_*` | Always — promhttp default collectors. |

Running with the chart defaults (--enable-redpanda-controllers=true, --enable-vectorized-controllers=false) you get the v2 metrics in this PR plus the always-on baselines, but not the v1 _total-suffixed gauges. Running both v1 and v2 (the side-by-side migration mode) gives you both metric families simultaneously — they don't share names, so no collision.

Files

  • operator/internal/controller/redpanda/metric_controller.go — new
  • operator/cmd/run/run.go — wires the reconciler in beside the v2 NodePool/Console controllers
  • .changes/unreleased/operator-Added-20260428-120000.yaml — changie entry

Test plan

Static checks run locally inside nix develop (Go 1.26.1, golangci-lint v2):

  • nix develop -c bash -c 'cd operator && go build ./...' — passed
  • nix develop -c bash -c 'cd operator && go vet ./...' — passed
  • nix develop -c bash -c 'cd operator && golangci-lint run ./internal/controller/redpanda/... ./cmd/run/...' — 0 issues
  • git diff --exit-code — clean, no generated-file drift

Buildkite CI (build #13347) initially failed on integration / acceptance / kuttl-v1-nodepools because of an unconditional StretchCluster watcher in cmd/run (the StretchCluster CRD is only installed in multicluster mode, so the cache informer never synced and the operator pod stayed Not Ready). Fixed in 72250721 by removing StretchCluster from this PR's scope; awaiting a fresh CI run.

End-to-end on a local k3d cluster (steps and results above):

  • Run the operator with --metrics-bind-address=:8443 --metrics-secure=true --enable-redpanda-controllers=true and verify /metrics exposes the four new gauges
  • Install kube-prometheus-stack + the operator chart with monitoring.enabled=true; confirm the rendered ServiceMonitor is discovered and the target reports health=up
  • Scale a Redpanda CR up/down and confirm redpanda_desired_nodes / redpanda_ready_nodes track correctly (1/1 → 3/1 mid-rollout)
  • Force a misconfiguration and confirm redpanda_misconfigured_clusters{reason="TerminalError"} appears, then clears when the misconfig is removed

david-yu and others added 3 commits April 28, 2026 14:07
Adds a v2 metrics reconciler that mirrors the existing v1
ClusterMetricController for the cluster.redpanda.com/v1alpha2 Redpanda
CRD. The operator currently emits cluster/node/misconfigured metrics
only for legacy Cluster CRs (v1), leaving v2 deployments with no custom
observability beyond the kafka client and build-info gauges.

Registered metrics:
  - redpandas_total
  - redpanda_desired_nodes{namespace,name}
  - redpanda_ready_nodes{namespace,name}
  - redpanda_misconfigured_clusters{reason}

The reconciler engages with both local and provider clusters via the
multicluster manager, so metrics reflect every Redpanda CR the operator
manages. Existing RBAC already grants list/watch on redpandas, so no
RBAC or CRD regeneration is required.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ehavior

Staff-review pass on the v2 metrics reconciler. Three correctness fixes:

- Cover the StretchCluster CRD: the v2 mode for multi-k8s-cluster
  Redpanda has the same Status.NodePools and ConfigurationApplied
  shape as the regular Redpanda CRD, so the same gauges apply. A
  single reconciler now backs two controllers (one per CRD) and
  serializes its two trigger paths with a mutex. StretchClusters are
  deduped by (namespace, name) since the same CR is mirrored across
  every participating k8s cluster.
- Add a `kind` label (values "Redpanda" / "StretchCluster") to all
  four gauges so the two CRDs can share metric names without colliding
  on (namespace, name). `redpandas_total` becomes a GaugeVec.
- Skip unreachable provider clusters and log+continue on per-cluster
  list failures instead of aborting the whole reconcile — partial
  metrics are more useful than none. Wrap with FilterNamespaceReconciler
  to honor `--namespace` like every other v2 controller.
CI uncovered the bug: my previous commit unconditionally registered a
controller-runtime watcher for StretchCluster from cmd/run, but the
StretchCluster CRD is only installed by the operator-helm chart in
multicluster mode (operator/cmd/crd/crd.go's multiclusterCRDs list).
In single-cluster mode the API resource does not exist, the cache
informer fails to sync, and manager.Start() returns an error — the
operator pod stays Not Ready, webhook calls get 'connection refused',
and every downstream integration/acceptance/kuttl test times out.

cmd/multicluster has its own setup flow and does not register this
metrics reconciler today, so the StretchCluster code path was never
actually reachable in production. Drop it from this PR — bring it
back as a follow-up that wires a separate registration into
cmd/multicluster, alongside SetupMulticlusterController.

Keep the safety hardening from the same commit: skip unreachable
clusters via Manager.IsClusterReachable, log+continue on per-cluster
list errors instead of aborting the reconcile, and wrap with
controller.FilterNamespaceReconciler so --namespace is honored.
The kind label and the dedup mutex are no longer needed and are
removed; the gauge labels match the original PR shape.
david-yu (Contributor, Author) commented:

Proposing a backport to 25.3.x only (to limit the number of backports), since a customer is looking to adopt this in 25.3.x.

