operator: emit v2 Redpanda Prometheus metrics #1491
Open
Adds a v2 metrics reconciler that mirrors the existing v1
ClusterMetricController for the cluster.redpanda.com/v1alpha2 Redpanda
CRD. The operator currently emits cluster/node/misconfigured metrics
only for legacy Cluster CRs (v1), leaving v2 deployments with no custom
observability beyond the kafka client and build-info gauges.
Registered metrics:
- redpandas_total
- redpanda_desired_nodes{namespace,name}
- redpanda_ready_nodes{namespace,name}
- redpanda_misconfigured_clusters{reason}
The reconciler engages with both local and provider clusters via the
multicluster manager, so metrics reflect every Redpanda CR the operator
manages. Existing RBAC already grants list/watch on redpandas, so no
RBAC or CRD regeneration is required.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ehavior
Staff-review pass on the v2 metrics reconciler. Three correctness fixes:
- Cover the StretchCluster CRD: the v2 mode for multi-k8s-cluster Redpanda has the same Status.NodePools and ConfigurationApplied shape as the regular Redpanda CRD, so the same gauges apply. A single reconciler now backs two controllers (one per CRD) and serializes its two trigger paths with a mutex. StretchClusters are deduped by (namespace, name) since the same CR is mirrored across every participating k8s cluster.
- Add a `kind` label (values "Redpanda" / "StretchCluster") to all four gauges so the two CRDs can share metric names without colliding on (namespace, name). `redpandas_total` becomes a GaugeVec.
- Skip unreachable provider clusters and log+continue on per-cluster list failures instead of aborting the whole reconcile; partial metrics are more useful than none.
Wrap with FilterNamespaceReconciler to honor `--namespace` like every other v2 controller.
CI uncovered the bug: my previous commit unconditionally registered a controller-runtime watcher for StretchCluster from cmd/run, but the StretchCluster CRD is only installed by the operator-helm chart in multicluster mode (operator/cmd/crd/crd.go's multiclusterCRDs list). In single-cluster mode the API resource does not exist, the cache informer fails to sync, and manager.Start() returns an error: the operator pod stays Not Ready, webhook calls get 'connection refused', and every downstream integration/acceptance/kuttl test times out.
cmd/multicluster has its own setup flow and does not register this metrics reconciler today, so the StretchCluster code path was never actually reachable in production. Drop it from this PR and bring it back as a follow-up that wires a separate registration into cmd/multicluster, alongside SetupMulticlusterController.
Keep the safety hardening from the same commit: skip unreachable clusters via Manager.IsClusterReachable, log+continue on per-cluster list errors instead of aborting the reconcile, and wrap with controller.FilterNamespaceReconciler so --namespace is honored. The kind label and the dedup mutex are no longer needed and are removed; the gauge labels match the original PR shape.
Just proposing backport to 25.3.x to limit backports given a customer is looking to adopt it in 25.3.x.
Summary
Adds a v2 metrics reconciler that mirrors the existing v1 ClusterMetricController for the cluster.redpanda.com/v1alpha2 Redpanda CRD. The v2 controllers in operator/internal/controller/redpanda/ previously had no custom Prometheus metrics, leaving v2 deployments with only the kafka-client and build-info gauges that v1 deployments also have.
Metrics added
- redpandas_total
- redpanda_desired_nodes{namespace,name}
- redpanda_ready_nodes{namespace,name}
- redpanda_misconfigured_clusters{reason}: … ConfigurationApplied condition is not True, by reason
Design notes
- … List against every cluster known to the multicluster manager, matching the v1 pattern.
- Unreachable clusters are skipped via Manager.IsClusterReachable, and per-cluster List failures are logged and skipped instead of aborting the reconcile. Partial metrics from healthy clusters are more useful than no metrics at all.
- The reconciler is wrapped with controller.FilterNamespaceReconciler so it honors --namespace, matching every other v2 controller in this package.
- … _total (the v1 names like redpanda_clusters_total violate the Prometheus convention that _total is reserved for counters; the v2 names follow the convention).
- … Configured condition is ConfigurationApplied; the misconfigured gauge tracks Redpandas where that condition is not True.
- A namespace label is included on per-Redpanda gauges (the v1 metric did not include one), since v2 Redpanda names can collide across namespaces.
- Existing RBAC (+kubebuilder:rbac:groups=cluster.redpanda.com,resources=redpandas,verbs=get;list;watch;...) already covers what the metrics reconciler needs, so no RBAC or CRD regeneration is required.
Why not StretchCluster?
An earlier revision of this PR also watched redpandav1alpha2.StretchCluster to give the multicluster operator mode parity. That broke CI (build #13347): the StretchCluster CRD is installed only by the operator helm chart's multicluster path (operator/cmd/crd/crd.go's multiclusterCRDs). Watching the type from cmd/run (the regular single-cluster operator) makes the controller-runtime cache informer fail to sync because the API resource does not exist there, so the operator pod stays Not Ready and every downstream test times out.
operator/cmd/multicluster/multicluster.go has its own setup flow and does not call this reconciler today, so the StretchCluster code path was never actually reachable in production anyway. The right way to add StretchCluster coverage is a follow-up that wires a separate registration alongside SetupMulticlusterController in cmd/multicluster, not in this PR.
Configuring metrics
The operator's metrics endpoint is exposed by controller-runtime. Under the helm chart it is on by default (the chart unconditionally passes --metrics-bind-address=:8443 regardless of config.metrics.bindAddress in values.yaml). The raw operator binary defaults to disabled.
Operator flags
| Flag | Default | Description |
| --- | --- | --- |
| --metrics-bind-address | 0 (disabled) | … :8443 for HTTPS, :8080 for HTTP, or leave 0 to disable. The chart hard-codes this to :8443. |
| --metrics-secure | true | … get on /metrics). Pass --metrics-secure=false for plain HTTP; only safe for local dev. |
| --enable-redpanda-controllers | true | … RedpandaMetricsReconciler is registered. |
| --enable-vectorized-controllers | false | … ClusterMetricController is also registered, alongside v2 if both are enabled. |
| --namespace | "" (cluster-wide) | … |
Helm chart (operator/chart/values.yaml)
When monitoring.enabled=true the chart emits a ServiceMonitor (operator/chart/templates/_servicemonitor.go.tpl) and a Service named <release>-metrics-service (port https, 8443). The ServiceMonitor uses bearerTokenFile=/var/run/secrets/kubernetes.io/serviceaccount/token and insecureSkipVerify=true for the self-signed cert. Prometheus-Operator or VictoriaMetrics-Operator will pick it up automatically.
Enabling Operator metrics with a ServiceMonitor: step-by-step
The following walks through what was used to verify this PR end-to-end on a fresh local k3d cluster. The same steps work on any cluster that has the Prometheus-Operator CRDs installed.
1. Spin up a Kubernetes cluster
k3d cluster create rp-metrics-test \
  --image rancher/k3s:v1.32.13-k3s1 \
  --servers 1 --agents 0 --no-lb \
  --k3s-arg "--disable=traefik@server:0"
2. Install Prometheus + the ServiceMonitor CRD
kube-prometheus-stack ships the monitoring.coreos.com CRDs (including ServiceMonitor) and a Prometheus instance configured to discover them. The *SelectorNilUsesHelmValues=false flags are important: without them Prometheus only picks up ServiceMonitors that carry the chart's release label, and the operator chart's ServiceMonitor does not.
3. Install cert-manager (operator dependency)
4. Build the operator image from this branch and load it into the cluster
(Substitute linux/amd64 for linux/arm64 on x86 hosts.)
5. Install the operator helm chart with monitoring.enabled=true
This creates:
- Deployment/rp-operator with --metrics-bind-address=:8443 --metrics-secure=true
- Service/rp-operator-metrics-service exposing port https (8443) → container port https
- ServiceMonitor/rp-operator-metrics-monitor (HTTPS scheme, bearer token, insecureSkipVerify=true)
6. Confirm Prometheus is scraping the operator
You can also pull /metrics directly off the operator using the Prometheus SA bearer token.
End-to-end test results
Cluster: k3d on rancher/k3s:v1.32.13-k3s1. Operator built from this branch (commit 72250721, redpanda_operator_build_info confirmed). The Prometheus target stayed health=up throughout every phase below. Metric values shown are the response from http://prometheus:9090/api/v1/query?query=....
| Phase | redpandas_total | redpanda_desired_nodes | redpanda_ready_nodes | redpanda_misconfigured_clusters |
| --- | --- | --- | --- | --- |
| … | 0 | | | |
| … | 1 | 1 (rp-test/redpanda) | 0 | {reason="NotReconciled"}=1 |
| … Ready=True | 1 | 1 | 1 | |
| … | 1 | 3 | 1 | |
| … cluster.config.* setting | 1 | 1 | 1 | {reason="TerminalError"}=1 |
| … | 1 | 1 | 1 | |
What this verifies:
- The chart renders a ServiceMonitor and metrics Service, with selector labels that match.
- Prometheus scrapes :8443/metrics over HTTPS using its SA bearer token (insecureSkipVerify covers the self-signed cert).
- The redpanda-metrics controller registers and reconciles (controller=redpanda-metrics … Starting workers in the operator log).
- … name=<cr>, namespace=<ns> (Prometheus relabels the metric's namespace to exported_namespace to avoid colliding with the scrape target's own namespace label; this is Prometheus default behavior, not something this PR controls).
- redpanda_misconfigured_clusters correctly tracks the ConfigurationApplied condition's reason, including transitions out of misconfig (zero series → no # HELP / # TYPE lines emitted, which is correct GaugeVec semantics).
What's emitted from where, when metrics are enabled
/metrics always exposes the controller-runtime, client-go, Go, and process baselines. Operator-specific metrics depend on which controller flag is set:
| Source | Metrics | When emitted |
| --- | --- | --- |
| cmd/version (always) | redpanda_operator_build_info{version,commit,go_version} | … init() time. |
| pkg/client kgo hooks (always once a Kafka client is built) | redpanda_operator_kafka_requests_sent_total, redpanda_operator_kafka_sent_bytes, redpanda_operator_kafka_requests_received_total, redpanda_operator_kafka_received_bytes | … RedpandaReconciler. |
| ClusterMetricController (internal/controller/vectorized/metric_controller.go) | redpanda_clusters_total, desired_redpanda_nodes_total, actual_redpanda_nodes_total, redpanda_misconfigured_clusters_total (note: pre-existing _total-suffixed gauges) | … --enable-vectorized-controllers=true (v1 mode). |
| RedpandaMetricsReconciler (this PR) (internal/controller/redpanda/metric_controller.go) | redpandas_total, redpanda_desired_nodes{namespace,name}, redpanda_ready_nodes{namespace,name}, redpanda_misconfigured_clusters{reason} | … --enable-redpanda-controllers=true (v2 mode, the default). |
| … | controller_runtime_reconcile_total, controller_runtime_reconcile_errors_total, controller_runtime_reconcile_time_seconds, controller_runtime_active_workers, controller_runtime_max_concurrent_reconciles, workqueue_*, rest_client_* | … |
| … | go_*, process_* | … |
Running with the chart defaults (--enable-redpanda-controllers=true, --enable-vectorized-controllers=false) you get the v2 metrics in this PR plus the always-on baselines, but not the v1 _total-suffixed gauges. Running both v1 and v2 (the side-by-side migration mode) gives you both metric families simultaneously; they don't share names, so there is no collision.
Files
- operator/internal/controller/redpanda/metric_controller.go: new
- operator/cmd/run/run.go: wires the reconciler in beside the v2 NodePool/Console controllers
- .changes/unreleased/operator-Added-20260428-120000.yaml: changie entry
Test plan
Static checks run locally inside nix develop (Go 1.26.1, golangci-lint v2):
- nix develop -c bash -c 'cd operator && go build ./...': passed
- nix develop -c bash -c 'cd operator && go vet ./...': passed
- nix develop -c bash -c 'cd operator && golangci-lint run ./internal/controller/redpanda/... ./cmd/run/...': 0 issues
- git diff --exit-code: clean, no generated-file drift
Buildkite CI (build #13347) initially failed on integration / acceptance / kuttl-v1-nodepools because of an unconditional StretchCluster watcher in cmd/run (the StretchCluster CRD is only installed in multicluster mode, so the cache informer never synced and the operator pod stayed Not Ready). Fixed in 72250721 by removing StretchCluster from this PR's scope; awaiting a fresh CI run.
End-to-end on a local k3d cluster (steps and results above):
- … --metrics-bind-address=:8443 --metrics-secure=true --enable-redpanda-controllers=true and verify /metrics exposes the four new gauges
- … kube-prometheus-stack + the operator chart with monitoring.enabled=true; confirm the rendered ServiceMonitor is discovered and the target reports health=up
- … redpanda_desired_nodes / redpanda_ready_nodes track correctly (1/1 → 3/1 mid-rollout)
- … redpanda_misconfigured_clusters{reason="TerminalError"} appears, then clears when the misconfig is removed