Operator 1.24.0 crashloops on startup when cluster-wide secrets RBAC is missing (regression from #2530) #2791
Description
Pre-submission Checklist
- I have searched existing issues and this is not a duplicate
- This is a Datadog Operator issue (CRDs, reconciliation, etc.), not a Datadog Agent or Datadog service problem (dashboards, monitors, etc.)
Operator version
1.24.0
Operator Helm chart version
2.19.1
Bug Report
Operator 1.24.0 (chart 2.19.x) crashloops on startup when the ServiceAccount
lacks cluster-wide list/watch on secrets.
We run rbac.create: false with a hand-managed ClusterRole scoped to only
what we need (Datadog CRDs, events, namespaces, leases, configmaps). We do
not grant cluster-wide secrets access by policy.
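
For illustration, a ClusterRole scoped this way might look roughly like the sketch below. The role name, the verbs, and the wildcard on the Datadog CRD group are assumptions made for the example, not the actual manifest; the relevant point is only that no rule covers `secrets` at cluster scope.

```yaml
# Hypothetical ClusterRole: name, verbs, and resource lists are illustrative assumptions.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: datadog-operator-restricted   # hypothetical name
rules:
  # Datadog CRDs
  - apiGroups: ["datadoghq.com"]
    resources: ["*"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  # Core resources the operator is allowed to touch
  - apiGroups: [""]
    resources: ["events", "namespaces", "configmaps"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
  # Leader election
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
  # Intentionally no rule granting list/watch on "secrets" cluster-wide.
```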
On 1.23.1 (chart 2.18.1) this was fine — the operator logged a warning
("Unable to get Secret informer, Helm metadata collection will be disabled"),
skipped the Helm metadata path, and started normally.
On 1.24.0 the same missing permission causes a cache sync timeout on
*v1.Secret, which kills mgr.Start() and the pod exits with code 1.
The behavior is nondeterministic — sometimes it starts, sometimes it doesn't,
depending on informer sync ordering/timing.
This appears to be a side effect of #2530 ("Remove early exit in
credential-dependent controllers when credentials are missing"), which
changed how the operator handles missing credentials/informers at startup.
Expected: operator starts and degrades gracefully (disables Helm metadata)
when secrets RBAC is absent, same as 1.23.1.
Actual: nondeterministic crash on startup due to Secret informer cache
sync timeout.
Steps to Reproduce
- Deploy chart 2.19.x with rbac.create: false
- Create a ClusterRole that does NOT include secrets list/watch at cluster scope
- Enable at least one controller (e.g. datadogMonitor + datadogDashboard)
- Set operatorMetricsEnabled: "true" (default)
- Observe pod: it will either start with forbidden errors or crashloop
with "failed waiting for *v1.Secret Informer to sync" / exit code 1
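
The chart settings from the steps above could be expressed as a values file along these lines. This is a sketch: `rbac.create` and `operatorMetricsEnabled` come from the report, while the `enabled` sub-keys under `datadogMonitor` and `datadogDashboard` are assumed to be how the chart toggles those controllers.

```yaml
# values.yaml sketch for datadog-operator chart 2.19.x
rbac:
  create: false                 # RBAC is managed by hand, outside the chart
operatorMetricsEnabled: "true"  # chart default, per the report
datadogMonitor:
  enabled: true                 # assumed key layout for enabling the controller
datadogDashboard:
  enabled: true                 # assumed key layout for enabling the controller
```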
Environment
- Operator: 1.24.0 (works on 1.23.1)
- Chart: datadog-operator 2.19.1
- Kubernetes: 1.29
- Helm: 3.x
Additional Context
Workarounds:
- Pin to chart 2.18.1 / operator 1.23.1
- Grant cluster-wide secrets get/list/watch
- Set operatorMetricsEnabled: "false"
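
As a sketch of the second workaround, the following rule could be appended to the `rules` list of the hand-managed ClusterRole. It grants exactly the cluster-wide access we avoid by policy, so it is shown only to make the workaround concrete.

```yaml
# Rule fragment to add under `rules:` in the existing ClusterRole
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["get", "list", "watch"]
```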
The Helm metadata feature (Secret informer for sh.helm.release.v1) probably
shouldn't be in the startup critical path when the permission is missing.
Ideally it would fail open like it did in 1.23.1.