Operator 1.24.0 crashloops on startup when cluster-wide secrets RBAC is missing (regression from #2530) #2791

@Paul-Weaver

Pre-submission Checklist

  • I have searched existing issues and this is not a duplicate
  • This is a Datadog Operator issue (CRDs, reconciliation, etc.), not a Datadog Agent or Datadog service problem (dashboards, monitors, etc.)

Operator version

1.24.0

Operator Helm chart version

2.19.1

Bug Report

Operator 1.24.0 (chart 2.19.x) crashloops on startup when the ServiceAccount
lacks cluster-wide list/watch on secrets.

We run rbac.create: false with a hand-managed ClusterRole scoped to only
what we need (Datadog CRDs, events, namespaces, leases, configmaps). We do
not grant cluster-wide secrets access by policy.

On 1.23.1 (chart 2.18.1) this was fine — the operator logged a warning
("Unable to get Secret informer, Helm metadata collection will be disabled"),
skipped the Helm metadata path, and started normally.

On 1.24.0 the same missing permission causes a cache sync timeout on
*v1.Secret, which makes mgr.Start() fail, and the pod exits with code 1.
The behavior is nondeterministic: sometimes the operator starts, sometimes
it doesn't, depending on informer sync ordering/timing.

This appears to be a side effect of #2530 ("Remove early exit in
credential-dependent controllers when credentials are missing"), which
changed how the operator handles missing credentials/informers at startup.

Expected: operator starts and degrades gracefully (disables Helm metadata)
when secrets RBAC is absent, same as 1.23.1.

Actual: nondeterministic crash on startup due to Secret informer cache
sync timeout.

Steps to Reproduce

  1. Deploy chart 2.19.x with rbac.create: false
  2. Create a ClusterRole that does NOT include secrets list/watch at cluster scope
  3. Enable at least one controller (e.g. datadogMonitor + datadogDashboard)
  4. Set operatorMetricsEnabled: "true" (default)
  5. Observe the pod: it either starts while logging forbidden errors, or
    crashloops with "failed waiting for *v1.Secret Informer to sync" / exit code 1
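
For reference, the chart values from steps 1, 3 and 4 can be sketched roughly as below. This is an illustration rather than our exact values file: rbac.create and operatorMetricsEnabled are the keys named above, while the datadogMonitor.enabled / datadogDashboard.enabled layout is an assumption about how the chart exposes the controller toggles.

```yaml
# values.yaml for chart datadog-operator 2.19.x (illustrative sketch)
rbac:
  create: false                  # RBAC is managed by hand, outside the chart
datadogMonitor:
  enabled: true                  # assumed key layout for enabling this controller
datadogDashboard:
  enabled: true                  # assumed key layout for enabling this controller
operatorMetricsEnabled: "true"   # chart default, shown explicitly for clarity
```

The ClusterRole bound to the operator's ServiceAccount is whatever hand-managed role you already use, as long as it contains no rule for secrets at cluster scope.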

Environment

  • Operator: 1.24.0 (works on 1.23.1)
  • Chart: datadog-operator 2.19.1
  • Kubernetes: 1.29
  • Helm: 3.x

Additional Context

Workarounds:

  • Pin to chart 2.18.1 / operator 1.23.1
  • Grant cluster-wide secrets get/list/watch
  • Set operatorMetricsEnabled: "false"
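
As a sketch of the last two workarounds: the RBAC one amounts to appending a rule like the following to the hand-managed ClusterRole. This is standard Kubernetes RBAC syntax; where it goes in your role is up to you.

```yaml
# Extra rule for the hand-managed ClusterRole (workaround only;
# this is exactly the cluster-wide access our policy tries to avoid)
- apiGroups: [""]          # core API group, where Secret lives
  resources: ["secrets"]
  verbs: ["get", "list", "watch"]
```

The metrics workaround is a single values override, using the same key as in the reproduction steps:

```yaml
# values.yaml override
operatorMetricsEnabled: "false"
```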

The Helm metadata feature (Secret informer for sh.helm.release.v1) probably
shouldn't be in the startup critical path when the permission is missing.
Ideally it would fail open like it did in 1.23.1.

Metadata

Labels

bug (Something isn't working), pending
