
[DatadogSLO] Duplicate SLOs created when status update encounters conflict #2652

@samuelbaena

Description

When using the Datadog operator alongside another controller (Crossplane in our case) that manages the same DatadogSLO resources, duplicate SLOs can be created in Datadog.

The issue appears to occur when the operator successfully creates an SLO in Datadog but then fails to update the CR status because of a Kubernetes optimistic concurrency conflict. When the reconcile is retried, Status.ID is still empty, so the operator creates another SLO.

Environment

  • Datadog Operator version: 1.23.1
  • Kubernetes version: 1.31
  • Other controller: Crossplane 1.18 (managing DatadogSLO as a composed resource)

Steps to Reproduce

  1. Use Crossplane (or another controller) to create a DatadogSLO resource
  2. Both Crossplane and the Datadog operator reconcile the CR simultaneously
  3. Crossplane updates metadata (labels, ownerReferences) while the Datadog operator tries to update status

Expected Behavior

One SLO is created in Datadog per DatadogSLO CR.

Actual Behavior

Two SLOs are created in Datadog for a single DatadogSLO CR.

Operator Logs

{"level":"INFO","ts":"<timestamp>.424Z","msg":"Created a new DatadogSLO","SLO ID":"<slo-id-1>"}
{"level":"ERROR","ts":"<timestamp>.427Z","msg":"unable to update DatadogSLO status due to update conflict","error":"Operation cannot be fulfilled on datadogslos.datadoghq.com: the object has been modified; please apply your changes to the latest version and try again"}
{"level":"INFO","ts":"<timestamp>.688Z","msg":"Created a new DatadogSLO","SLO ID":"<slo-id-2>"}

The first SLO was created successfully, but the status update failed 3 ms later. On retry, 261 ms after the failed status update (264 ms after the first create), a second SLO was created.

Race Condition Flow

Based on the logs, this appears to be the sequence:

1. Crossplane creates DatadogSLO CR
2. Datadog operator reconciles and creates SLO in Datadog API (succeeds)
3. Datadog operator tries to update CR status with the new SLO ID
4. MEANWHILE: Crossplane updates the CR (adds ownerReferences, labels) → bumps resourceVersion
5. Datadog operator's status update fails: "the object has been modified"
6. On retry, operator sees status.ID == "" → creates ANOTHER SLO in Datadog
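The sequence above can be reproduced with a small, self-contained Python model. Everything in it is hypothetical: `FakeAPIServer`, `FakeDatadogAPI`, `SLOResource`, and `reconcile` are stand-ins written for illustration, not the operator's code or the real Kubernetes and Datadog clients. The fake API server enforces only the one rule that matters here: a write carrying a stale resource version is rejected.

```python
import itertools
from dataclasses import dataclass, field


class ConflictError(Exception):
    """Stands in for the Kubernetes 409 "the object has been modified" error."""


@dataclass
class SLOResource:
    # Minimal stand-in for a DatadogSLO CR: only the fields this race needs.
    resource_version: int = 1
    labels: dict = field(default_factory=dict)
    status_id: str = ""


class FakeAPIServer:
    """Holds one CR and enforces optimistic concurrency on resource_version."""

    def __init__(self):
        self.obj = SLOResource()

    def get(self):
        # Return a copy, like a fresh GET from a client.
        return SLOResource(self.obj.resource_version, dict(self.obj.labels), self.obj.status_id)

    def update_metadata(self, labels):
        # Another controller (e.g. Crossplane) writes metadata; the version is bumped.
        self.obj.labels.update(labels)
        self.obj.resource_version += 1

    def update_status(self, obj):
        if obj.resource_version != self.obj.resource_version:
            raise ConflictError("the object has been modified")
        self.obj.status_id = obj.status_id
        self.obj.resource_version += 1


class FakeDatadogAPI:
    """Hypothetical SLO API: every create call returns a brand-new SLO ID."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.slos = []

    def create_slo(self):
        slo_id = f"slo-id-{next(self._ids)}"
        self.slos.append(slo_id)
        return slo_id


def reconcile(server, api, interleave=None):
    """Naive flow: create an SLO when status.ID is empty, then write the status."""
    obj = server.get()
    if obj.status_id:
        return
    obj.status_id = api.create_slo()  # step 2: succeeds in "Datadog"
    if interleave:
        interleave()                  # step 4: the other controller writes first
    server.update_status(obj)         # step 5: ConflictError, the ID is dropped


server, api = FakeAPIServer(), FakeDatadogAPI()
try:
    reconcile(server, api, interleave=lambda: server.update_metadata({"owner": "crossplane"}))
except ConflictError:
    pass  # the error is logged and the request is requeued
reconcile(server, api)  # step 6: status.ID is still "", so a second SLO is created
print(api.slos)  # ['slo-id-1', 'slo-id-2']
```

The final `print` shows two SLOs for a single CR, matching the two "Created a new DatadogSLO" log lines.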

The core issue seems to be that when the status update fails due to a conflict, the ID of the successfully created SLO is lost. On the next reconciliation, Status.ID is still empty, so the controller creates a new SLO instead of recognizing that one already exists.
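One possible mitigation, sketched in the same kind of hypothetical Python model (none of these names come from the operator), combines two ideas. First, on a conflict, re-read the latest object and re-apply only the status change, in the spirit of client-go's `retry.RetryOnConflict`. Second, make creation idempotent by tagging each SLO with something unique to the CR (here a hypothetical `k8s-uid:<uid>` tag) and looking for that tag before creating. The retry alone only narrows the window: if every attempt conflicts, or the operator restarts between the create call and the status write, the ID is still lost, so the lookup is the part that actually prevents duplicates. The sketch assumes the Datadog API can be queried by tag.

```python
import itertools


class ConflictError(Exception):
    """Stands in for the Kubernetes 409 "the object has been modified" error."""


class FakeAPIServer:
    """Hypothetical stand-in for one DatadogSLO CR with optimistic concurrency."""

    def __init__(self, concurrent_writes=0):
        self.rv, self.status_id = 1, ""
        # Number of status writes that another controller will race and win.
        self.concurrent_writes = concurrent_writes

    def get(self):
        return {"rv": self.rv, "status_id": self.status_id}

    def update_status(self, obj):
        if self.concurrent_writes:
            # Another controller (e.g. Crossplane) writes metadata first.
            self.concurrent_writes -= 1
            self.rv += 1
        if obj["rv"] != self.rv:
            raise ConflictError("the object has been modified")
        self.status_id = obj["status_id"]
        self.rv += 1


class FakeDatadogAPI:
    """Hypothetical SLO API that supports lookup by tag."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.slos = {}  # SLO ID -> set of tags

    def find_by_tag(self, tag):
        return next((i for i, tags in self.slos.items() if tag in tags), None)

    def create_slo(self, tags):
        slo_id = f"slo-id-{next(self._ids)}"
        self.slos[slo_id] = set(tags)
        return slo_id


def update_status_with_retry(server, slo_id, attempts=3):
    """On conflict, re-read the latest object and re-apply only status.ID."""
    for _ in range(attempts):
        obj = server.get()
        obj["status_id"] = slo_id
        try:
            server.update_status(obj)
            return True
        except ConflictError:
            continue
    return False  # still conflicting; the controller would requeue


def reconcile(server, api, cr_uid):
    if server.get()["status_id"]:
        return
    # Idempotency guard: reuse an SLO already tagged with this CR's UID
    # instead of creating a second one. The tag key is hypothetical.
    tag = f"k8s-uid:{cr_uid}"
    slo_id = api.find_by_tag(tag) or api.create_slo([tag])
    update_status_with_retry(server, slo_id)


# One lost race: the retry loop absorbs the conflict.
server, api = FakeAPIServer(concurrent_writes=1), FakeDatadogAPI()
reconcile(server, api, "uid-123")
print(server.status_id, sorted(api.slos))  # slo-id-1 ['slo-id-1']

# Every retry loses: status stays empty, but the next reconcile finds the tagged SLO.
server, api = FakeAPIServer(concurrent_writes=3), FakeDatadogAPI()
reconcile(server, api, "uid-123")
reconcile(server, api, "uid-123")
print(server.status_id, sorted(api.slos))  # slo-id-1 ['slo-id-1']
```

Another option on the Kubernetes side is to write the status with a merge patch instead of an update (for example controller-runtime's `Status().Patch` with `client.MergeFrom`). A plain merge patch does not carry `resourceVersion`, so it is not rejected just because another controller changed labels or ownerReferences. Like the retry, that removes the conflict in this scenario but does not cover a crash between creating the SLO and recording its ID.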

Impact

  • Severity: Low - SLOs are functional, but duplicates cause confusion
  • Workaround: Manually delete duplicate SLOs from Datadog UI

Additional Context

This may be related to how multi-controller environments interact with the operator. I'm happy to provide more details or test any fixes.
