Add DDA rollback functionality by khewonc · Pull Request #2838 · DataDog/datadog-operator

khewonc · 2026-03-26T20:43:53Z

What does this PR do?

Adds rollback functionality for DDAs with fleet experiments. Includes:

stop: trigger a rollback (for the eventual stopExperiment)
timeout: roll back after an experiment has been running for 15 min
abort: the user makes a manual change while the experiment is running (a manual change made at the same time as the timeout is ignored, because handling both cases would add complexity)
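
The three signals above can be sketched as a single decision function. This is a minimal illustration, not the operator's code: the `Experiment`, `Decision`, and `Decide` names are hypothetical, and it assumes that a manual change shows up as a DDA generation different from the one recorded on the experiment, and that the timeout is measured from when the experiment's revision was created.

```go
package main

import (
	"fmt"
	"time"
)

// All names below are hypothetical; they are not the operator's actual types.
type Phase string

const (
	PhaseRunning Phase = "running"
	PhaseStopped Phase = "stopped"
)

// Experiment mirrors the mocked status fields used in the test commands.
type Experiment struct {
	ID         string
	Phase      Phase
	Generation int64 // DDA generation recorded when the experiment started
}

// experimentTimeout is the 15 minute limit named in the PR description.
const experimentTimeout = 15 * time.Minute

type Decision string

const (
	DecisionNone     Decision = "none"
	DecisionRollback Decision = "rollback"
	DecisionAbort    Decision = "abort"
)

// Decide maps the stop, timeout, and abort signals onto an action.
// elapsed is the time since the experiment's revision was created (assumed).
func Decide(exp Experiment, currentGeneration int64, elapsed time.Duration) Decision {
	switch {
	case exp.Phase == PhaseStopped:
		return DecisionRollback // stop signal
	case exp.Phase != PhaseRunning:
		return DecisionNone // nothing in flight
	case elapsed >= experimentTimeout:
		return DecisionRollback // timeout is checked before a concurrent manual change
	case currentGeneration != exp.Generation:
		return DecisionAbort // manual change during the experiment
	default:
		return DecisionNone
	}
}

func main() {
	exp := Experiment{ID: "exp-1", Phase: PhaseRunning, Generation: 2}
	fmt.Println(Decide(exp, 2, 5*time.Minute))  // still running, no change
	fmt.Println(Decide(exp, 3, 5*time.Minute))  // manual change before the timeout
	fmt.Println(Decide(exp, 3, 16*time.Minute)) // manual change coinciding with the timeout
}
```

Checking the timeout before the generation comparison is what makes a simultaneous manual change lose to the timeout in this sketch.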

Motivation

https://datadoghq.atlassian.net/browse/CONTP-1404

Additional Notes

Anything else we should know when reviewing?

Minimum Agent Versions

Are there minimum versions of the Datadog Agent and/or Cluster Agent required?

Agent: vX.Y.Z
Cluster Agent: vX.Y.Z

Describe your test plan

TBA. Test commands for now:

Stop signal

# update dda spec to create new revision
kubectl patch dd <name> --type=merge -p '{"spec":{"global":{"tags":["env:test"]}}}'
# new experiment mock (generation should match dda generation after patch)
kubectl patch dd <name> --type=merge --subresource=status -p "{\"status\":{\"experiment\":{\"phase\":\"running\",\"generation\":2,\"id\":\"exp-1\"}}}"
# stop experiment mock
kubectl patch dd <name> --type=merge --subresource status --patch 'status: {experiment: {phase: stopped}}'

Timeout

# update dda spec to create new revision
kubectl patch dd <name> --type=merge -p '{"spec":{"global":{"tags":["env:test"]}}}'
# new experiment mock (generation should match dda generation after patch)
kubectl patch dd <name> --type=merge --subresource=status -p "{\"status\":{\"experiment\":{\"phase\":\"running\",\"generation\":2,\"id\":\"exp-1\"}}}"
# wait 15 min (starting from when the new revision was created)

Abort

# update dda spec to create new revision
kubectl patch dd <name> --type=merge -p '{"spec":{"global":{"tags":["env:test"]}}}'
# new experiment mock (generation should match dda generation after patch)
kubectl patch dd <name> --type=merge --subresource=status -p "{\"status\":{\"experiment\":{\"phase\":\"running\",\"generation\":2,\"id\":\"exp-1\"}}}"
# update dda spec again
kubectl patch dd <name> --type=merge -p '{"spec":{"global":{"tags":["env:foo"]}}}'

Checklist

PR has at least one valid label: bug, enhancement, refactoring, documentation, tooling, and/or dependencies
PR has a milestone or the qa/skip-qa label
All commits are signed (see: signing commits)

codecov-commenter · 2026-03-26T20:52:11Z

Codecov Report

❌ Patch coverage is 75.72254% with 42 lines in your changes missing coverage. Please review.
✅ Project coverage is 40.15%. Comparing base (a1206ff) to head (6106687).
⚠️ Report is 9 commits behind head on main.

Files with missing lines	Patch %	Lines
internal/controller/datadogagent_controller.go	41.02%	17 Missing and 6 partials ⚠️
internal/controller/datadogagent/experiment.go	84.78%	7 Missing and 7 partials ⚠️
...controller/datadogagent/controller_reconcile_v2.go	16.66%	2 Missing and 3 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2838      +/-   ##
==========================================
+ Coverage   38.94%   40.15%   +1.21%     
==========================================
  Files         313      315       +2     
  Lines       27139    27956     +817     
==========================================
+ Hits        10570    11227     +657     
- Misses      15780    15919     +139     
- Partials      789      810      +21

Flag	Coverage Δ
unittests	`40.15% <75.72%> (+1.21%)`	⬆️


Files with missing lines	Coverage Δ
internal/controller/datadogagent/controller.go	`94.73% <ø> (+1.87%)`	⬆️
...ler/datadogagent/controller_reconcile_v2_common.go	`33.90% <100.00%> (+0.28%)`	⬆️
...er/datadogagent/controller_reconcile_v2_helpers.go	`65.00% <100.00%> (+0.17%)`	⬆️
internal/controller/datadogagent/revision.go	`81.51% <100.00%> (+3.53%)`	⬆️
...controller/datadogagent/controller_reconcile_v2.go	`61.00% <16.66%> (-1.06%)`	⬇️
internal/controller/datadogagent/experiment.go	`84.78% <84.78%> (ø)`
internal/controller/datadogagent_controller.go	`59.54% <41.02%> (-7.13%)`	⬇️

... and 14 files with indirect coverage changes


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d358c9c2bd


internal/controller/datadogagent/experiment.go

internal/controller/datadogagent/controller_reconcile_v2.go

zhuminyi · 2026-03-27T19:08:00Z

internal/controller/datadogagent/experiment.go

+
+	ctx = ctrl.LoggerInto(ctx, ctrl.LoggerFrom(ctx).WithValues("experimentID", experiment.ID))
+
+	if err := r.handleRollback(ctx, instance, newStatus, now, revList); err != nil {


nit: It is better to call abortExperiment first to detect a user edit. If there is a user edit, the phase will be changed to aborted and the user's edit will be preserved (this is a narrow race-condition window).

I initially had it called first as an early return, but calling abortExperiment first currently makes the operator logs look like it's aborting when it's actually timing out. I ended up deciding to reorder rather than complicate the function.
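
The ordering trade-off in this thread can be illustrated with a small sketch. The function names `handleRollback` and `abortExperiment` come from the PR, but the bodies, the `state` struct, the log messages, and the `rollback` phase name are assumptions made up for illustration. Each step only acts while the phase is still `running`, so whichever runs first decides both the final phase and what the logs say when a timeout and a user edit coincide.

```go
package main

import "fmt"

// state is a hypothetical stand-in for the experiment status plus a log sink.
type state struct {
	phase      string
	timedOut   bool
	userEdited bool
	log        []string
}

// handleRollback (body assumed) rolls back a running experiment that timed out.
func handleRollback(s *state) {
	if s.phase == "running" && s.timedOut {
		s.log = append(s.log, "experiment timed out, rolling back")
		s.phase = "rollback"
	}
}

// abortExperiment (body assumed) aborts a running experiment after a user edit.
func abortExperiment(s *state) {
	if s.phase == "running" && s.userEdited {
		s.log = append(s.log, "manual change detected, aborting experiment")
		s.phase = "aborted"
	}
}

// reconcile runs the steps in the given order against the same state.
func reconcile(s *state, steps ...func(*state)) {
	for _, step := range steps {
		step(s)
	}
}

func main() {
	// Timeout and user edit land in the same reconcile.
	rollbackFirst := &state{phase: "running", timedOut: true, userEdited: true}
	reconcile(rollbackFirst, handleRollback, abortExperiment)
	fmt.Println(rollbackFirst.phase, rollbackFirst.log)

	abortFirst := &state{phase: "running", timedOut: true, userEdited: true}
	reconcile(abortFirst, abortExperiment, handleRollback)
	fmt.Println(abortFirst.phase, abortFirst.log)
}
```

With the rollback step first, the coinciding user edit is ignored and the logs mention only the timeout; with the abort step first, the edit is preserved but the logs read as an abort.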

arbll · 2026-04-01T14:15:20Z

internal/controller/datadogagent/controller_reconcile_v2.go

+		if err := r.manageExperiment(ctx, instance, newDDAStatus, now, revList); err != nil {
+			return r.updateStatusIfNeededV2(logger, instance, newDDAStatus, result, err, now)
+		}
+		if err := r.manageRevision(ctx, instance, revList, newDDAStatus); err != nil {
+			return r.updateStatusIfNeededV2(logger, instance, newDDAStatus, result, err, now)
+		}


Is this safe to do in two steps? What if the second step fails after the first step succeeded?

Like, won't this prevent rollbacks after apply?

It should be fine to do in two steps. The actual rollback is handled in manageExperiment, so it won't prevent experiment rollbacks. We don't allow user-initiated manual rollbacks, so there are no issues there. There is one bug, though: after a rollback, if the user tries to apply the same change again, the operator immediately rolls back, so it looks like there was no change. I'll add a fix for that.
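
A simplified sketch of the two-step flow discussed above, under stated assumptions: the `status` struct, the `step` type, and the three step functions are hypothetical, and the returned `status` stands in for what the status update on each exit path would persist. It only illustrates that a rollback decided in the first step survives a failure in the second; it does not model the re-apply bug or its fix.

```go
package main

import (
	"errors"
	"fmt"
)

// status is a hypothetical, reduced stand-in for the DDA status.
type status struct {
	experimentPhase string
	revisionSynced  bool
}

// step is the assumed shape of manageExperiment and manageRevision here.
type step func(*status) error

// rollbackExperiment simulates the first step deciding to roll back.
func rollbackExperiment(st *status) error {
	st.experimentPhase = "rollback"
	return nil
}

// syncRevision simulates a successful second step.
func syncRevision(st *status) error {
	st.revisionSynced = true
	return nil
}

// failingRevision simulates the second step failing.
func failingRevision(st *status) error {
	return errors.New("revision update failed")
}

// reconcile runs both steps and returns the status that would be persisted.
// Every exit path returns the accumulated status, mirroring how both error
// branches in the diff pass newDDAStatus to the status update.
func reconcile(st *status, manageExperiment, manageRevision step) (status, error) {
	if err := manageExperiment(st); err != nil {
		return *st, err
	}
	if err := manageRevision(st); err != nil {
		return *st, err
	}
	return *st, nil
}

func main() {
	persisted, err := reconcile(&status{experimentPhase: "running"}, rollbackExperiment, failingRevision)
	fmt.Println(persisted.experimentPhase, persisted.revisionSynced, err)

	persisted, err = reconcile(&status{experimentPhase: "running"}, rollbackExperiment, syncRevision)
	fmt.Println(persisted.experimentPhase, persisted.revisionSynced, err)
}
```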

khewonc added 4 commits March 26, 2026 16:37

Add dd annotations and experiment phase as predicates

e9a4fa4

Add experiment status

5c6deda

Pass revision list instead of entire object

7474895

Add rollback functionality

b6980bf

khewonc added this to the v1.26.0 milestone Mar 26, 2026

khewonc added the enhancement New feature or request label Mar 26, 2026

github-actions bot added team/container-platform team/container-autoscaling labels Mar 26, 2026

make generate manifests

d358c9c

khewonc marked this pull request as ready for review March 27, 2026 16:31

khewonc requested a review from a team March 27, 2026 16:31

khewonc requested a review from a team as a code owner March 27, 2026 16:31

chatgpt-codex-connector bot reviewed Mar 27, 2026


zhuminyi reviewed Mar 27, 2026


zhuminyi approved these changes Mar 30, 2026


Address review suggestions

7814314

arbll reviewed Apr 1, 2026


Allow applying same change after rollback

6106687

khewonc wants to merge 7 commits into main from khewonc/rollback


4 participants

