[MISC] Fix wrong waiting logic in self-healing test, fix misuses of `wait_for_unit_status` by astrojuanlu · Pull Request #264 · canonical/mysql-operators

astrojuanlu · 2026-04-23T11:20:20Z

Issue

This test has always been broken; see https://canonical.github.io/mysql-operators/89/#suites/39f0662952b7076f02d119b4ba71c27e/47b6f482785863e3/history

With these changes, it passes locally.

Note that there are two things:

The waiting logic was wrong. Previously it checked the app status, but we actually want to check the workload status.
Still, the logic is prone to race conditions. I added `successes=1` to mitigate that. Luckily, the charm code is not that fast, so this should be fine, but it isn't bulletproof.
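
To illustrate the race described above, here is a minimal sketch of a polling loop that requires a number of consecutive matches. The `wait` function and the lists of polled statuses are hypothetical stand-ins written for this example, not the real Jubilant or test-helper code, and the default of 3 is likewise an assumption.

```python
from typing import Callable, Iterable


def wait(
    statuses: Iterable[str],
    ready: Callable[[str], bool],
    successes: int = 3,  # assumed default, for illustration only
) -> bool:
    """Return True once `ready` holds for `successes` consecutive polls."""
    streak = 0
    for status in statuses:
        # Any non-matching poll resets the streak.
        streak = streak + 1 if ready(status) else 0
        if streak >= successes:
            return True
    return False  # polls exhausted, analogous to a timeout


def is_waiting(status: str) -> bool:
    return status == "waiting"


# "waiting" is only visible during a single poll.
polls = ["maintenance", "waiting", "active", "active"]
caught_with_default = wait(polls, is_waiting)
caught_with_one = wait(polls, is_waiting, successes=1)

# The transient status falls between two polls and is never observed.
missed_entirely = wait(["maintenance", "active", "active"], is_waiting, successes=1)
```

With `successes=1` a single poll landing on the transient status is enough. If no poll lands on it, the condition is still missed, which is why this only lowers the probability of the race rather than removing it.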

Solution

Checklist

I have added or updated any relevant documentation.
I have cleaned any remaining cloud resources from my accounts.

astrojuanlu · 2026-04-23T11:25:08Z

Oh btw, months ago I introduced some subtle bugs in our tests... pushed 1 more commit that fixes those.

…it_status misuses

Agent-Logs-Url: https://github.com/canonical/mysql-operators/sessions/9dc045f5-e998-405f-918e-299a250ed5ef
Co-authored-by: astrojuanlu <316517+astrojuanlu@users.noreply.github.com>

astrojuanlu · 2026-04-23T14:36:13Z

For full context, @sinclert-canonical pointed out that he had already tried #229 + #230, but tests weren't passing on 8.0/edge. I contend that `successes=1` is the key part here: it decreases the probability of a race condition.

Test passing on 8.4 (this PR):

And on 8.0 (#265):

So I think this and the companion PR are safe to merge.

sinclert-canonical

Thanks for fixing the role tests!

Could you try reverting those tests' wait conditions so that they check the app status again, to see if it works? AFAIK, the approach was changed in this PR because of a Juju issue. Maybe that got solved 🤷🏻‍♂️

sinclert-canonical · 2026-04-23T15:08:55Z

+    # NOTE: This is prone to race conditions:
+    # if the units clear the "waiting" phase too quickly,
+    # this status function will never activate


No need to leave a comment if we are going to fix the problem here and now.

sinclert-canonical · 2026-04-23T15:13:41Z

+        ready=lambda status: any((
+            *(
+                wait_for_unit_status(MYSQL_APP_NAME, unit_name, "waiting")(status)
+                for unit_name in status.get_units(MYSQL_APP_NAME)
+            ),
+        )),


I am not sure that having any unit in the waiting status is the required condition for this test to validate behavior upon cluster setup failure. There is a reason why this wait condition initially targeted the application: we need to make sure no unit has actually set up the cluster.

Therefore, I think there are 3 possible ways to achieve this:

A) Wait for maintenance status at the app level.

B) Wait for maintenance status in all the units.

C) Wait for waiting status in all the units.

Given how brief (+ juju-controller dependent) the waiting status is, I would argue A or B are best.
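
The three options, next to the `any`-based condition from the diff, could be sketched as plain predicates. `FakeStatus` and this `wait_for_unit_status` are simplified, hypothetical stand-ins (the real helper and status object may differ); only the call shape `wait_for_unit_status(app, unit, expected)(status)` is taken from the diff.

```python
from dataclasses import dataclass

MYSQL_APP_NAME = "mysql"  # placeholder value for this sketch


@dataclass
class FakeStatus:
    """Hypothetical stand-in for the status object passed to `ready`."""

    app_status: str
    unit_statuses: dict[str, str]  # unit name -> workload status

    def get_units(self, app_name: str) -> dict[str, str]:
        return self.unit_statuses


def wait_for_unit_status(app_name, unit_name, expected):
    """Hypothetical stand-in: build a predicate over a status object."""
    def check(status):
        return status.get_units(app_name)[unit_name] == expected
    return check


def all_units(expected):
    """True only if every unit reports `expected`."""
    return lambda status: all(
        wait_for_unit_status(MYSQL_APP_NAME, unit, expected)(status)
        for unit in status.get_units(MYSQL_APP_NAME)
    )


def any_unit(expected):
    """True as soon as one unit reports `expected`."""
    return lambda status: any(
        wait_for_unit_status(MYSQL_APP_NAME, unit, expected)(status)
        for unit in status.get_units(MYSQL_APP_NAME)
    )


def option_a(status):
    # A) maintenance status at the app level
    return status.app_status == "maintenance"


option_b = all_units("maintenance")  # B) maintenance in all the units
option_c = all_units("waiting")      # C) waiting in all the units
from_diff = any_unit("waiting")      # condition shown in the diff

status = FakeStatus(
    app_status="maintenance",
    unit_statuses={
        "mysql/0": "waiting",
        "mysql/1": "maintenance",
        "mysql/2": "maintenance",
    },
)
```

In this made-up snapshot the `any`-based condition is already satisfied by a single unit, while B and C are not, which is the difference the review comment is pointing at.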

astrojuanlu · 2026-04-23T16:41:09Z

juju/juju#22307 makes it more difficult to reliably debug this issue. Moreover, milliseconds aren't shown...

Fix wrong waiting logic in self-healing test

4f9f1b3

astrojuanlu requested review from paulomach and sinclert-canonical April 23, 2026 11:20

astrojuanlu mentioned this pull request Apr 23, 2026

[DPE-7908] Separation of storage (take 2) #257

Merged

2 tasks

github-actions bot added the "Libraries: Out of sync" label (The charm libs used are out-of-sync) Apr 23, 2026

astrojuanlu added the "not bug or enhancement" label (PR is not 'bug' or 'enhancement'. For release notes) Apr 23, 2026

Fix misuse of wait_for_unit_status

f563a84

astrojuanlu changed the title ~~[MISC] Fix wrong waiting logic in self-healing test~~ [MISC] Fix wrong waiting logic in self-healing test, fix misuses of wait_for_unit_status Apr 23, 2026

Copilot AI mentioned this pull request Apr 23, 2026

[MISC] Fix wrong waiting logic in self-healing test, fix misuses of wait_for_unit_status (backport to 8.0/edge) #265

Open

2 tasks

sinclert-canonical reviewed Apr 23, 2026

astrojuanlu wants to merge 2 commits into 8.4/edge from juanlu/fix-setup-crash-test-k8s

2 participants
