[MISC] Fix wrong waiting logic in self-healing test, fix misuses of wait_for_unit_status #264

Open
astrojuanlu wants to merge 2 commits into 8.4/edge from juanlu/fix-setup-crash-test-k8s

@astrojuanlu
Contributor

Issue

This test has always been broken, see https://canonical.github.io/mysql-operators/89/#suites/39f0662952b7076f02d119b4ba71c27e/47b6f482785863e3/history

With these changes, it passes locally.

Note two things:

  • The waiting logic was wrong. Previously it checked the app status, but we actually want to check the workload status.
  • Still, the logic is prone to race conditions. I added successes=1 to mitigate that. Luckily, the charm code is not that fast, so this should be fine, but it isn't bullet-proof.
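
To illustrate the race condition described above, here is a minimal sketch of a polling wait that requires a number of consecutive successful checks. The `wait` helper and the lists of observed statuses are hypothetical stand-ins written for this example; they are not the test suite's actual helpers, and only the `successes` idea is taken from the PR.

```python
from typing import Callable, Iterable


def wait(snapshots: Iterable[str], ready: Callable[[str], bool], successes: int) -> bool:
    """Return True once `ready` holds for `successes` consecutive snapshots.

    Hypothetical helper: each snapshot stands for what one poll observed.
    """
    streak = 0
    for snapshot in snapshots:
        # A failed check resets the streak, so a short-lived state
        # can never accumulate several consecutive successes.
        streak = streak + 1 if ready(snapshot) else 0
        if streak >= successes:
            return True
    return False  # ran out of polls, which stands in for a timeout


def is_waiting(status: str) -> bool:
    return status == "waiting"


# "waiting" is visible for exactly one poll.
brief = ["maintenance", "waiting", "active", "active"]
# "waiting" came and went between two polls, so it is never observed.
too_fast = ["maintenance", "active", "active"]

print(wait(brief, is_waiting, successes=3))     # False
print(wait(brief, is_waiting, successes=1))     # True
print(wait(too_fast, is_waiting, successes=1))  # False
```

Requiring a single success catches a state that is visible for one poll, but nothing catches a state that falls entirely between polls, which is why the mitigation is not bullet-proof.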

Solution

Checklist

  • I have added or updated any relevant documentation.
  • I have cleaned any remaining cloud resources from my accounts.

@github-actions github-actions Bot added the "Libraries: Out of sync" label (The charm libs used are out-of-sync) Apr 23, 2026
@astrojuanlu astrojuanlu added the "not bug or enhancement" label (PR is not 'bug' or 'enhancement'. For release notes) Apr 23, 2026
@astrojuanlu
Contributor Author

Oh btw, months ago I introduced some subtle bugs in our tests... pushed 1 more commit that fixes those.

@astrojuanlu astrojuanlu changed the title from "[MISC] Fix wrong waiting logic in self-healing test" to "[MISC] Fix wrong waiting logic in self-healing test, fix misuses of wait_for_unit_status" Apr 23, 2026
Copilot AI added a commit that referenced this pull request Apr 23, 2026
…it_status misuses

Agent-Logs-Url: https://github.com/canonical/mysql-operators/sessions/9dc045f5-e998-405f-918e-299a250ed5ef

Co-authored-by: astrojuanlu <316517+astrojuanlu@users.noreply.github.com>
Contributor

@sinclert-canonical sinclert-canonical left a comment

Thanks for fixing the role tests!

Could you try reverting those tests' wait conditions to check for the app status, to see if it works? AFAIK, the approach was changed in this PR because of a Juju issue. Maybe that got solved 🤷🏻‍♂️

Comment on lines +56 to +58
# NOTE: This is prone to race conditions:
# if the units clear the "waiting" phase too quickly,
# this status function will never activate
Contributor

No need to leave a comment if we are going to fix the problem here and now.

Comment on lines +62 to +67
ready=lambda status: any((
    *(
        wait_for_unit_status(MYSQL_APP_NAME, unit_name, "waiting")(status)
        for unit_name in status.get_units(MYSQL_APP_NAME)
    ),
)),
Contributor

I am not sure that having any unit in the waiting status is the condition this test needs in order to validate behavior upon cluster setup failure. There is a reason why this wait condition initially targeted the application: we need to make sure no unit has actually set up the cluster.

Therefore, I think there are 3 possible ways to achieve this:

  • A) Wait for maintenance status at the app level.
  • B) Wait for maintenance status in all the units.
  • C) Wait for waiting status in all the units.

Given how brief (+ juju-controller dependent) the waiting status is, I would argue A or B are best.
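
For reference, the difference between the condition in the diff ("any unit is waiting") and options A, B and C can be sketched as plain predicates over a status snapshot. The `AppSnapshot` class and the predicate names below are hypothetical and exist only for this illustration; they do not mirror the Juju status model or the helpers used in the test suite.

```python
from dataclasses import dataclass


@dataclass
class AppSnapshot:
    """Hypothetical, simplified view of one application's status."""
    app_status: str
    unit_statuses: dict[str, str]  # unit name -> workload status


def any_unit_is(snapshot: AppSnapshot, expected: str) -> bool:
    # Condition in the diff: a single matching unit is enough.
    return any(s == expected for s in snapshot.unit_statuses.values())


def all_units_are(snapshot: AppSnapshot, expected: str) -> bool:
    # Options B and C: every unit must match (an empty app never matches).
    statuses = snapshot.unit_statuses.values()
    return bool(statuses) and all(s == expected for s in statuses)


def app_is(snapshot: AppSnapshot, expected: str) -> bool:
    # Option A: look only at the application-level status.
    return snapshot.app_status == expected


# Assumed example values: one unit is waiting while the others are in maintenance.
snapshot = AppSnapshot(
    app_status="maintenance",
    unit_statuses={"mysql/0": "waiting", "mysql/1": "maintenance", "mysql/2": "maintenance"},
)

print(any_unit_is(snapshot, "waiting"))        # True
print(all_units_are(snapshot, "waiting"))      # False
print(all_units_are(snapshot, "maintenance"))  # False
print(app_is(snapshot, "maintenance"))         # True
```

With a mixed snapshot like this one, the "any unit" condition already passes, while the "all units" variants keep waiting until every unit agrees.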

@astrojuanlu
Contributor Author

juju/juju#22307 makes it more difficult to reliably debug this issue. Moreover, milliseconds aren't shown...
