[MISC] Fix wrong waiting logic in self-healing test, fix misuses of wait_for_unit_status#264
[MISC] Fix wrong waiting logic in self-healing test, fix misuses of wait_for_unit_status#264astrojuanlu wants to merge 2 commits into8.4/edgefrom
wait_for_unit_status#264Conversation
|
Oh btw, months ago I introduced some subtle bugs in our tests... pushed 1 more commit that fixes those. |
wait_for_unit_status
…it_status misuses Agent-Logs-Url: https://github.com/canonical/mysql-operators/sessions/9dc045f5-e998-405f-918e-299a250ed5ef Co-authored-by: astrojuanlu <316517+astrojuanlu@users.noreply.github.com>
|
For full context, @sinclert-canonical pointed out that he had already tried #229 + #230, but tests weren't passing on 8.0/edge. I contend the Test passing on 8.4 (this PR):
And on 8.0 (#265):
So I think this and the companion PR are safe to merge. |
sinclert-canonical
left a comment
There was a problem hiding this comment.
Thanks for fixing the role tests!
Could you try reverting those tests wait conditions, to check for the app status, to see if it works? AFAIK, the approach was changed in this PR because of a Juju issue. Maybe that got solved 🤷🏻♂️
| # NOTE: This is prone to race conditions: | ||
| # if the units clear the "waiting" phase too quickly, | ||
| # this status function will never activate |
There was a problem hiding this comment.
No need to leave a comment if we are going to fix the problem here and now.
| ready=lambda status: any(( | ||
| *( | ||
| wait_for_unit_status(MYSQL_APP_NAME, unit_name, "waiting")(status) | ||
| for unit_name in status.get_units(MYSQL_APP_NAME) | ||
| ), | ||
| )), |
There was a problem hiding this comment.
I am not sure having any unit in the waiting status is the required condition for this test to validate behavior upon cluster setup failure. There is a reason why this wait condition was initially targeting the application: we need to make sure no unit has actually setup the cluster.
Therefore, I think there are 3 possible ways to achieve this:
- A) Wait for
maintenancestatus at the app level. - B) Wait for
maintenancestatus in all the units. - C) Wait for
waitingstatus in all the units.
Given how brief (+ juju-controller dependent) the waiting status is, I would argue A or B are best.
|
juju/juju#22307 makes it more difficult to reliably debug this issue. Moreover, milliseconds arent' shown... |
Issue
This test has always been broken, see https://canonical.github.io/mysql-operators/89/#suites/39f0662952b7076f02d119b4ba71c27e/47b6f482785863e3/history
With these changes, it passes locally.
Notice there's 2 things:
successes=1to mitigate that. Luckily, the charm code is not that fast, so this should be fine, but it isn't bullet-proof.Solution
Checklist