@abg abg commented Jan 20, 2026

Previously, the failover test only ran "pre-start" and triggered the "start" phase of the bosh lifecycle, but monit jobs could still be coming online.

Now, the failover test runs "bosh cloud-check" with the "--resolution=recreate_vm" option to wait for the full bosh lifecycle, including the post-start job, to complete successfully.

In our test environments, we observed that "bosh cloud-check --resolution=recreate_vm" could sometimes flag multiple VMs with "unresponsive agent" and then, during the repair phase, find that the agent was actually responsive, causing bosh cloud-check to fail even though nothing was wrong. To avoid this problem, bosh cloud-check is run once to resolve the missing VM issue, with any errors ignored. A second run, "bosh cloud-check --report", then validates that there are no remaining deployment issues.
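
The two-pass approach described above could be sketched roughly as follows. This is an illustration only, not the code from this change: the helper names `run_command` and `repair_and_verify`, the injectable `run` parameter, and the deployment name in the usage note are all hypothetical. It also assumes that "bosh cloud-check --report" exits non-zero when problems remain.

```python
import subprocess
from typing import Callable, List

# A runner takes an argv list and returns the process exit code.
Runner = Callable[[List[str]], int]


def run_command(argv: List[str]) -> int:
    """Run a command and return its exit code without raising on failure."""
    return subprocess.run(argv, check=False).returncode


def repair_and_verify(deployment: str, run: Runner = run_command) -> bool:
    """Hypothetical helper: repair with errors ignored, then validate."""
    base = ["bosh", "--non-interactive", "--deployment", deployment, "cloud-check"]

    # Pass 1: recreate the missing VM. The exit code is deliberately ignored,
    # since this pass can fail spuriously on "unresponsive agent" reports.
    run(base + ["--resolution=recreate_vm"])

    # Pass 2: report-only run. Assumption: a non-zero exit code means
    # cloud-check still sees problems with the deployment.
    return run(base + ["--report"]) == 0
```

A test could then assert on the result, for example `assert repair_and_verify("my-deployment")`, where "my-deployment" is a placeholder. Passing the runner as a parameter keeps the sketch testable without a real BOSH director.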

TNZ-67462

[TNZ-67462](https://vmw-jira.broadcom.net/browse/TNZ-67462)

Authored-by: Andrew Garner <andrew.garner@broadcom.com>
@abg abg requested a review from kimago January 20, 2026 20:39
@abg abg merged commit b2691bf into main Jan 21, 2026
2 checks passed
@abg abg deleted the tnz-67462/fix-flaky-failover-test branch January 21, 2026 18:32