Description
So far we tested disabling DR when both the primary and secondary clusters are up. In a disaster use case we may need to disable DR when one of the clusters is not responsive. In this case we may not be able to clean up the cluster or even get its status using ManagedClusterView.
Simulating a non-responsive cluster is easy with virsh:

virsh -c qemu:///system suspend dr1

Recover the cluster:

virsh -c qemu:///system resume dr1

Tested during failover: suspend the cluster before failover, resume it after the application is running on the failover cluster.
Fix
Support marking a drcluster as unavailable (a sketch follows this list). When a cluster is marked unavailable:
- On the remaining managed cluster, do not access the s3 store of the unavailable cluster. This allows the vrg delete flow to complete
- On the hub, ignore the vrg from the unavailable managed cluster, so waiting for the vrg count to become zero succeeds and the drpc deletion flow completes
- On the hub, do not try to validate the unavailable drcluster
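A minimal sketch of how a cluster might be marked, assuming a hypothetical annotation on the drcluster resource (the annotation key below is an assumption, not the implemented API):

# Hypothetical: mark drcluster dr1 as unavailable on the hub
kubectl annotate drcluster dr1 drcluster.ramendr.openshift.io/unavailable="true"

The reconcilers would then check this marker and skip s3 access, vrg counting, and drcluster validation for dr1.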
Recommended flow
- Mark the cluster as unavailable
- Fail over the application to the good cluster
- Fix the drpolicy predicates if needed
- Delete the drpc
- Delete the policy annotation disabling OCM scheduling
- When DR has been disabled for all applications, delete the drpolicy referencing the unavailable drcluster and the drcluster resource.
- Replace the unavailable cluster
- Enable DR again for the applications
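A rough command-level sketch of this flow, using placeholder names (dr1 is the failed cluster, dr2 the good cluster, drpc-sample/app-sample the application and its namespace, app-placement its placement rule) and the hypothetical unavailable annotation from above; the OCM scheduling annotation key is also an assumption:

# 1. Mark the failed cluster as unavailable (hypothetical annotation)
kubectl annotate drcluster dr1 drcluster.ramendr.openshift.io/unavailable="true"

# 2. Fail over the application to the good cluster
kubectl patch drpc drpc-sample -n app-sample --type merge \
  -p '{"spec": {"action": "Failover", "failoverCluster": "dr2"}}'

# 3. After fixing the drpolicy predicates if needed, delete the drpc
kubectl delete drpc drpc-sample -n app-sample

# 4. Remove the annotation disabling OCM scheduling from the placement rule
kubectl annotate placementrule app-placement -n app-sample \
  cluster.open-cluster-management.io/experimental-scheduling-disable-

# 5. When DR is disabled for all applications, delete the drpolicy and the drcluster
kubectl delete drpolicy dr-policy
kubectl delete drcluster dr1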
Alternative flow
If the user forgets to mark a cluster as unavailable before disabling DR, disabling DR will get stuck:
- The vrg on the remaining cluster will have the s3 profile name for the unavailable cluster, so it will be stuck in a retry loop trying to access the unavailable s3 store
- The drpc on the hub will be stuck waiting for the stuck vrg and for the stale vrg from the unavailable cluster reported by managedclusterview
Marking the cluster as unavailable should fix the issue, but may require more manual work:
- Fail over the application to the good cluster
- Fix the drpolicy predicates if needed
- Delete the drpc - this will be stuck because the cluster is unavailable
- Mark the drcluster as unavailable so that the drpc deletion can finish
- Delete the policy annotation disabling OCM scheduling
- When DR has been disabled for all applications, delete the drpolicy referencing the unavailable drcluster and the drcluster resource.
- Replace the unavailable cluster
- Enable DR again for the applications
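To confirm the stuck state before marking the cluster, the vrg on the remaining managed cluster and the drpc on the hub can be inspected (names are placeholders):

# On the remaining managed cluster: the vrg conditions should show it retrying the unavailable s3 store
kubectl get vrg -n app-sample -o yaml

# On the hub: the drpc stays in the deletion flow until the vrg count reaches zero
kubectl get drpc drpc-sample -n app-sample -o yaml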
Issues:
- After deleting the drpc and the vrg manifestwork, changes in the manifestwork spec are not propagated to the managed cluster
- We may need to edit the vrg and remove the s3 profile name for the bad cluster (a sketch follows)
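A sketch of that manual edit, assuming the s3 profile names are listed in the VRG spec under s3Profiles and using a placeholder namespace:

# On the remaining managed cluster, edit the vrg and remove the s3 profile
# pointing to the bad cluster from spec.s3Profiles
kubectl edit vrg -n app-sample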
Tasks
- Test primary cluster failure: failover + disable dr
- Test secondary cluster failure: deploy + disable dr
- Any change compared to the first case?
- Support marking a drcluster as unavailable
- Skip unavailable drcluster when reconciling drcluster
- When creating VRG for manifestwork, include only s3 profiles from available drclusters
- When waiting for vrg count to become zero, ignore vrgs from unavailable drclusters
- Document replace cluster flow in docs/replace-cluster.md
Similar k8s flows: