Description
So far we tested disabling DR when both the primary and secondary clusters are up. In a disaster use case we may need to disable DR when one of the clusters is not responsive. In this case we may not be able to clean up the cluster or even get its status using ManagedClusterView.
Simulating a non-responsive cluster is easy with virsh:

virsh -c qemu:///system suspend dr1

Recover the cluster:

virsh -c qemu:///system resume dr1

Tested during failover: suspend the cluster before failover, resume it after the application is running on the failover cluster.
Fix
Support marking a drcluster as unavailable (a sketch follows this list). When a cluster is marked unavailable:
- On the remaining managed cluster, do not access the s3 store of the unavailable cluster. This allows the vrg delete flow to complete
- On the hub, ignore the vrg from the unavailable managed cluster, so waiting for the vrg count to become zero succeeds and the drpc deletion flow completes
- On the hub, do not try to validate the unavailable drcluster
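A minimal sketch of how a cluster might be marked, assuming a hypothetical annotation on the drcluster resource (the annotation key below is an assumption, not the implemented API):

# Hypothetical: mark drcluster dr1 as unavailable on the hub
kubectl annotate drcluster dr1 drcluster.ramendr.openshift.io/unavailable="true"

The reconcilers would then check this marker and skip s3 access, vrg counting, and drcluster validation for dr1.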
Recommended flow
- Mark the cluster as unavailable
- Fail over the application to the good cluster
- Fix the drpolicy predicates if needed
- Delete the drpc
- Delete the policy annotation disabling OCM scheduling
- When DR has been disabled for all applications, delete the drpolicy referencing the unavailable drcluster and the drcluster resource.
- Replace the unavailable cluster
- Enable DR again for the applications
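A rough command-level sketch of this flow, using placeholder names (dr1 is the failed cluster, dr2 the good cluster, drpc-sample/app-sample the application and its namespace, app-placement its placement rule) and the hypothetical unavailable annotation from above; the OCM scheduling annotation key is also an assumption:

# 1. Mark the failed cluster as unavailable (hypothetical annotation)
kubectl annotate drcluster dr1 drcluster.ramendr.openshift.io/unavailable="true"

# 2. Fail over the application to the good cluster
kubectl patch drpc drpc-sample -n app-sample --type merge \
  -p '{"spec": {"action": "Failover", "failoverCluster": "dr2"}}'

# 3. After fixing the drpolicy predicates if needed, delete the drpc
kubectl delete drpc drpc-sample -n app-sample

# 4. Remove the annotation disabling OCM scheduling from the placement rule
kubectl annotate placementrule app-placement -n app-sample \
  cluster.open-cluster-management.io/experimental-scheduling-disable-

# 5. When DR is disabled for all applications, delete the drpolicy and the drcluster
kubectl delete drpolicy dr-policy
kubectl delete drcluster dr1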
Alternative flow
If the user forgets to mark a cluster as unavailable before disabling DR, disabling DR will get stuck:
- The vrg on the remaining cluster will have the s3 profile name for the unavailable cluster, so it will be stuck in a retry loop trying to access the unavailable s3 store
- The drpc on the hub will be stuck waiting for the stuck vrg and for the stale vrg from the unavailable cluster reported by managedclusterview
Marking the cluster as unavailable should fix the issue, but may require more manual work:
- Fail over the application to the good cluster
- Fix the drpolicy predicates if needed
- Delete the drpc - this will be stuck because the cluster is unavailable
- Mark the drcluster as unavailable so that the drpc deletion can finish
- Delete the policy annotation disabling OCM scheduling
- When DR has been disabled for all applications, delete the drpolicy referencing the unavailable drcluster and the drcluster resource.
- Replace the unavailable cluster
- Enable DR again for the applications
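To confirm the stuck state before marking the cluster, the vrg on the remaining managed cluster and the drpc on the hub can be inspected (names are placeholders):

# On the remaining managed cluster: the vrg conditions should show it retrying the unavailable s3 store
kubectl get vrg -n app-sample -o yaml

# On the hub: the drpc stays in the deletion flow until the vrg count reaches zero
kubectl get drpc drpc-sample -n app-sample -o yaml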
Issues:
- After deleting the drpc and the vrg manifestwork, changes in the manifestwork spec are not propagated to the managed cluster
- We may need to edit the vrg and remove the s3 profile name for the bad cluster (a sketch follows)
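A sketch of that manual edit, assuming the s3 profile names are listed in the VRG spec under s3Profiles and using a placeholder namespace:

# On the remaining managed cluster, edit the vrg and remove the s3 profile
# pointing to the bad cluster from spec.s3Profiles
kubectl edit vrg -n app-sample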
Tasks
- Test primary cluster failure: failover + disable dr
- Test secondary cluster failure: deploy + disable dr
- Any change compared to the first case?
- Support marking a drcluster as unavailable
- Skip unavailable drcluster when reconciling drcluster
- When creating VRG for manifestwork, include only s3 profiles from available drclusters
- When waiting for vrg count to become zero, ignore vrgs from unavailable drclusters
- Document replace cluster flow in docs/replace-cluster.md
Similar k8s flows: