PV stuck in released state after uninstalling linstor cluster #246
Description
PV stuck in Released state after uninstall (Linstor CSI)
Summary
When uninstalling workloads that use Linstor CSI (piraeus-storage StorageClass), a PersistentVolume remains in Released state. The Linstor CSI controller reports VolumeFailedDelete with tie-breaker and DRBD meta-data errors. Our uninstall runs DRBD cleanup (drbdsetup down) after deleting the PVCs (including the one used by NFS server and NFS provisioner). We would like to know if this order is correct and what the recommended sequence is.
Environment
- Storage: Linstor CSI (piraeus-storage StorageClass).
- Reclaim policy: Delete.
- Cluster: Multi-node; Linstor satellites on several nodes.
- Affected PVC: A PVC used by an NFS server (backed by Piraeus Storage Class). Another PVC is used by an NFS subdir provisioner.
Observed behaviour
- After deployment, the PV is Bound, but the pod that mounts it fails to attach the volume:
```
Normal   Scheduled           2m11s                default-scheduler        Successfully assigned nsp-psa-privileged/nfs-server-5c98544f8-fn4dp to leek-node5
Warning  FailedAttachVolume  64s (x8 over 2m11s)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-6f162b88-f2a7-47f7-b0ee-2b3b92241c19" : rpc error: code = Internal desc = ControllerPublishVolume failed for pvc-6f162b88-f2a7-47f7-b0ee-2b3b92241c19: could not determine device path
```
- After uninstall, the PV moves to Released and never gets deleted.
- `kubectl describe pv` shows repeated VolumeFailedDelete events from the Linstor CSI controller.
- Re-deploying and uninstalling again reproduces the issue; the PV remains Released.
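As background on how we investigated the attach error, this is roughly the check we run. It is only a sketch: the namespace, the `linstor-controller` deployment name, and the `DRY_RUN` wrapper are our assumptions for a Piraeus install, not part of Piraeus itself; with `DRY_RUN=1` (the default here) it only prints the commands.

```shell
# Sketch: ask LINSTOR where the volume is deployed and whether all
# satellites are online when "could not determine device path" appears.
# NS and the controller deployment name are assumptions for our install.
NS=piraeus
PV=pvc-6f162b88-f2a7-47f7-b0ee-2b3b92241c19

linstor_cmd() {
  if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ linstor $*"; return 0; fi
  kubectl -n "$NS" exec deploy/linstor-controller -- linstor "$@"
}

inspect_attach_failure() {
  linstor_cmd resource list -r "$PV"  # per-node resource state
  linstor_cmd volume list -r "$PV"    # device path and disk state per volume
  linstor_cmd node list               # an OFFLINE satellite would explain a missing device path
}
```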
Uninstall order we use (original)
We uninstall in this order:
1. Delete application workloads and their PVCs (except the NFS server PVC and the NFS provisioner PVC).
2. Wait for those PVs to be cleaned up.
3. Delete the NFS server and NFS provisioner workloads (the pods are gone; the Linstor-backed PVCs are no longer in use).
4. Delete all remaining PVCs, including the NFS server PVC and the NFS provisioner PVC.
5. DRBD cleanup: for each Linstor satellite pod, run `drbdsetup status`, then `drbdsetup down <resource>` for each `pvc-*` resource shown. Remove the lost-quorum taint. Wait until no `pvc-*` DRBD resources remain.
6. Linstor cleanup: remove the Linstor cluster and the Piraeus operator (the CSI controller is removed in this step).
So we run DRBD cleanup (step 5) after we have already deleted the NFS server and NFS provisioner PVCs (step 4).
Question: is this order correct? Should DRBD cleanup (`drbdsetup down`) run before we delete the NFS server PVC and the NFS provisioner PVC, or is it correct to run it after those PVCs are deleted? What is the recommended uninstall sequence when using Linstor CSI with workloads (e.g. an NFS server) that use Linstor-backed PVCs?
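For concreteness, the order above corresponds roughly to this script. All names (namespaces, workload names, the PVC label selector, the cluster resource in step 6) are placeholders for our real manifests; with `DRY_RUN=1` (the default here) it only prints the commands it would run.

```shell
# Sketch of our current uninstall order (steps 1-6 above). All names are
# placeholders; DRY_RUN=1 (the default) prints commands instead of running them.
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ $*"; return 0; fi
  "$@"
}

uninstall() {
  # 1-2. Application workloads and their PVCs, then wait for the PVs to go.
  run kubectl -n app delete deploy,sts --all
  run kubectl -n app delete pvc -l 'role!=nfs'   # placeholder selector
  run kubectl wait --for=delete pv --all --timeout=300s
  # 3. NFS server and NFS provisioner workloads.
  run kubectl -n nsp-psa-privileged delete deploy nfs-server nfs-provisioner
  # 4. All remaining PVCs, including the NFS ones.
  run kubectl -n nsp-psa-privileged delete pvc --all
  # 5. DRBD cleanup on each satellite pod: status, then down per pvc-* resource.
  run kubectl -n piraeus exec linstor-satellite.node-a -- drbdsetup status
  # ... drbdsetup down <resource> for each pvc-* resource shown, per satellite ...
  # 6. Remove the LINSTOR cluster / Piraeus operator (CSI controller included).
  run kubectl delete linstorcluster linstorcluster
}
```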
PV describe output (events)
```
Warning  VolumeFailedDelete  37m  linstor.csi.linbit.com_linstor-csi-controller-...  rpc error: code = Internal desc = failed to delete volume: Message: 'Tie breaker marked for deletion' next error: Message: 'Node: node-a, Resource: pvc-6f162b88-f2a7-47f7-b0ee-2b3b92241c19 preparing for deletion.'; Details: 'Node: node-a, Resource: pvc-6f162b88-f2a7-47f7-b0ee-2b3b92241c19 UUID is: ed46abec-7a4e-4f9c-b41e-3d27866994a0' next error: Message: 'Preparing deletion of resource on 'node-a'' next error: Message: '(node-b) Failed to create meta-data for DRBD volume pvc-6f162b88-f2a7-47f7-b0ee-2b3b92241c19/0'; Reports: '[69A55B3B-D89E6-000019]' next error: Message: '(node-c) Failed to create meta-data for DRBD volume pvc-6f162b88-f2a7-47f7-b0ee-2b3b92241c19/0'; Reports: '[69A55B3E-AFF24-000020]' next error: Message: 'Deletion of resource 'pvc-6f162b88-f2a7-47f7-b0ee-2b3b92241c19' on node 'node-a' failed due to an unhandled exception of type DelayedApiRcException. Exceptions have been converted to responses'; Details: 'Node: node-a, Resource: pvc-6f162b88-f2a7-47f7-b0ee-2b3b92241c19'; Reports: '[69A55BB6-00000-000008]'
```
- PV name: pvc-6f162b88-f2a7-47f7-b0ee-2b3b92241c19
- StorageClass: piraeus-storage
- Note: node-a, node-b, node-c in the message are generic placeholders for the actual node names.
DRBD cleanup output (during uninstall)
When we run DRBD cleanup (after the PVCs have been deleted), we see:
- On one satellite node (the one where the NFS server pod had run): `drbdsetup status` shows "No currently configured DRBD found", so the volume does not appear there when we run cleanup.
- On two other satellite nodes: `drbdsetup status` shows a different PV resource (another volume); we run `drbdsetup down` for that one successfully.
- The stuck volume (pvc-6f162b88...) does not appear in `drbdsetup status` on any satellite when we run cleanup, so we never run `drbdsetup down` for it.
```
=== DRBD cleanup: namespace <privileged-namespace> ===
--- Pod: linstor-satellite.node-a-... ---
DRBD status (before cleanup):
# No currently configured DRBD found.
Pod linstor-satellite.node-a-...: DRBD cleanup done
--- Pod: linstor-satellite.node-b-... ---
DRBD status (before cleanup):
pvc-other-volume-id role:Secondary
  disk:Inconsistent open:no
  node-c connection:StandAlone
Pod linstor-satellite.node-b-...: drbdsetup down pvc-other-volume-id
-> down OK
Pod linstor-satellite.node-b-...: DRBD cleanup done
--- Pod: linstor-satellite.node-c-... ---
DRBD status (before cleanup):
pvc-other-volume-id role:Secondary
  disk:UpToDate open:no
  node-b connection:StandAlone
Pod linstor-satellite.node-c-...: drbdsetup down pvc-other-volume-id
-> down OK
Pod linstor-satellite.node-c-...: DRBD cleanup done
=== DRBD cleanup (down) completed ===
=== DRBD status after cleanup (all satellite pods) ===
--- Pod: linstor-satellite.node-a-... ---
# No currently configured DRBD found.
--- Pod: linstor-satellite.node-b-... ---
# No currently configured DRBD found.
--- Pod: linstor-satellite.node-c-... ---
# No currently configured DRBD found.
=== DRBD wait completed ===
```
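The per-satellite loop behind this log boils down to "list resources, filter `pvc-*`, down each". The filtering step is plain text processing, so it can be sketched (and exercised against output like the above) as:

```shell
# Extract pvc-* resource names from `drbdsetup status` output. Resource
# names appear at the start of a line; per-volume detail lines are indented
# (or at least do not start with "pvc-").
pvc_resources() {
  awk '/^pvc-/ { print $1 }'
}

# Usage on a satellite pod (namespace and pod variable are placeholders):
#   kubectl -n piraeus exec "$pod" -- drbdsetup status \
#     | pvc_resources \
#     | xargs -r -n1 kubectl -n piraeus exec "$pod" -- drbdsetup down
```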
Questions
1. Uninstall order: we run DRBD cleanup after deleting the NFS server and NFS provisioner PVCs. Is this order correct, or should `drbdsetup down` run before we delete those PVCs? What is the recommended sequence?
2. Why does deletion fail? What causes "Tie breaker marked for deletion" and "Failed to create meta-data for DRBD volume .../0" on the secondary nodes during volume delete?
3. DRBD not visible on one node: when we run DRBD cleanup (after the pod using the volume is gone), `drbdsetup status` shows "No currently configured DRBD found" on that node for this volume, so we cannot run `drbdsetup down` for it. Should the CSI/Linstor controller still be able to delete the volume in this case? Is there a recommended way to clean up such a volume (e.g. via the Linstor CLI)?
4. Recommended flow: what is the recommended sequence (workload deletion, PVC deletion, DRBD cleanup, Linstor teardown) so that PVs do not get stuck in Released?
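For question 3 specifically, we assume the escape hatch would be the LINSTOR client rather than `drbdsetup`, along these lines. The namespace and the `linstor-controller` deployment name are our assumptions; the error-report ID is taken from the events above; with `DRY_RUN=1` (the default here) nothing is executed. Is this the recommended route?

```shell
# Sketch: clean up the stuck volume via the LINSTOR client instead of
# drbdsetup. NS and the controller deployment name are assumptions for
# our install; DRY_RUN=1 (the default) only prints the commands.
NS=piraeus
PV=pvc-6f162b88-f2a7-47f7-b0ee-2b3b92241c19

lc() {
  if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ linstor $*"; return 0; fi
  kubectl -n "$NS" exec deploy/linstor-controller -- linstor "$@"
}

linstor_cleanup() {
  lc resource list -r "$PV"                    # does LINSTOR still track it?
  lc error-reports show 69A55B3B-D89E6-000019  # report ID from the PV events above
  lc resource-definition delete "$PV"          # remove the resource on all nodes
}
```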
Thank you.