PV stuck in released state after uninstalling linstor cluster #246

@saarthak18

Description

PV stuck in Released state after uninstall (Linstor CSI)

Summary

When uninstalling workloads that use Linstor CSI (piraeus-storage StorageClass), a PersistentVolume remains in the Released state. The Linstor CSI controller reports VolumeFailedDelete with tie-breaker and DRBD meta-data errors. Our uninstall runs DRBD cleanup (drbdsetup down) after deleting the PVCs, including the ones used by the NFS server and the NFS provisioner. We would like to know whether this order is correct and what the recommended sequence is.


Environment

  • Storage: Linstor CSI (piraeus-storage StorageClass).
  • Reclaim policy: Delete.
  • Cluster: Multi-node; Linstor satellites on several nodes.
  • Affected PVC: A PVC used by an NFS server (backed by Piraeus Storage Class). Another PVC is used by an NFS subdir provisioner.

Observed behaviour

  1. After deployment, the PV is Bound, but the pod that mounts it reports an attach error:

 Normal   Scheduled           2m11s                default-scheduler        Successfully assigned nsp-psa-privileged/nfs-server-5c98544f8-fn4dp to leek-node5
  Warning  FailedAttachVolume  64s (x8 over 2m11s)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-6f162b88-f2a7-47f7-b0ee-2b3b92241c19" : rpc error: code = Internal desc = ControllerPublishVolume failed for pvc-6f162b88-f2a7-47f7-b0ee-2b3b92241c19: could not determine device path

  2. After uninstall, the PV moves to Released and is never deleted.
  3. kubectl describe pv shows repeated VolumeFailedDelete events from the Linstor CSI controller.
  4. Re-deploying and uninstalling again reproduces the issue; the PV remains Released.
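To capture the symptom above for a report, something like the following can be used (a minimal sketch, assuming kubectl access to the affected cluster; it only defines a function and touches nothing until called):

```shell
# Sketch: list PVs stuck in the Released phase and dump their event log,
# which is where the repeated VolumeFailedDelete messages show up.
dump_released_pvs() {
    # Select PVs by phase via jsonpath (more robust than parsing columns)
    kubectl get pv -o jsonpath='{range .items[?(@.status.phase=="Released")]}{.metadata.name}{"\n"}{end}' \
    | while read -r pv; do
        echo "=== $pv ==="
        # Print only the Events section of the describe output
        kubectl describe pv "$pv" | sed -n '/^Events:/,$p'
    done
}
```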

Uninstall order we use (original)

We uninstall in this order:

  1. Delete application workloads and their PVCs (except the NFS server PVC and NFS provisioner PVC).
  2. Wait for those PVs to be cleaned up.
  3. Delete the NFS server and NFS provisioner workloads (pods are gone; the Linstor-backed PVCs are no longer in use).
  4. Delete all remaining PVCs, including the NFS server PVC and NFS provisioner PVC.
  5. DRBD cleanup: For each Linstor satellite pod, run drbdsetup status, then drbdsetup down <resource> for each pvc-* resource shown. Remove lost-quorum taint. Wait until no pvc-* DRBD resources remain.
  6. Linstor cleanup: Remove Linstor cluster and Piraeus operator (CSI controller is removed here).

So we run DRBD cleanup (step 5) after we have already deleted the NFS server and NFS provisioner PVCs (step 4).
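For reference, the DRBD cleanup in step 5 is roughly the following (a sketch, not our exact script; the satellite pod label `app.kubernetes.io/component=linstor-satellite` and the namespace variable are placeholder assumptions, not verified names):

```shell
# Assumed namespace of the Linstor satellite pods (placeholder)
NS="${NS:-piraeus-datastore}"

# Step 5 sketch: on each satellite pod, show drbdsetup status and then
# run `drbdsetup down` for every pvc-* resource it reports.
drbd_cleanup() {
    kubectl -n "$NS" get pods -l app.kubernetes.io/component=linstor-satellite \
        -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' \
    | while read -r pod; do
        echo "--- Pod: $pod ---"
        kubectl -n "$NS" exec "$pod" -- drbdsetup status
        # Resource names are the first field of each top-level status line
        kubectl -n "$NS" exec "$pod" -- drbdsetup status \
            | awk '/^pvc-/ {print $1}' \
            | while read -r res; do
                kubectl -n "$NS" exec "$pod" -- drbdsetup down "$res"
            done
    done
}
```

The function is define-only, so the commands only run when `drbd_cleanup` is invoked against a live cluster.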

Question: Is this order correct? Should DRBD cleanup (drbdsetup down) run before we delete the NFS server PVC and NFS provisioner PVC, or is it correct to run it after those PVCs are deleted? What is the recommended uninstall sequence when using Linstor CSI with workloads (e.g. NFS server) that use Linstor-backed PVCs?


PV describe output (events)

Warning  VolumeFailedDelete  37m  linstor.csi.linbit.com_linstor-csi-controller-...  rpc error: code = Internal desc = failed to delete volume: Message: 'Tie breaker marked for deletion' next error: Message: 'Node: node-a, Resource: pvc-6f162b88-f2a7-47f7-b0ee-2b3b92241c19 preparing for deletion.'; Details: 'Node: node-a, Resource: pvc-6f162b88-f2a7-47f7-b0ee-2b3b92241c19 UUID is: ed46abec-7a4e-4f9c-b41e-3d27866994a0' next error: Message: 'Preparing deletion of resource on 'node-a'' next error: Message: '(node-b) Failed to create meta-data for DRBD volume pvc-6f162b88-f2a7-47f7-b0ee-2b3b92241c19/0'; Reports: '[69A55B3B-D89E6-000019]' next error: Message: '(node-c) Failed to create meta-data for DRBD volume pvc-6f162b88-f2a7-47f7-b0ee-2b3b92241c19/0'; Reports: '[69A55B3E-AFF24-000020]' next error: Message: 'Deletion of resource 'pvc-6f162b88-f2a7-47f7-b0ee-2b3b92241c19' on node 'node-a' failed due to an unhandled exception of type DelayedApiRcException. Exceptions have been converted to responses'; Details: 'Node: node-a, Resource: pvc-6f162b88-f2a7-47f7-b0ee-2b3b92241c19'; Reports: '[69A55BB6-00000-000008]'
  • PV name: pvc-6f162b88-f2a7-47f7-b0ee-2b3b92241c19
  • StorageClass: piraeus-storage
  • Note: node-a, node-b, node-c in the message are generic placeholders for the actual node names.

DRBD cleanup output (during uninstall)

When we run DRBD cleanup (after the PVCs have been deleted), we see:

  • On one satellite node (the one where the NFS server pod had run): drbdsetup status shows "No currently configured DRBD found" — so the volume does not appear there when we run cleanup.
  • On two other satellite nodes: drbdsetup status shows a different PV resource (another volume). We run drbdsetup down for that one successfully.
  • The stuck volume (pvc-6f162b88...) does not appear in drbdsetup status on any satellite when we run cleanup, so we never run drbdsetup down for it.
=== DRBD cleanup: namespace <privileged-namespace> ===
--- Pod: linstor-satellite.node-a-... ---
DRBD status (before cleanup):
# No currently configured DRBD found.
Pod linstor-satellite.node-a-...: DRBD cleanup done
--- Pod: linstor-satellite.node-b-... ---
DRBD status (before cleanup):
pvc-other-volume-id role:Secondary
  disk:Inconsistent open:no
  node-c connection:StandAlone

Pod linstor-satellite.node-b-...: drbdsetup down pvc-other-volume-id
  -> down OK
Pod linstor-satellite.node-b-...: DRBD cleanup done
--- Pod: linstor-satellite.node-c-... ---
DRBD status (before cleanup):
pvc-other-volume-id role:Secondary
  disk:UpToDate open:no
  node-b connection:StandAlone

Pod linstor-satellite.node-c-...: drbdsetup down pvc-other-volume-id
  -> down OK
Pod linstor-satellite.node-c-...: DRBD cleanup done
=== DRBD cleanup (down) completed ===

=== DRBD status after cleanup (all satellite pods) ===
--- Pod: linstor-satellite.node-a-... ---
# No currently configured DRBD found.
--- Pod: linstor-satellite.node-b-... ---
# No currently configured DRBD found.
--- Pod: linstor-satellite.node-c-... ---
# No currently configured DRBD found.
=== DRBD wait completed ===

Questions

  1. Uninstall order: We run DRBD cleanup after deleting the NFS server and NFS provisioner PVCs. Is this order correct, or should drbdsetup down run before we delete those PVCs? What is the recommended sequence?

  2. Why does deletion fail? What causes "Tie breaker marked for deletion" and "Failed to create meta-data for DRBD volume .../0" on the secondary nodes during volume delete?

  3. DRBD not visible on one node: When we run DRBD cleanup (after the pod using the volume is gone), we see "No currently configured DRBD found" on that node for this volume, so we cannot run drbdsetup down for it. Should the CSI/Linstor controller still be able to delete the volume in this case? Is there a recommended way to clean up such a volume (e.g. Linstor CLI)?

  4. Recommended flow: What is the recommended sequence (workload deletion, PVC deletion, DRBD cleanup, Linstor teardown) so that PVs do not get stuck in Released?
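Regarding question 3, we could also inspect the stuck resource from the LINSTOR side rather than per-node drbdsetup. A hedged sketch (the controller pod label, namespace, and the idea that this is the right diagnostic are assumptions on our part):

```shell
# Placeholders: namespace and the stuck resource name from this issue
NS="${NS:-piraeus-datastore}"
RES="pvc-6f162b88-f2a7-47f7-b0ee-2b3b92241c19"

# Sketch: ask the LINSTOR controller what it still knows about the
# resource (including tie-breaker replicas that no longer show up in
# drbdsetup status on any satellite).
linstor_inspect() {
    ctrl=$(kubectl -n "$NS" get pods -l app.kubernetes.io/component=linstor-controller \
        -o jsonpath='{.items[0].metadata.name}')
    # Nodes on which LINSTOR believes the resource still exists
    kubectl -n "$NS" exec "$ctrl" -- linstor resource list
    # Resource definitions that must be gone before the PV can be deleted
    kubectl -n "$NS" exec "$ctrl" -- linstor resource-definition list
}
```

If the resource definition for $RES still appears here while drbdsetup shows nothing on any node, that would explain why the CSI delete keeps failing; we would appreciate guidance on whether deleting it via the LINSTOR CLI is the recommended cleanup.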

Thank you.
