sagas hang after local disk deletion from expunged disk #10222

@askfongjojo

We've tested local disk deletion after expunging disks on racklets, and it appeared to work fine: the disk is logically deleted in the database and disk accounting shows reduced utilization. However, the disk-delete saga never actually completes because it is unable to delete the local volume backend. The saga can stay running indefinitely, until the Nexus zone is gone, and it doesn't affect anything else until we need to quiesce that Nexus. We have exactly that situation today on rack2. Unlike racklets, rack2 is not wiped clean and is mostly upgraded via Nexus-driven update, so this is the first time we've hit this blocking situation.

I saw that the final Nexus handoff step in the online update was not making progress:

root@oxz_switch0:~# omdb nexus background-tasks show blueprint_planner 2>/dev/null
task: "blueprint_planner"
  configured period: every 1m
  currently executing: no
  last completed activation: iter 1162, triggered by a dependent task completing
    started at 2026-04-04T01:18:19.362Z (36s ago) and ran for 687ms
    plan unchanged from parent 29fe25f9-db93-4c3e-9e34-dbe2564758a5
    note: 3255/5000 blueprints in database
planning report:
* 3 remaining out-of-date zones
* 3 zones waiting to be expunged:
  * zone 03ee5ea0-a003-4ff3-9125-bf54d41b1868: nexus image out-of-date, but nexus_generation 46 is still active
  * zone 84be6867-c3b1-4f54-92c8-1ba3390a9ff7: nexus image out-of-date, but nexus_generation 46 is still active
  * zone f0d6e08d-0cdb-4f83-8f50-6115f2ebfb84: nexus image out-of-date, but nexus_generation 46 is still active
* will ensure cockroachdb setting: "22.1"

and there were warnings about some Nexus instances not being drained:

01:22:35.803Z WARN 03ee5ea0-a003-4ff3-9125-bf54d41b1868 (ServerContext): not yet quiesced
    error = at least one Nexus instance is not drained as of blueprint Some(29fe25f9-db93-4c3e-9e34-dbe2564758a5 (blueprint)): Nexus 84be6867-c3b1-4f54-92c8-1ba3390a9ff7
    file = nexus/src/app/quiesce.rs:193

@jgallagher found out that:

root@oxz_switch0:~# OMDB_NEXUS_URL=http://[fd00:1122:3344:108::6e]:12232 omdb nexus quiesce show
note: using Nexus URL http://[fd00:1122:3344:108::6e]:12232
quiescing since 2026-04-03T23:20:51.574Z (2h 7m 42s 219ms ago)
details: waiting for running sagas to finish
saga quiesce:
    new sagas: DisallowedQuiesce
    drained as of blueprint: none
    blueprint for last completed recovery pass: 29fe25f9-db93-4c3e-9e34-dbe2564758a5
    blueprint for last reassignment pass: 29fe25f9-db93-4c3e-9e34-dbe2564758a5
    reassignment generation: 1 (pass running: no)
    recovered generation: 1
    recovered at least once successfully: yes
    recovery pending: no
    sagas running: 1
        saga 16933790-0e7c-402a-bb4e-7f95b662023d pending since 2026-04-03T20:45:16.673Z (disk-delete)

This is the current state of the saga:

root@oxz_switch0:~# omdb db saga show 16933790-0e7c-402a-bb4e-7f95b662023d 2>/dev/null
 id                                   | current_sec                          | time_created             | name        | state   
--------------------------------------+--------------------------------------+--------------------------+-------------+---------
 16933790-0e7c-402a-bb4e-7f95b662023d | 84be6867-c3b1-4f54-92c8-1ba3390a9ff7 | 2026-04-01T21:08:26.353Z | disk-delete | Running 

DAG: {"end_node":5,"graph":{"edge_property":"directed","edges":[[0,1,null],[1,2,null],[2,3,null],[4,0,null],[3,5,null]],"node_holes":[],"nodes":[{"Action":{"action_name":"disk_delete.delete_disk_record","label":"DeleteDiskRecord","name":"deleted_disk"}},{"Action":{"action_name":"disk_delete.space_account","label":"SpaceAccount","name":"no_result1"}},{"Action":{"action_name":"disk_delete.delete_local_storage","label":"DeleteLocalStorage","name":"delete_local_storage"}},{"Action":{"action_name":"disk_delete.deallocate_local_storage","label":"DeallocateLocalStorage","name":"deallocate_local_storage"}},{"Start":{"params":{"disk":{"LocalStorage":{"disk":{"attach_instance_id":null,"block_size":"AdvancedFormat","disk_state":"detached","disk_type":"LocalStorage","identity":{"description":"application ephemeral data disk","id":"ce87e9f1-20c3-4ba8-bcae-a97386854284","name":"app-data2-17","time_created":"2026-04-01T18:22:33.224012Z","time_deleted":null,"time_modified":"2026-04-01T18:22:33.224012Z"},"project_id":"fe0da422-5c48-4b52-8010-f2fc401f090f","rcgen":1,"size":53687091200,"slot":null,"state_generation":2,"time_state_updated":"2026-04-01T18:22:33.326067Z"},"disk_type_local_storage":{"disk_id":"ce87e9f1-20c3-4ba8-bcae-a97386854284","local_storage_dataset_allocation_id":null,"local_storage_unencrypted_dataset_allocation_id":"f6f9b6a1-3322-48af-b74a-e00b49d33eeb","required_dataset_overhead":3670016000},"local_storage_dataset_allocation":{"Unencrypted":{"dataset_size":57357107200,"id":"f6f9b6a1-3322-48af-b74a-e00b49d33eeb","local_storage_unencrypted_dataset_id":"d0f6c1a2-fbee-4a39-b861-3fdce10e475a","pool_id":"f522118c-5dcd-4116-8044-07f0cceec52e","sled_id":"87c2c4fc-b0c7-4fef-a305-78f0ed265bbc","time_created":"2026-04-01T18:23:07.637294Z","time_deleted":null}}}},"project_id":"fe0da422-5c48-4b52-8010-f2fc401f090f","serialized_authn":{"kind":{"Authenticated":[{"actor":{"SiloUser":{"silo_id":"7bd7623a-68ed-4636-8ecb-b59e3b068787","silo_user_id":"906e74cb-eab8-4a87-bda8-2cb0914bf853"}},"credential_id":"a84b123a-cad9-4c49-8d89-e67a0356d843","device_token_expiration":null},{"mapped_fleet_roles":{"admin":["admin"]}}]}}}}},"End"]},"saga_name":"disk-delete","start_node":4}
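
The order in which the saga's nodes run can be read off the `edges` array in the DAG above. Here is a minimal Python sketch, with the edges and node labels transcribed by hand from that output. The helper name `linear_order` is hypothetical, and the walk assumes the graph is a simple chain with one successor per node, which holds for the edges shown here.

```python
# Edges and labels transcribed from the DAG output above
# (the third element of each edge, null, is dropped).
edges = [(0, 1), (1, 2), (2, 3), (4, 0), (3, 5)]
labels = {
    0: "DeleteDiskRecord",
    1: "SpaceAccount",
    2: "DeleteLocalStorage",
    3: "DeallocateLocalStorage",
    4: "Start",
    5: "End",
}
start_node, end_node = 4, 5


def linear_order(edges, start, end):
    """Walk a chain-shaped DAG from start to end.

    Assumes every node except `end` has exactly one outgoing edge.
    """
    successor = dict(edges)
    order = [start]
    while order[-1] != end:
        order.append(successor[order[-1]])
    return order


order = linear_order(edges, start_node, end_node)
print(" -> ".join(labels[n] for n in order))
```

`DeleteLocalStorage` comes after the record deletion and space accounting steps, which is consistent with the disk looking deleted and utilization dropping even though the saga never finishes.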

              event time | sub saga | node id                               | event type | data
------------------------ | -------- | ------------------------------------- | ---------- | ---
2026-04-01T21:08:26.357Z |          |                              4: start | started    | 
2026-04-01T21:08:26.362Z |          |                              4: start | succeeded  | 
2026-04-01T21:08:26.366Z |          |     0: disk_delete.delete_disk_record | started    | 
2026-04-01T21:08:26.374Z |          |     0: disk_delete.delete_disk_record | succeeded  | "deleted_disk" => {"attach_instance_id":null,"block_size":"AdvancedFormat","disk_state":"detached","disk_type":"LocalStorage","identity":{"description":"application ephemeral data disk","id":"ce87e9f1-20c3-4ba8-bcae-a97386854284","name":"app-data2-17","time_created":"2026-04-01T18:22:33.224012Z","time_deleted":null,"time_modified":"2026-04-01T18:22:33.224012Z"},"project_id":"fe0da422-5c48-4b52-8010-f2fc401f090f","rcgen":1,"size":53687091200,"slot":null,"state_generation":2,"time_state_updated":"2026-04-01T18:22:33.326067Z"}
2026-04-01T21:08:26.376Z |          |          1: disk_delete.space_account | started    | 
2026-04-01T21:08:26.396Z |          |          1: disk_delete.space_account | succeeded  | 
2026-04-01T21:08:26.400Z |          |   2: disk_delete.delete_local_storage | started    | 
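
To pick out the stuck node from the event log above, one can look for nodes that have a `started` event but no terminal event. Here is a small Python sketch over the (node, event type) pairs transcribed from the table. The function name `in_flight` is hypothetical, and treating `failed` as a terminal event name is an assumption, since only `started` and `succeeded` appear in this log.

```python
# (node, event type) pairs transcribed from the saga event log above.
events = [
    ("4: start", "started"),
    ("4: start", "succeeded"),
    ("0: disk_delete.delete_disk_record", "started"),
    ("0: disk_delete.delete_disk_record", "succeeded"),
    ("1: disk_delete.space_account", "started"),
    ("1: disk_delete.space_account", "succeeded"),
    ("2: disk_delete.delete_local_storage", "started"),
]

# Assumption: "failed" is a guess at another terminal event name;
# only "succeeded" is actually present in the log above.
TERMINAL = {"succeeded", "failed"}


def in_flight(events):
    """Return nodes that have started but have no terminal event yet."""
    started, finished = [], set()
    for node, kind in events:
        if kind == "started":
            started.append(node)
        elif kind in TERMINAL:
            finished.add(node)
    return [node for node in started if node not in finished]


print(in_flight(events))
```

For the log above, the only node left in flight is `2: disk_delete.delete_local_storage`, the step that deletes the local volume backend.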
