We've tested local disk deletion after expunging disks on racklets, and it appeared to work fine: the disk is logically deleted in the database and disk accounting shows reduced utilization. However, the disk-delete saga never actually completes because it is unable to delete the local volume backend. The saga can stay running indefinitely, until the Nexus zone is gone, and it doesn't affect anything else until we need to quiesce that Nexus. We have exactly that situation today on rack2. Unlike racklets, rack2 is not wiped clean and is mostly upgraded via Nexus-driven update, so this is the first time we have run into this blocking situation.
root@oxz_switch0:~# omdb nexus background-tasks show blueprint_planner 2>/dev/null
task: "blueprint_planner"
configured period: every 1m
currently executing: no
last completed activation: iter 1162, triggered by a dependent task completing
started at 2026-04-04T01:18:19.362Z (36s ago) and ran for 687ms
plan unchanged from parent 29fe25f9-db93-4c3e-9e34-dbe2564758a5
note: 3255/5000 blueprints in database
planning report:
* 3 remaining out-of-date zones
* 3 zones waiting to be expunged:
* zone 03ee5ea0-a003-4ff3-9125-bf54d41b1868: nexus image out-of-date, but nexus_generation 46 is still active
* zone 84be6867-c3b1-4f54-92c8-1ba3390a9ff7: nexus image out-of-date, but nexus_generation 46 is still active
* zone f0d6e08d-0cdb-4f83-8f50-6115f2ebfb84: nexus image out-of-date, but nexus_generation 46 is still active
* will ensure cockroachdb setting: "22.1"
01:22:35.803Z WARN 03ee5ea0-a003-4ff3-9125-bf54d41b1868 (ServerContext): not yet quiesced
error = at least one Nexus instance is not drained as of blueprint Some(29fe25f9-db93-4c3e-9e34-dbe2564758a5 (blueprint)): Nexus 84be6867-c3b1-4f54-92c8-1ba3390a9ff7
file = nexus/src/app/quiesce.rs:193
root@oxz_switch0:~# OMDB_NEXUS_URL=http://[fd00:1122:3344:108::6e]:12232 omdb nexus quiesce show
note: using Nexus URL http://[fd00:1122:3344:108::6e]:12232
quiescing since 2026-04-03T23:20:51.574Z (2h 7m 42s 219ms ago)
details: waiting for running sagas to finish
saga quiesce:
new sagas: DisallowedQuiesce
drained as of blueprint: none
blueprint for last completed recovery pass: 29fe25f9-db93-4c3e-9e34-dbe2564758a5
blueprint for last reassignment pass: 29fe25f9-db93-4c3e-9e34-dbe2564758a5
reassignment generation: 1 (pass running: no)
recovered generation: 1
recovered at least once successfully: yes
recovery pending: no
sagas running: 1
saga 16933790-0e7c-402a-bb4e-7f95b662023d pending since 2026-04-03T20:45:16.673Z (disk-delete)
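When several Nexus instances need checking, the blocking saga can be pulled out of this output with a small script. The following is a minimal sketch, not part of omdb: `pending_sagas` is a hypothetical helper, and the regular expression is inferred only from the single `saga ... pending since ...` line shown above, so it may need adjusting if the output format differs.

```python
import re

# Sample lines copied from the `omdb nexus quiesce show` output above.
SAMPLE = """\
sagas running: 1
saga 16933790-0e7c-402a-bb4e-7f95b662023d pending since 2026-04-03T20:45:16.673Z (disk-delete)
"""

# Assumption: this pattern is derived from the sample line only; it is not
# a documented omdb output format.
SAGA_LINE = re.compile(
    r"saga (?P<id>[0-9a-f-]{36}) pending since (?P<since>\S+) \((?P<name>[^)]+)\)"
)


def pending_sagas(text):
    """Return (saga_id, pending_since, saga_name) for each pending-saga line."""
    return [(m["id"], m["since"], m["name"]) for m in SAGA_LINE.finditer(text)]


if __name__ == "__main__":
    for saga_id, since, name in pending_sagas(SAMPLE):
        print(f"{name} saga {saga_id} pending since {since}")
```

The saga ID it returns is the one to pass to `omdb db saga show`.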
root@oxz_switch0:~# omdb db saga show 16933790-0e7c-402a-bb4e-7f95b662023d 2>/dev/null
id | current_sec | time_created | name | state
--------------------------------------+--------------------------------------+--------------------------+-------------+---------
16933790-0e7c-402a-bb4e-7f95b662023d | 84be6867-c3b1-4f54-92c8-1ba3390a9ff7 | 2026-04-01T21:08:26.353Z | disk-delete | Running
DAG: {"end_node":5,"graph":{"edge_property":"directed","edges":[[0,1,null],[1,2,null],[2,3,null],[4,0,null],[3,5,null]],"node_holes":[],"nodes":[{"Action":{"action_name":"disk_delete.delete_disk_record","label":"DeleteDiskRecord","name":"deleted_disk"}},{"Action":{"action_name":"disk_delete.space_account","label":"SpaceAccount","name":"no_result1"}},{"Action":{"action_name":"disk_delete.delete_local_storage","label":"DeleteLocalStorage","name":"delete_local_storage"}},{"Action":{"action_name":"disk_delete.deallocate_local_storage","label":"DeallocateLocalStorage","name":"deallocate_local_storage"}},{"Start":{"params":{"disk":{"LocalStorage":{"disk":{"attach_instance_id":null,"block_size":"AdvancedFormat","disk_state":"detached","disk_type":"LocalStorage","identity":{"description":"application ephemeral data disk","id":"ce87e9f1-20c3-4ba8-bcae-a97386854284","name":"app-data2-17","time_created":"2026-04-01T18:22:33.224012Z","time_deleted":null,"time_modified":"2026-04-01T18:22:33.224012Z"},"project_id":"fe0da422-5c48-4b52-8010-f2fc401f090f","rcgen":1,"size":53687091200,"slot":null,"state_generation":2,"time_state_updated":"2026-04-01T18:22:33.326067Z"},"disk_type_local_storage":{"disk_id":"ce87e9f1-20c3-4ba8-bcae-a97386854284","local_storage_dataset_allocation_id":null,"local_storage_unencrypted_dataset_allocation_id":"f6f9b6a1-3322-48af-b74a-e00b49d33eeb","required_dataset_overhead":3670016000},"local_storage_dataset_allocation":{"Unencrypted":{"dataset_size":57357107200,"id":"f6f9b6a1-3322-48af-b74a-e00b49d33eeb","local_storage_unencrypted_dataset_id":"d0f6c1a2-fbee-4a39-b861-3fdce10e475a","pool_id":"f522118c-5dcd-4116-8044-07f0cceec52e","sled_id":"87c2c4fc-b0c7-4fef-a305-78f0ed265bbc","time_created":"2026-04-01T18:23:07.637294Z","time_deleted":null}}}},"project_id":"fe0da422-5c48-4b52-8010-f2fc401f090f","serialized_authn":{"kind":{"Authenticated":[{"actor":{"SiloUser":{"silo_id":"7bd7623a-68ed-4636-8ecb-b59e3b068787","silo_user_id":"906e74cb-eab8-4a87-bda8-2cb0914bf853"}},"credential_id":"a84b123a-cad9-4c49-8d89-e67a0356d843","device_token_expiration":null},{"mapped_fleet_roles":{"admin":["admin"]}}]}}}}},"End"]},"saga_name":"disk-delete","start_node":4}
event time | sub saga | node id | event type | data
------------------------ | -------- | ------------------------------------- | ---------- | ---
2026-04-01T21:08:26.357Z | | 4: start | started |
2026-04-01T21:08:26.362Z | | 4: start | succeeded |
2026-04-01T21:08:26.366Z | | 0: disk_delete.delete_disk_record | started |
2026-04-01T21:08:26.374Z | | 0: disk_delete.delete_disk_record | succeeded | "deleted_disk" => {"attach_instance_id":null,"block_size":"AdvancedFormat","disk_state":"detached","disk_type":"LocalStorage","identity":{"description":"application ephemeral data disk","id":"ce87e9f1-20c3-4ba8-bcae-a97386854284","name":"app-data2-17","time_created":"2026-04-01T18:22:33.224012Z","time_deleted":null,"time_modified":"2026-04-01T18:22:33.224012Z"},"project_id":"fe0da422-5c48-4b52-8010-f2fc401f090f","rcgen":1,"size":53687091200,"slot":null,"state_generation":2,"time_state_updated":"2026-04-01T18:22:33.326067Z"}
2026-04-01T21:08:26.376Z | | 1: disk_delete.space_account | started |
2026-04-01T21:08:26.396Z | | 1: disk_delete.space_account | succeeded |
2026-04-01T21:08:26.400Z | | 2: disk_delete.delete_local_storage | started |
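The stuck step can be read off the event log mechanically: it is the node whose most recent event is `started`. The sketch below illustrates that check on the rows above, with the data column omitted. `unfinished_nodes` is a hypothetical helper, and the pipe-separated column layout is assumed from this one sample of `omdb db saga show` output.

```python
# Event rows copied from the saga log above (data column omitted for brevity).
EVENTS = """\
event time | sub saga | node id | event type | data
2026-04-01T21:08:26.357Z | | 4: start | started |
2026-04-01T21:08:26.362Z | | 4: start | succeeded |
2026-04-01T21:08:26.366Z | | 0: disk_delete.delete_disk_record | started |
2026-04-01T21:08:26.374Z | | 0: disk_delete.delete_disk_record | succeeded |
2026-04-01T21:08:26.376Z | | 1: disk_delete.space_account | started |
2026-04-01T21:08:26.396Z | | 1: disk_delete.space_account | succeeded |
2026-04-01T21:08:26.400Z | | 2: disk_delete.delete_local_storage | started |
"""


def unfinished_nodes(text):
    """Return nodes whose most recent logged event is 'started'."""
    last_event = {}  # node label -> latest event type, in first-seen order
    for line in text.splitlines():
        # Split into at most five columns so any '|' in the data column is kept.
        cols = [c.strip() for c in line.split("|", 4)]
        # Skip header, separator, and blank lines: real rows have "N: name".
        if len(cols) < 4 or ":" not in cols[2]:
            continue
        last_event[cols[2]] = cols[3]
    return [node for node, event in last_event.items() if event == "started"]


if __name__ == "__main__":
    for node in unfinished_nodes(EVENTS):
        print("still running:", node)
```

On the rows above this reports node 2, `disk_delete.delete_local_storage`, which matches the DAG: node 3 (`disk_delete.deallocate_local_storage`) and the end node 5 have no events because they come after the step that has not finished.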
I saw that the final Nexus handoff step in the online update was not making progress (the `blueprint_planner` output above), and there were warnings about some Nexus instances not being drained. @jgallagher found that the Nexus in question was still quiescing, waiting for a running disk-delete saga to finish (the `omdb nexus quiesce show` output above). The current state of the saga is shown in the `omdb db saga show` output above.