
Do orphaned inode extent freeing in chunks#231

Open
chriskirby1 wants to merge 1 commit into main from clk/truncate_breakup

Conversation

@chriskirby1
Contributor

Fix an issue where the final freeing of extents for unlinked files could trigger spurious hung task timeout warnings. Instead of holding the scoutfs inode lock for the entire duration of extent freeing, release and reacquire the lock every time the transaction sequence number changes.

This is only used in the evict and orphan inode cleanup paths; truncate and release continue to free during a single lock hold.
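
The release/reacquire pattern described above can be modeled in a small userspace C sketch. Every name here is illustrative, not the actual scoutfs code, and the "commit every 4 frees" cadence is invented for the test:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace model of the chunked freeing described above: hold the lock
 * only while the transaction sequence number stays put, and drop and
 * reacquire it whenever the sequence advances so waiters can get in.
 */
struct model {
	uint64_t trans_seq;		/* advances as transactions commit */
	uint64_t extents_left;		/* extents still to free */
	unsigned int lock_acquires;	/* times the inode lock was taken */
};

static void model_lock(struct model *m)   { m->lock_acquires++; }
static void model_unlock(struct model *m) { (void)m; }

/* free one extent; every 4th free commits a transaction in this model */
static void free_one_extent(struct model *m)
{
	m->extents_left--;
	if ((m->extents_left % 4) == 0)
		m->trans_seq++;
}

/* free all extents, cycling the lock whenever trans_seq changes */
static unsigned int chunked_free(struct model *m)
{
	uint64_t seq;

	model_lock(m);
	seq = m->trans_seq;
	while (m->extents_left > 0) {
		if (m->trans_seq != seq) {
			/* let other lock waiters in, then resume */
			model_unlock(m);
			model_lock(m);
			seq = m->trans_seq;
		}
		free_one_extent(m);
	}
	model_unlock(m);
	return m->lock_acquires;
}
```

The point of the hand-off is that tasks blocked on the same lock get a chance to run between transactions instead of waiting for the entire free to finish.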

Remove the hung_task_timeout_secs workaround from the large-fragmented-free test script.

@chriskirby1
Contributor Author

retest

@chriskirby1 force-pushed the clk/truncate_breakup branch from 497372d to 809dcde on June 11, 2025 14:38
@chriskirby1
Contributor Author

The test failures are either hung task timeouts on debug kernels or the try_drain_data_freed commit budget issue that's fixed by another PR. The hung task complaints are gone on non-debug kernels. This change also keeps what is essentially background cleanup work from blocking access to an entire inode shard, so I'm keeping the PR open.

@chriskirby1 requested review from aversecat and zabbo on June 11, 2025 17:34
@aversecat
Contributor

I would put a blank line before any s-o-b lines, for clarity. Having them right after the last sentence of the commit message makes them look munged.

Fix an issue where the final freeing of extents for unlinked files
could trigger spurious hung task timeout warnings. Instead of holding the
scoutfs inode lock for the entire duration of extent freeing, release and
reacquire the lock every time the transaction sequence number changes.

This is only used in the evict and orphan inode cleanup paths; truncate
and release continue to free during a single lock hold.

Remove the hung_task_timeout_secs workaround from the large-fragmented-free
test script.

Signed-off-by: Chris Kirby <ckirby@versity.com>
@chriskirby1 force-pushed the clk/truncate_breakup branch from 809dcde to 417261a on June 11, 2025 18:24
@aversecat
Contributor

I'm concerned that we still have not solved the actual issue here. I've hit the hung task timeout with v1.24-21-g497372d (your version before the rebase): all 4 orphan workers were stuck for 120s, even though the total test took 453s (so not slower overall).

Even if we're going to allow hung tasks to take drastically longer, it's still a sign that something is locked semi-permanently, while the goal of this patch is to lock/unlock every second or so... right?

not ok 19 - large-fragmented-free
 # large-fragmented-free ** failed **
 #
 # dmesg:
 #
 # [37970.881552] INFO: task kworker/14:1:2464557 blocked for more than 122 seconds.
 # [37970.885547]       Tainted: G           OE     -------  ---  5.14.0-503.35.1.el9_5.x86_64+debug #1
 # [37970.888859] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 # [37970.892168] task:kworker/14:1    state:D stack:22880 pid:2464557 tgid:2464557 ppid:2      flags:0x00004000
 # [37970.892185] Workqueue: events inode_orphan_scan_worker [scoutfs]
 # [37970.892257] Call Trace:
 # [37970.892261]  <TASK>
 # [37970.892272]  __schedule+0x778/0x1600
 # [37970.892287]  ? __pfx___schedule+0x10/0x10
 # [37970.892306]  ? schedule+0x24c/0x300
 # [37970.892318]  schedule+0xe8/0x300
 # [37970.892327]  lock_key_range+0x848/0x1740 [scoutfs]
 # [37970.892403]  ? __pfx_lock_key_range+0x10/0x10 [scoutfs]
 # [37970.892475]  ? __pfx_autoremove_wake_function+0x10/0x10
 # [37970.892494]  ? mark_lock.part.0+0xca/0xa50
 # [37970.892503]  ? __pfx_mark_lock.part.0+0x10/0x10
 # [37970.892515]  scoutfs_lock_ino+0x12e/0x170 [scoutfs]
 # [37970.892576]  ? __pfx_scoutfs_lock_ino+0x10/0x10 [scoutfs]
 # [37970.892646]  ? _raw_spin_unlock_irq+0x2f/0x50
 # [37970.892653]  ? __wait_for_common+0x2c6/0x5d0
 # [37970.892659]  ? __pfx_schedule_timeout+0x10/0x10
 # [37970.892672]  try_delete_inode_items.isra.0+0x123/0x480 [scoutfs]
 # [37970.892729]  ? submit_send.isra.0+0x5ae/0x940 [scoutfs]
 # [37970.892794]  ? __pfx_try_delete_inode_items.isra.0+0x10/0x10 [scoutfs]
 # [37970.892865]  ? __pfx_scoutfs_net_sync_request+0x10/0x10 [scoutfs]
 # [37970.892939]  ? __wait_for_common+0x3e3/0x5d0
 # [37970.892950]  ? scoutfs_omap_test+0x204/0x320 [scoutfs]
 # [37970.893013]  ? __pfx_scoutfs_omap_test+0x10/0x10 [scoutfs]
 # [37970.893079]  ? scoutfs_client_open_ino_map+0xf4/0x170 [scoutfs]
 # [37970.893136]  ? __pfx_scoutfs_client_open_ino_map+0x10/0x10 [scoutfs]
 # [37970.893198]  inode_orphan_scan_worker+0x4cc/0x720 [scoutfs]
 # [37970.893265]  ? __pfx_inode_orphan_scan_worker+0x10/0x10 [scoutfs]
 # [37970.893327]  ? local_clock_noinstr+0x9/0xc0
 # [37970.893394]  process_one_work+0x89e/0x1600
 # [37970.893425]  ? __lock_acquired+0x207/0x800
 # [37970.893434]  ? __pfx_process_one_work+0x10/0x10
 # [37970.893453]  ? assign_work+0x16c/0x240
 # [37970.893466]  worker_thread+0x586/0xd20
 # [37970.893485]  ? __pfx_worker_thread+0x10/0x10
 # [37970.893495]  kthread+0x2ab/0x360
 # [37970.893501]  ? _raw_spin_unlock_irq+0x24/0x50
 # [37970.893507]  ? __pfx_kthread+0x10/0x10
 # [37970.893516]  ret_from_fork+0x29/0x50
 # [37970.893540]  </TASK>
 # [37970.893547] INFO: task kworker/6:0:2593030 blocked for more than 122 seconds.
 # [37970.896847]       Tainted: G           OE     -------  ---  5.14.0-503.35.1.el9_5.x86_64+debug #1
 # [37970.900154] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 # [37970.903459] task:kworker/6:0     state:D stack:22456 pid:2593030 tgid:2593030 ppid:2      flags:0x00004000
 # [37970.903474] Workqueue: events inode_orphan_scan_worker [scoutfs]
 # [37970.903538] Call Trace:
 # [37970.903542]  <TASK>
 # [37970.903552]  __schedule+0x778/0x1600
 # [37970.903565]  ? __pfx___schedule+0x10/0x10
 # [37970.903584]  ? schedule+0x24c/0x300
 # [37970.903596]  schedule+0xe8/0x300
 # [37970.903604]  lock_key_range+0x848/0x1740 [scoutfs]
 # [37970.903679]  ? __pfx_lock_key_range+0x10/0x10 [scoutfs]
 # [37970.903741]  ? __pfx_autoremove_wake_function+0x10/0x10
 # [37970.903758]  ? mark_lock.part.0+0xca/0xa50
 # [37970.903767]  ? __pfx_mark_lock.part.0+0x10/0x10
 # [37970.903779]  scoutfs_lock_ino+0x12e/0x170 [scoutfs]
 # [37970.903840]  ? __pfx_scoutfs_lock_ino+0x10/0x10 [scoutfs]
 # [37970.903910]  ? _raw_spin_unlock_irq+0x2f/0x50
 # [37970.903917]  ? __wait_for_common+0x2c6/0x5d0
 # [37970.903923]  ? __pfx_schedule_timeout+0x10/0x10
 # [37970.903936]  try_delete_inode_items.isra.0+0x123/0x480 [scoutfs]
 # [37970.903993]  ? submit_send.isra.0+0x5ae/0x940 [scoutfs]
 # [37970.904059]  ? __pfx_try_delete_inode_items.isra.0+0x10/0x10 [scoutfs]
 # [37970.904130]  ? __pfx_scoutfs_net_sync_request+0x10/0x10 [scoutfs]
 # [37970.904205]  ? __wait_for_common+0x3e3/0x5d0
 # [37970.904215]  ? scoutfs_omap_test+0x204/0x320 [scoutfs]
 # [37970.904279]  ? __pfx_scoutfs_omap_test+0x10/0x10 [scoutfs]
 # [37970.904344]  ? scoutfs_client_open_ino_map+0xf4/0x170 [scoutfs]
 # [37970.904402]  ? __pfx_scoutfs_client_open_ino_map+0x10/0x10 [scoutfs]
 # [37970.904476]  inode_orphan_scan_worker+0x4cc/0x720 [scoutfs]
 # [37970.904544]  ? __pfx_inode_orphan_scan_worker+0x10/0x10 [scoutfs]
 # [37970.904607]  ? local_clock_noinstr+0x9/0xc0
 # [37970.904672]  process_one_work+0x89e/0x1600
 # [37970.904691]  ? __lock_acquired+0x207/0x800
 # [37970.904699]  ? __pfx_process_one_work+0x10/0x10
 # [37970.904718]  ? assign_work+0x16c/0x240
 # [37970.904730]  worker_thread+0x586/0xd20
 # [37970.904749]  ? __pfx_worker_thread+0x10/0x10
 # [37970.904759]  kthread+0x2ab/0x360
 # [37970.904765]  ? _raw_spin_unlock_irq+0x24/0x50
 # [37970.904771]  ? __pfx_kthread+0x10/0x10
 # [37970.904780]  ret_from_fork+0x29/0x50
 # [37970.904803]  </TASK>
 # [37970.904817] INFO: task kworker/2:2:2661545 blocked for more than 122 seconds.
 # [37970.908126]       Tainted: G           OE     -------  ---  5.14.0-503.35.1.el9_5.x86_64+debug #1
 # [37970.911431] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 # [37970.914703] task:kworker/2:2     state:D stack:22152 pid:2661545 tgid:2661545 ppid:2      flags:0x00004000
 # [37970.914717] Workqueue: events inode_orphan_scan_worker [scoutfs]
 # [37970.914780] Call Trace:
 # [37970.914784]  <TASK>
 # [37970.914793]  __schedule+0x778/0x1600
 # [37970.914807]  ? __pfx___schedule+0x10/0x10
 # [37970.914825]  ? schedule+0x24c/0x300
 # [37970.914837]  schedule+0xe8/0x300
 # [37970.914845]  lock_key_range+0x848/0x1740 [scoutfs]
 # [37970.914920]  ? __pfx_lock_key_range+0x10/0x10 [scoutfs]
 # [37970.914981]  ? __pfx_autoremove_wake_function+0x10/0x10
 # [37970.914998]  ? mark_lock.part.0+0xca/0xa50
 # [37970.915007]  ? __pfx_mark_lock.part.0+0x10/0x10
 # [37970.915019]  scoutfs_lock_ino+0x12e/0x170 [scoutfs]
 # [37970.915079]  ? __pfx_scoutfs_lock_ino+0x10/0x10 [scoutfs]
 # [37970.915149]  ? _raw_spin_unlock_irq+0x2f/0x50
 # [37970.915156]  ? __wait_for_common+0x2c6/0x5d0
 # [37970.915162]  ? __pfx_schedule_timeout+0x10/0x10
 # [37970.915175]  try_delete_inode_items.isra.0+0x123/0x480 [scoutfs]
 # [37970.915233]  ? submit_send.isra.0+0x5ae/0x940 [scoutfs]
 # [37970.915298]  ? __pfx_try_delete_inode_items.isra.0+0x10/0x10 [scoutfs]
 # [37970.915369]  ? __pfx_scoutfs_net_sync_request+0x10/0x10 [scoutfs]
 # [37970.915454]  ? __wait_for_common+0x3e3/0x5d0
 # [37970.915465]  ? scoutfs_omap_test+0x204/0x320 [scoutfs]
 # [37970.915529]  ? __pfx_scoutfs_omap_test+0x10/0x10 [scoutfs]
 # [37970.915594]  ? scoutfs_client_open_ino_map+0xf4/0x170 [scoutfs]
 # [37970.915652]  ? __pfx_scoutfs_client_open_ino_map+0x10/0x10 [scoutfs]
 # [37970.915715]  inode_orphan_scan_worker+0x4cc/0x720 [scoutfs]
 # [37970.915782]  ? __pfx_inode_orphan_scan_worker+0x10/0x10 [scoutfs]
 # [37970.915844]  ? local_clock_noinstr+0x9/0xc0
 # [37970.915909]  process_one_work+0x89e/0x1600
 # [37970.915928]  ? __lock_acquired+0x207/0x800
 # [37970.915936]  ? __pfx_process_one_work+0x10/0x10
 # [37970.915955]  ? assign_work+0x16c/0x240
 # [37970.915967]  worker_thread+0x586/0xd20
 # [37970.915986]  ? __pfx_worker_thread+0x10/0x10
 # [37970.915996]  kthread+0x2ab/0x360
 # [37970.916002]  ? _raw_spin_unlock_irq+0x24/0x50
 # [37970.916008]  ? __pfx_kthread+0x10/0x10
 # [37970.916017]  ret_from_fork+0x29/0x50
 # [37970.916040]  </TASK>
 # [37970.916049] INFO: task kworker/0:3:2674586 blocked for more than 122 seconds.
 # [37970.919320]       Tainted: G           OE     -------  ---  5.14.0-503.35.1.el9_5.x86_64+debug #1
 # [37970.922599] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 # [37970.925880] task:kworker/0:3     state:D stack:22880 pid:2674586 tgid:2674586 ppid:2      flags:0x00004000
 # [37970.925894] Workqueue: events inode_orphan_scan_worker [scoutfs]
 # [37970.925956] Call Trace:
 # [37970.925960]  <TASK>
 # [37970.925969]  __schedule+0x778/0x1600
 # [37970.925982]  ? __pfx___schedule+0x10/0x10
 # [37970.926001]  ? schedule+0x24c/0x300
 # [37970.926012]  schedule+0xe8/0x300
 # [37970.926021]  lock_key_range+0x848/0x1740 [scoutfs]
 # [37970.926096]  ? __pfx_lock_key_range+0x10/0x10 [scoutfs]
 # [37970.926157]  ? __pfx_autoremove_wake_function+0x10/0x10
 # [37970.926174]  ? mark_lock.part.0+0xca/0xa50
 # [37970.926183]  ? __pfx_mark_lock.part.0+0x10/0x10
 # [37970.926195]  scoutfs_lock_ino+0x12e/0x170 [scoutfs]
 # [37970.926256]  ? __pfx_scoutfs_lock_ino+0x10/0x10 [scoutfs]
 # [37970.926326]  ? _raw_spin_unlock_irq+0x2f/0x50
 # [37970.926333]  ? __wait_for_common+0x2c6/0x5d0
 # [37970.926339]  ? __pfx_schedule_timeout+0x10/0x10
 # [37970.926351]  try_delete_inode_items.isra.0+0x123/0x480 [scoutfs]
 # [37970.926409]  ? submit_send.isra.0+0x5ae/0x940 [scoutfs]
 # [37970.926485]  ? __pfx_try_delete_inode_items.isra.0+0x10/0x10 [scoutfs]
 # [37970.926557]  ? __pfx_scoutfs_net_sync_request+0x10/0x10 [scoutfs]
 # [37970.926632]  ? __wait_for_common+0x3e3/0x5d0
 # [37970.926642]  ? scoutfs_omap_test+0x204/0x320 [scoutfs]
 # [37970.926706]  ? __pfx_scoutfs_omap_test+0x10/0x10 [scoutfs]
 # [37970.926772]  ? scoutfs_client_open_ino_map+0xf4/0x170 [scoutfs]
 # [37970.926830]  ? __pfx_scoutfs_client_open_ino_map+0x10/0x10 [scoutfs]
 # [37970.926893]  inode_orphan_scan_worker+0x4cc/0x720 [scoutfs]
 # [37970.926959]  ? __pfx_inode_orphan_scan_worker+0x10/0x10 [scoutfs]
 # [37970.927021]  ? local_clock_noinstr+0x9/0xc0
 # [37970.927087]  process_one_work+0x89e/0x1600
 # [37970.927106]  ? __lock_acquired+0x207/0x800
 # [37970.927114]  ? __pfx_process_one_work+0x10/0x10
 # [37970.927132]  ? assign_work+0x16c/0x240
 # [37970.927145]  worker_thread+0x586/0xd20
 # [37970.927164]  ? __pfx_worker_thread+0x10/0x10
 # [37970.927174]  kthread+0x2ab/0x360
 # [37970.927179]  ? _raw_spin_unlock_irq+0x24/0x50
 # [37970.927185]  ? __pfx_kthread+0x10/0x10
 # [37970.927194]  ret_from_fork+0x29/0x50
 # [37970.927217]  </TASK>
 # [37970.927225] 
 #                Showing all locks held in the system:
 # [37970.927239] 1 lock held by khungtaskd/146:
 # [37970.927244]  #0: ffffffffb8f37740 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire.constprop.0+0x7/0x30
 # [37970.927282] 2 locks held by kworker/u65:1/1899590:
 # [37970.927287]  #0: ffff8883d2018158 ((wq_completion)scoutfs_quorum_work#4){+.+.}-{0:0}, at: process_one_work+0xefc/0x1600
 # [37970.927310]  #1: ffffc9000253fdc0 ((work_completion)(&qinf->work)){+.+.}-{0:0}, at: process_one_work+0x7ea/0x1600
 # [37970.927331] 2 locks held by kworker/u65:5/1901891:
 # [37970.927335]  #0: ffff888d8ee33958 ((wq_completion)scoutfs_quorum_work#5){+.+.}-{0:0}, at: process_one_work+0xefc/0x1600
 # [37970.927357]  #1: ffffc900091d7dc0 ((work_completion)(&qinf->work)){+.+.}-{0:0}, at: process_one_work+0x7ea/0x1600
 # [37970.927377] 2 locks held by kworker/u64:21/2357628:
 # [37970.927381]  #0: ffff8884e9b2c158 ((wq_completion)scoutfs_trans){+.+.}-{0:0}, at: process_one_work+0xefc/0x1600
 # [37970.927401]  #1: ffffc90022d4fdc0 ((work_completion)(&(&tri->write_work)->work)){+.+.}-{0:0}, at: process_one_work+0x7ea/0x1600
 # [37970.927432] 2 locks held by kworker/14:1/2464557:
 # [37970.927436]  #0: ffff888100063558 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0xefc/0x1600
 # [37970.927456]  #1: ffffc90024977dc0 ((work_completion)(&(&inf->orphan_scan_dwork)->work)){+.+.}-{0:0}, at: process_one_work+0x7ea/0x1600
 # [37970.927476] 2 locks held by kworker/6:0/2593030:
 # [37970.927480]  #0: ffff888100063558 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0xefc/0x1600
 # [37970.927499]  #1: ffffc900246e7dc0 ((work_completion)(&(&inf->orphan_scan_dwork)->work)){+.+.}-{0:0}, at: process_one_work+0x7ea/0x1600
 # [37970.927520] 2 locks held by kworker/u65:0/2659656:
 # [37970.927524]  #0: ffff88862ad5f158 ((wq_completion)scoutfs_quorum_work#6){+.+.}-{0:0}, at: process_one_work+0xefc/0x1600
 # [37970.927546]  #1: ffffc9000246fdc0 ((work_completion)(&qinf->work)){+.+.}-{0:0}, at: process_one_work+0x7ea/0x1600
 # [37970.927566] 2 locks held by kworker/u64:1/2659683:
 # [37970.927570]  #0: ffff888b338f7158 ((wq_completion)scoutfs_net_accepted#5){+.+.}-{0:0}, at: process_one_work+0xefc/0x1600
 # [37970.927592]  #1: ffffc90002c9fdc0 ((work_completion)(&conn->recv_work)){+.+.}-{0:0}, at: process_one_work+0x7ea/0x1600
 # [37970.927612] 2 locks held by kworker/u64:2/2659684:
 # [37970.927615]  #0: ffff888b338f2958 ((wq_completion)scoutfs_net_accepted#3){+.+.}-{0:0}, at: process_one_work+0xefc/0x1600
 # [37970.927637]  #1: ffffc90002ccfdc0 ((work_completion)(&conn->recv_work)){+.+.}-{0:0}, at: process_one_work+0x7ea/0x1600
 # [37970.927657] 2 locks held by kworker/u64:6/2659697:
 # [37970.927661]  #0: ffff8883d201e158 ((wq_completion)scoutfs_net_client#8){+.+.}-{0:0}, at: process_one_work+0xefc/0x1600
 # [37970.927683]  #1: ffffc90002d8fdc0 ((work_completion)(&conn->recv_work)){+.+.}-{0:0}, at: process_one_work+0x7ea/0x1600
 # [37970.927702] 2 locks held by kworker/u64:10/2659714:
 # [37970.927706]  #0: ffff888b338f7958 ((wq_completion)scoutfs_net_accepted#2){+.+.}-{0:0}, at: process_one_work+0xefc/0x1600
 # [37970.927728]  #1: ffffc9000368fdc0 ((work_completion)(&conn->recv_work)){+.+.}-{0:0}, at: process_one_work+0x7ea/0x1600
 # [37970.927747] 5 locks held by kworker/u64:12/2659715:
 # [37970.927752] 2 locks held by kworker/u65:4/2659747:
 # [37970.927756]  #0: ffff8882dc1de158 ((wq_completion)scoutfs_log_merge#4){+.+.}-{0:0}, at: process_one_work+0xefc/0x1600
 # [37970.927777]  #1: ffffc90003937dc0 ((work_completion)(&(&finf->log_merge_dwork)->work)){+.+.}-{0:0}, at: process_one_work+0x7ea/0x1600
 # [37970.927797] 2 locks held by kworker/u64:15/2660004:
 # [37970.927801]  #0: ffff888d8ee32158 ((wq_completion)scoutfs_server){+.+.}-{0:0}, at: process_one_work+0xefc/0x1600
 # [37970.927820]  #1: ffffc90004687dc0 ((work_completion)(&server->work)){+.+.}-{0:0}, at: process_one_work+0x7ea/0x1600
 # [37970.927840] 2 locks held by kworker/u64:19/2660200:
 # [37970.927844]  #0: ffff888df2a90158 ((wq_completion)scoutfs_net_client#7){+.+.}-{0:0}, at: process_one_work+0xefc/0x1600
 # [37970.927866]  #1: ffffc900053f7dc0 ((work_completion)(&conn->recv_work)){+.+.}-{0:0}, at: process_one_work+0x7ea/0x1600
 # [37970.927885] 2 locks held by kworker/u64:22/2660202:
 # [37970.927889]  #0: ffff888d8ee35158 ((wq_completion)scoutfs_net_client#9){+.+.}-{0:0}, at: process_one_work+0xefc/0x1600
 # [37970.927911]  #1: ffffc90005577dc0 ((work_completion)(&conn->recv_work)){+.+.}-{0:0}, at: process_one_work+0x7ea/0x1600
 # [37970.927931] 2 locks held by kworker/u64:24/2660283:
 # [37970.927935]  #0: ffff888b338f5158 ((wq_completion)scoutfs_net_accepted#4){+.+.}-{0:0}, at: process_one_work+0xefc/0x1600
 # [37970.927957]  #1: ffffc90005867dc0 ((work_completion)(&conn->recv_work)){+.+.}-{0:0}, at: process_one_work+0x7ea/0x1600
 # [37970.927976] 2 locks held by kworker/2:2/2661545:
 # [37970.927980]  #0: ffff888100063558 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0xefc/0x1600
 # [37970.927999]  #1: ffffc9000afe7dc0 ((work_completion)(&(&inf->orphan_scan_dwork)->work)){+.+.}-{0:0}, at: process_one_work+0x7ea/0x1600
 # [37970.928019] 2 locks held by kworker/u64:25/2670583:
 # [37970.928023]  #0: ffff88880f551958 ((wq_completion)scoutfs_net_client#6){+.+.}-{0:0}, at: process_one_work+0xefc/0x1600
 # [37970.928044]  #1: ffffc90022587dc0 ((work_completion)(&conn->recv_work)){+.+.}-{0:0}, at: process_one_work+0x7ea/0x1600
 # [37970.928064] 3 locks held by kworker/u64:26/2670584:
 # [37970.928068] 2 locks held by kworker/u64:27/2671915:
 # [37970.928072]  #0: ffff88862ad5e958 ((wq_completion)scoutfs_net_client#10){+.+.}-{0:0}, at: process_one_work+0xefc/0x1600
 # [37970.928094]  #1: ffffc9002617fdc0 ((work_completion)(&conn->recv_work)){+.+.}-{0:0}, at: process_one_work+0x7ea/0x1600
 # [37970.928114] 2 locks held by kworker/0:3/2674586:
 # [37970.928118]  #0: ffff888100063558 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0xefc/0x1600
 # [37970.928138]  #1: ffffc90006e47dc0 ((work_completion)(&(&inf->orphan_scan_dwork)->work)){+.+.}-{0:0}, at: process_one_work+0x7ea/0x1600
 # [37970.928157] 2 locks held by kworker/u64:28/2678228:
 # [37970.928161]  #0: ffff88818dcce958 ((wq_completion)scoutfs_net_server){+.+.}-{0:0}, at: process_one_work+0xefc/0x1600
 # [37970.928180]  #1: ffffc9000c1b7dc0 ((work_completion)(&conn->listen_work)){+.+.}-{0:0}, at: process_one_work+0x7ea/0x1600
 # [37970.928200] 2 locks held by kworker/u64:29/2678229:
 # [37970.928203]  #0: ffff888b338f3958 ((wq_completion)scoutfs_net_accepted){+.+.}-{0:0}, at: process_one_work+0xefc/0x1600
 # [37970.928223]  #1: ffffc9002698fdc0 ((work_completion)(&conn->recv_work)){+.+.}-{0:0}, at: process_one_work+0x7ea/0x1600
 # [37970.928243] 1 lock held by rm/2687643:
 # [37970.928247]  #0: ffff88837c25e480 (sb_writers#17){.+.+}-{0:0}, at: do_unlinkat+0x14e/0x550
 # 
 # [37970.928274] =============================================
 # 

@chriskirby1
Contributor Author

I'm assuming this is still a debug kernel. When I looked at how many extents we were removing during each pass, the number fluctuated wildly on a debug kernel. It was often taking 10+ seconds just to see the trans_seq change, sometimes more than 20s. And the worker thread wasn't doing a huge number of extent removals during that time; things just seemed to get very slow for a while. Sometimes we would remove only 16 extents during that interval; other times, tens of thousands.

On normal el95, it was a lot more consistent. I see the trans_seq moving forward pretty much once a second (tracing is only reporting at 1s granularity). To me that seems like we're just seeing the effects of the debug kernel, but if you think it's more than that, I'll investigate.

@zabbo
Collaborator

zabbo commented Jun 12, 2025

I'm concerned that we still have not solved the actual issue here.

There can be more than one issue that's causing hung tasks. This can be a fine solution to its part while leaving more work to do. Tracing would show what lock they were waiting on and what's happening with it.

I'd guess that the server commits are insanely slow, and they're in the path of forward progress on locks. Clients won't send invalidation responses to release write locks until they commit their dirty transaction.

The entire point of the test is to create dense client-side allocation btrees full of fragmented extent items that then cause a block of IO on the server per item in the client's trees. A few hundred items per block is a crazy iop amplification factor for a commit that can have thousands of blocks.

I'd try an experiment that limits the amount of work that try_drain_data_freed() will do before giving up and letting its caller send the commit response.
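
The suggested experiment could look something like this userspace sketch. The names, the budget, and the backlog counter are all invented; this is not the real try_drain_data_freed() interface:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of the budgeted-drain experiment: instead of draining the whole
 * freed-item backlog before letting the caller send the commit response,
 * stop after a fixed budget and report whether work remains.
 */
#define DRAIN_BUDGET 128

static int items_pending;	/* stand-in for the freed-item backlog */

static bool drain_one_item(void)
{
	if (items_pending == 0)
		return false;
	items_pending--;
	return true;
}

/* drain at most DRAIN_BUDGET items; returns true if more remain */
static bool try_drain_budgeted(void)
{
	int done = 0;

	while (done < DRAIN_BUDGET && drain_one_item())
		done++;

	return items_pending > 0;
}
```

The caller would presumably send its commit response when this returns, with a later commit picking up the remainder of the backlog.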

@aversecat
Contributor

aversecat commented Jun 12, 2025 via email

zabbo previously requested changes Jun 12, 2025
Collaborator

@zabbo left a comment

Hopefully the orphan inode group binning will mean we don't have to return to this, but if we do, I wanted to get my thoughts down while they were fresh.

@zabbo self-requested a review on June 12, 2025 21:59
Collaborator

@zabbo left a comment

OK, this time I think I actually managed to click all the right buttons to provide review comments! As we've discussed, though, hopefully the orphan scanner inode sharding will mean we don't need to bother with this. We'll see.

 int scoutfs_data_truncate_items(struct super_block *sb, struct inode *inode,
-                                u64 ino, u64 iblock, u64 last, bool offline,
+                                u64 ino, u64 *iblock, u64 last, bool offline,
                                 struct scoutfs_lock *lock)
Collaborator

I think this function has gotten to the point where it's no longer helpfully shared code with some differences, but actually two different things interleaved with conditionals. Instead of also letting it make partial progress, let's split it into two functions: this one, which truncates live inodes and requires the inode pointer, and a new _data_free_items that is specifically built to remove a transaction's worth of extents and then return. It wouldn't need any of the arguments -- it'd always start at 0 and remove whatever it finds.
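
A toy model of the proposed split, with invented names and an invented "commit every 3 removals" cadence (none of this is the real scoutfs implementation): the helper takes no range arguments, frees from the start until the transaction sequence advances, and tells the caller whether anything remains.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Toy model of the proposed _data_free_items(): no iblock/last/offline
 * arguments, always start from block 0, remove one transaction's worth
 * of extents, and report whether extents remain.
 */
struct toy_fs {
	uint64_t trans_seq;
	int extents;		/* extents remaining on the dead inode */
};

/* remove the next extent; commit (advancing trans_seq) every 3 removals */
static void remove_next_extent(struct toy_fs *fs)
{
	fs->extents--;
	if ((fs->extents % 3) == 0)
		fs->trans_seq++;
}

/* remove extents until the transaction seq advances; true if more remain */
static bool data_free_items(struct toy_fs *fs)
{
	uint64_t seq = fs->trans_seq;

	while (fs->extents > 0 && fs->trans_seq == seq)
		remove_next_extent(fs);

	return fs->extents > 0;
}
```

The evict/orphan path would presumably just call this from its existing retry machinery until it returns false.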

                                 scoutfs_ino(inode),
-                                sblock, U64_MAX,
+                                &sblock, U64_MAX,
                                 false, lock);
Collaborator

These changes make me nervous. This used to obviously always truncate (s,e) and (s, ~0). Now it probably truncates from (s, e) and (e, ~0), but we'd have to go diving into the implementation to make sure there isn't some accidental final increment before returning that means it actually truncates (e + 1, ~0), or whatever. Let's avoid this pattern as much as possible. (It's why truncate_extent_items() returns the block number instead of updating an arg. This lesson was learned over and over in the kernel with the *ppos arguments to write functions that sometimes updated write pointers, or not, and was a nightmare.)
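
A small illustration of the return-the-position pattern preferred here, with made-up names: the function reports the first block it did not process instead of updating a caller-supplied pointer, so the covered range is unambiguous even on partial progress.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical example: process blocks in [start, last] up to a budget
 * and return the first block NOT processed.  The caller composes ranges
 * from the return value, so there is never a question of whether an
 * output argument was or wasn't updated on some path.
 */
static uint64_t process_range(uint64_t start, uint64_t last, uint64_t budget)
{
	uint64_t blk = start;

	while (blk <= last && budget > 0) {
		/* "process" block blk here */
		blk++;
		budget--;
	}
	return blk;
}
```

A return value also can't be half-applied: either the caller uses it to continue, or it ignores it, but its own variables are never silently mutated.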

 	scoutfs_unlock(sb, lock, SCOUTFS_LOCK_WRITE);
 	scoutfs_unlock(sb, orph_lock, SCOUTFS_LOCK_WRITE_ONLY);
 
+	if (more)
Collaborator

Adding this loop without clearing more at the start makes me nervous. It certainly looks like it'll loop forever if any of the errors before more is cleared are returned (perhaps from shutdown or forced unmount returning blanket errors from net/io paths). I'd rather not bother with this additional loop and would instead make sure to do enough work per iteration that we don't mind returning to the caller and letting them use their existing retry mechanisms.
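
A userspace sketch of the hazard, with invented names (not the scoutfs code): if more isn't reset at the top of each pass, an error returned before it's cleared leaves the stale value in place and the retry loop never terminates. Resetting it per pass (or dropping the loop entirely, as suggested) avoids that.

```c
#include <assert.h>
#include <stdbool.h>

/* test hook: make the first N passes fail before *more is written */
static int fail_first_calls;

static int do_one_pass(bool *more)
{
	if (fail_first_calls > 0) {
		fail_first_calls--;
		return -5;	/* error before *more is ever assigned */
	}
	*more = false;		/* nothing left to do */
	return 0;
}

/*
 * Hypothetical caller: clearing "more" at the top of each pass means an
 * early error breaks out instead of spinning on a stale "more = true".
 */
static int free_with_retry(void)
{
	bool more;
	int ret;

	do {
		more = false;	/* reset every pass; errors can't loop */
		ret = do_one_pass(&more);
		if (ret < 0)
			break;	/* let the caller's retry machinery decide */
	} while (more);

	return ret;
}
```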

