fix: prevent deadlock when shutdown is called concurrently #89
Merged
Two fixes for a deadlock where `destroy()` and the signal handler both call `FlushManager::shutdown()` concurrently:

1. Make `shutdown()` single-shot with `std::sync::Once`. The first caller runs the full flush; concurrent callers block until it completes, then return without re-running it.
2. In the signal handler, return instead of calling `process::exit(1)` when `unmount_fuse()` fails, since a failed unmount means `destroy()` is already handling the flush via the normal shutdown path.

Root cause: the CSI driver performs `fuseUnmount` (triggering destroy/flush) and then `Delete(pod)` (sending SIGTERM) nearly simultaneously. The signal handler's `shutdown()` call deadlocked on the handle mutex held by `destroy()`'s in-flight flush (the `MutexGuard` stayed alive across `block_on` inside the `if let` body). With `Once`, the second caller simply waits for the first to finish.
Spawns two threads calling `VirtualFs::shutdown()` simultaneously (simulating the `destroy()` + signal-handler race) and asserts both complete within 10s. Without the `Once` fix, this would deadlock on the `FlushManager` handle mutex.
Summary
Fix a deadlock where `destroy()` and the signal handler both call `FlushManager::shutdown()` concurrently, causing the process to hang forever and dirty files to never be uploaded.

Root cause
When a CSI-managed pod terminates, the CSI driver does `fuseUnmount(source)` (triggering `destroy()` -> flush) and then `Delete(pod)` (SIGTERM) nearly simultaneously. The signal handler calls `shutdown()`, which deadlocks on `self.handle.lock()` because `destroy()`'s `shutdown()` holds the `MutexGuard` across `runtime.block_on(handle)` (the guard stays alive inside the `if let` body).

Production logs show 53 ms between `destroy` starting the flush and SIGTERM arriving. The process hung indefinitely with 2 dirty files never uploaded.
Fix
- Make `shutdown()` single-shot with `std::sync::Once`: the first caller runs the flush; concurrent callers block until it completes, then return without re-running it.
- Take the handle out of the mutex and drop the guard before blocking: `let handle = lock.take(); drop(lock); block_on(handle)` instead of `if let Some(h) = lock.take() { block_on(h) }`, which holds the guard across the wait.
- In the signal handler, return instead of calling `process::exit(1)` when unmount fails, letting the normal shutdown path complete.

Tests
All 215 unit tests pass. The deadlock was timing-dependent (it requires SIGTERM arriving during an in-flight flush) and was observed in CI but not reproducible on EC2.