Fix pool reference deadlock during PAUSE/RELOAD/RESUME #941
This commit fixes a critical deadlock issue where client connections would hang indefinitely during database switchover operations involving PAUSE, RELOAD, and RESUME commands in transaction pooling mode.
Problem
When performing a hot database switchover (PAUSE → RELOAD → RESUME), client applications would become permanently stuck when trying to establish new connections, even after RESUME completed. This made PgCat unsuitable for zero-downtime database migration scenarios.
The issue had two root causes:
Pool Reference Deadlock: When RELOAD creates a new pool object, clients waiting on the old pool's paused_waiter were never woken up when RESUME was called on the new pool. This caused a permanent deadlock because:
- Clients block in OLD_pool.wait_paused().await.
- RELOAD creates a new pool with its own paused_waiter.
- RESUME calls NEW_pool.resume(), which never notifies waiters on the old pool.
Unvalidated Pools After RELOAD: New pools created during PAUSE were not validated before use, potentially causing authentication to block if validation was triggered during client connection.
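The pool reference deadlock can be reproduced in a small standalone model. The sketch below is not PgCat code: PauseGate and client_woken_after_reload are hypothetical names, and it uses std threads with a Condvar and a timeout in place of the async paused_waiter, so that a stuck client shows up as a timed-out wait rather than a hang. Passing resume_old_first = true models the approach named in Part 2 (resume the old pool before it is replaced).

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for a pool's pause state (not PgCat's actual type).
struct PauseGate {
    paused: Mutex<bool>,
    waiter: Condvar,
}

impl PauseGate {
    fn new(paused: bool) -> Arc<Self> {
        Arc::new(Self {
            paused: Mutex::new(paused),
            waiter: Condvar::new(),
        })
    }

    /// Blocks while the gate is paused, up to `timeout`.
    /// Returns true if the gate was resumed, false if the wait timed out.
    fn wait_paused(&self, timeout: Duration) -> bool {
        let guard = self.paused.lock().unwrap();
        let (guard, _timed_out) = self
            .waiter
            .wait_timeout_while(guard, timeout, |paused| *paused)
            .unwrap();
        !*guard
    }

    /// Clears the paused flag and wakes every waiter on THIS gate only.
    fn resume(&self) {
        *self.paused.lock().unwrap() = false;
        self.waiter.notify_all();
    }
}

/// Simulates PAUSE -> RELOAD -> RESUME with one client parked on the old gate.
/// Returns whether that client was woken before its timeout.
fn client_woken_after_reload(resume_old_first: bool) -> bool {
    // PAUSE: the old pool is paused and a client starts waiting on it.
    let old = PauseGate::new(true);
    let client_gate = Arc::clone(&old);
    let client = thread::spawn(move || client_gate.wait_paused(Duration::from_millis(200)));

    // RELOAD: optionally release waiters on the old gate before replacing it.
    if resume_old_first {
        old.resume();
    }
    let new = PauseGate::new(true);

    // RESUME: only the new gate is notified.
    new.resume();

    client.join().unwrap()
}

fn main() {
    println!("without old-pool resume: woken = {}", client_woken_after_reload(false));
    println!("with old-pool resume:    woken = {}", client_woken_after_reload(true));
}
```

In the model, the client's wake-up depends only on the gate it holds a reference to, which is why resuming the replacement gate alone leaves it parked.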
Solution
This fix implements a two-part solution:
Part 1: Make resume() async and validate pools (pool.rs)
- resume() from sync to async function
Part 2: Resume old pools before RELOAD (config.rs)
- wait_paused()
- pool = self.get_pool() refresh line
Testing
Unit Tests: All 38 unit tests + 4 doc tests pass
- cargo test: all tests passing
- cargo fmt: code properly formatted
- cargo clippy: no warnings
Integration Tests: Real-world database switchover scenario
Test results from production-like switchover scenario:
Files Changed
- src/pool.rs: Made resume() async, added validation logic
- src/admin.rs: Updated resume() call sites to await
- src/config.rs: Added old pool resume before RELOAD, imported get_all_pools
- tests/ruby/pause_new_connections_spec.rb: Added comprehensive test
Impact
This fix enables:
Related Issues
This addresses the issue described in FIX.md regarding broken PAUSE/RESUME support where new client connections would hang indefinitely during PAUSE operations.
🤖 Co-Authored-By: Claude noreply@anthropic.com