@bhtek commented Nov 19, 2025

This commit fixes a critical deadlock issue where client connections
would hang indefinitely during database switchover operations involving
PAUSE, RELOAD, and RESUME commands in transaction pooling mode.

## Problem

When performing a hot database switchover (PAUSE → RELOAD → RESUME),
client applications would become permanently stuck when trying to
establish new connections, even after RESUME completed. This made
PgCat unsuitable for zero-downtime database migration scenarios.

The issue had two root causes:

1. **Pool Reference Deadlock**: When RELOAD creates a new pool object,
   clients waiting on the old pool's `paused_waiter` were never woken
   up when RESUME was called on the new pool. This caused a permanent
   deadlock (see the sketch after this list):
   - Client holds reference to OLD pool (pre-RELOAD)
   - Client blocks on `OLD_pool.wait_paused().await`
   - RELOAD creates NEW pool with different `paused_waiter`
   - RESUME calls `NEW_pool.resume()`
   - Client still waiting on OLD pool → deadlock!

2. **Unvalidated Pools After RELOAD**: New pools created while the pooler
   was paused were not validated before use, which could cause
   authentication to block if validation was triggered during a client
   connection.
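To make the failure mode concrete, here is a minimal, self-contained sketch (not PgCat's actual code) of the pattern in point 1, assuming `paused_waiter` behaves like a `tokio::sync::Notify` that `resume()` signals with `notify_waiters()`:

```rust
// Sketch only; requires tokio = { version = "1", features = ["full"] }.
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Notify;

// Hypothetical stand-in for the pool: only the PAUSE/RESUME pieces are modeled.
struct Pool {
    paused_waiter: Notify,
}

impl Pool {
    async fn wait_paused(&self) {
        // A client parks here until someone resumes *this* pool object.
        self.paused_waiter.notified().await;
    }

    fn resume(&self) {
        self.paused_waiter.notify_waiters();
    }
}

#[tokio::main]
async fn main() {
    // Pool that existed before RELOAD; the pooler is currently PAUSEd.
    let old_pool = Arc::new(Pool { paused_waiter: Notify::new() });

    // A connecting client captured its pool reference pre-RELOAD.
    let client_pool = Arc::clone(&old_pool);
    let client = tokio::spawn(async move {
        client_pool.wait_paused().await; // blocks on the OLD pool
    });

    // RELOAD builds a brand-new pool object with its own waiter.
    let new_pool = Arc::new(Pool { paused_waiter: Notify::new() });

    // RESUME wakes waiters on the NEW pool only; nobody is waiting there.
    new_pool.resume();

    // The client never wakes up: without the fix, this wait times out.
    let timed_out = tokio::time::timeout(Duration::from_secs(1), client)
        .await
        .is_err();
    assert!(timed_out, "client is still parked on the old pool's waiter");
}
```

The only object that can wake the parked client is the old pool, and nothing touches the old pool's waiter once RELOAD has replaced it.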

## Solution

This fix implements a two-part solution:

### Part 1: Make resume() async and validate pools (pool.rs)
- Changed `resume()` from a sync to an async function (sketched below)
- Added pool validation before resuming if pool is unvalidated
- Ensures pools are ready for use before accepting client connections
- Updated all call sites in admin.rs to await the async function
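A minimal sketch of the new shape of `resume()`; field and method names such as `validated`/`validate()` are illustrative, not PgCat's exact internals:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use tokio::sync::Notify;

// Minimal stand-in for the pool; `validated` here stands for
// "has this pool been checked against the backend yet?".
struct ConnectionPool {
    paused_waiter: Notify,
    validated: AtomicBool,
}

impl ConnectionPool {
    /// Now async: an unvalidated pool (e.g. one built by RELOAD while
    /// paused) is validated *before* any waiting client is woken up.
    async fn resume(&self) {
        if !self.validated.load(Ordering::Acquire) {
            self.validate().await;
        }
        self.paused_waiter.notify_waiters();
    }

    async fn validate(&self) {
        // Real code would check out a server connection and run a probe;
        // here we only flip the flag.
        self.validated.store(true, Ordering::Release);
    }
}

#[tokio::main]
async fn main() {
    let pool = ConnectionPool {
        paused_waiter: Notify::new(),
        validated: AtomicBool::new(false),
    };
    // Call sites (admin.rs in the real change) now have to await resume.
    pool.resume().await;
    assert!(pool.validated.load(Ordering::Acquire));
}
```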

### Part 2: Resume old pools before RELOAD (config.rs)
- Before creating new pools during RELOAD, explicitly resume all
  paused old pools to wake up waiting clients (see the sketch after
  this list)
- This allows clients to:
  1. Wake up from old pool's `wait_paused()`
  2. Reach the `pool = self.get_pool()` refresh line
  3. Obtain reference to NEW pool
  4. Continue normal operation with new pool
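The following self-contained sketch (again with illustrative names, not PgCat's actual registry types) shows why resuming the old pool before publishing the new one lets a parked client wake up, re-read the registry, and continue on the new pool:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::{Notify, RwLock};

// Illustrative stand-ins for the pool and the pool registry.
struct Pool {
    paused: AtomicBool,
    paused_waiter: Notify,
}

impl Pool {
    fn new(paused: bool) -> Arc<Self> {
        Arc::new(Self {
            paused: AtomicBool::new(paused),
            paused_waiter: Notify::new(),
        })
    }
    fn is_paused(&self) -> bool {
        self.paused.load(Ordering::Acquire)
    }
    async fn wait_paused(&self) {
        self.paused_waiter.notified().await;
    }
    async fn resume(&self) {
        self.paused.store(false, Ordering::Release);
        self.paused_waiter.notify_waiters();
    }
}

// The "registry" a client consults each time it (re)fetches a pool.
type Registry = Arc<RwLock<Arc<Pool>>>;

// RELOAD path: resume the paused old pool *first*, then publish the new one,
// so a client parked on the old waiter wakes up and re-reads the registry.
async fn reload(registry: &Registry) {
    let old_pool = Arc::clone(&*registry.read().await);
    if old_pool.is_paused() {
        old_pool.resume().await;
    }
    *registry.write().await = Pool::new(false);
}

// Client path: after waking from `wait_paused()`, re-fetch the pool so the
// reference points at the new pool rather than the pre-RELOAD one.
async fn checkout(registry: &Registry) -> Arc<Pool> {
    loop {
        let pool = Arc::clone(&*registry.read().await);
        if !pool.is_paused() {
            return pool;
        }
        pool.wait_paused().await;
    }
}

#[tokio::main]
async fn main() {
    let registry: Registry = Arc::new(RwLock::new(Pool::new(true))); // PAUSEd

    let reg = Arc::clone(&registry);
    let client = tokio::spawn(async move { checkout(&reg).await });

    tokio::time::sleep(Duration::from_millis(50)).await; // let the client park
    reload(&registry).await; // wakes the client, then swaps in the new pool

    let pool = client.await.unwrap();
    assert!(!pool.is_paused(), "client continues on a usable pool");
}
```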

## Testing

**Unit Tests**: All 38 unit tests + 4 doc tests pass
- ✓ `cargo test` - all tests passing
- ✓ `cargo fmt` - code properly formatted
- ✓ `cargo clippy` - no warnings

**Integration Tests**: Real-world database switchover scenario
- ✓ PAUSE during active write workload (transaction pooling)
- ✓ RELOAD with backend database change (postgres1 → postgres2)
- ✓ RESUME with immediate write recovery
- ✓ No connection hangs
- ✓ No data loss or sequence gaps
- ✓ 47 successful writes within 5 seconds post-RESUME

Test results from production-like switchover scenario:
```
Recent writes (last 5s):
  postgres1: 0
  postgres2: 47
✓ postgres2 is receiving writes, postgres1 is not - SWITCHOVER SUCCESSFUL!
✓ Writer recovered successfully (iteration 200 → 300)
✓ No gaps in iteration sequence
```

## Files Changed

- `src/pool.rs`: Made `resume()` async, added validation logic
- `src/admin.rs`: Updated `resume()` call sites to await
- `src/config.rs`: Added old pool resume before RELOAD, imported `get_all_pools`
- `tests/ruby/pause_new_connections_spec.rb`: Added comprehensive test

## Impact

This fix enables:
- Zero-downtime database migrations using PgCat
- Safe hot switchover between primary/replica or different databases
- Reliable PAUSE/RELOAD/RESUME workflow in production environments
- Compatibility with transaction pooling mode during switchovers

## Related Issues

This addresses the issue described in FIX.md regarding broken
PAUSE/RESUME support where new client connections would hang
indefinitely during PAUSE operations.

🤖 Co-Authored-By: Claude <noreply@anthropic.com>