@bhtek commented Nov 19, 2025

This commit fixes a critical deadlock issue where client connections
would hang indefinitely during database switchover operations involving
PAUSE, RELOAD, and RESUME commands in transaction pooling mode.

## Problem

When performing a hot database switchover (PAUSE → RELOAD → RESUME),
client applications would become permanently stuck when trying to
establish new connections, even after RESUME completed. This made
PgCat unsuitable for zero-downtime database migration scenarios.

The issue had two root causes:

1. **Pool Reference Deadlock**: When RELOAD creates a new pool object,
   clients waiting on the old pool's `paused_waiter` were never woken
   up when RESUME was called on the new pool. This caused a permanent
   deadlock (see the sketch after this list):
   - Client holds reference to OLD pool (pre-RELOAD)
   - Client blocks on `OLD_pool.wait_paused().await`
   - RELOAD creates NEW pool with different `paused_waiter`
   - RESUME calls `NEW_pool.resume()`
   - Client still waiting on OLD pool → deadlock!

2. **Unvalidated Pools After RELOAD**: New pools created while the pooler
   was paused were not validated before use, which could cause
   authentication to block if validation was triggered during a client
   connection.
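To make the failure mode concrete, here is a minimal, self-contained sketch (not PgCat's actual code) of the pattern in point 1, assuming `paused_waiter` behaves like a `tokio::sync::Notify` that `resume()` signals with `notify_waiters()`:

```rust
// Sketch only; requires tokio = { version = "1", features = ["full"] }.
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Notify;

// Hypothetical stand-in for the pool: only the PAUSE/RESUME pieces are modeled.
struct Pool {
    paused_waiter: Notify,
}

impl Pool {
    async fn wait_paused(&self) {
        // A client parks here until someone resumes *this* pool object.
        self.paused_waiter.notified().await;
    }

    fn resume(&self) {
        self.paused_waiter.notify_waiters();
    }
}

#[tokio::main]
async fn main() {
    // Pool that existed before RELOAD; the pooler is currently PAUSEd.
    let old_pool = Arc::new(Pool { paused_waiter: Notify::new() });

    // A connecting client captured its pool reference pre-RELOAD.
    let client_pool = Arc::clone(&old_pool);
    let client = tokio::spawn(async move {
        client_pool.wait_paused().await; // blocks on the OLD pool
    });

    // RELOAD builds a brand-new pool object with its own waiter.
    let new_pool = Arc::new(Pool { paused_waiter: Notify::new() });

    // RESUME wakes waiters on the NEW pool only; nobody is waiting there.
    new_pool.resume();

    // The client never wakes up: without the fix, this wait times out.
    let timed_out = tokio::time::timeout(Duration::from_secs(1), client)
        .await
        .is_err();
    assert!(timed_out, "client is still parked on the old pool's waiter");
}
```

The only object that can wake the parked client is the old pool, and nothing touches the old pool's waiter once RELOAD has replaced it.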

## Solution

This fix implements a two-part solution:

### Part 1: Make resume() async and validate pools (pool.rs)
- Changed `resume()` from a sync to an async function (sketched below)
- Added pool validation before resuming if pool is unvalidated
- Ensures pools are ready for use before accepting client connections
- Updated all call sites in admin.rs to await the async function
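A minimal sketch of the new shape of `resume()`; field and method names such as `validated`/`validate()` are illustrative, not PgCat's exact internals:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use tokio::sync::Notify;

// Minimal stand-in for the pool; `validated` here stands for
// "has this pool been checked against the backend yet?".
struct ConnectionPool {
    paused_waiter: Notify,
    validated: AtomicBool,
}

impl ConnectionPool {
    /// Now async: an unvalidated pool (e.g. one built by RELOAD while
    /// paused) is validated *before* any waiting client is woken up.
    async fn resume(&self) {
        if !self.validated.load(Ordering::Acquire) {
            self.validate().await;
        }
        self.paused_waiter.notify_waiters();
    }

    async fn validate(&self) {
        // Real code would check out a server connection and run a probe;
        // here we only flip the flag.
        self.validated.store(true, Ordering::Release);
    }
}

#[tokio::main]
async fn main() {
    let pool = ConnectionPool {
        paused_waiter: Notify::new(),
        validated: AtomicBool::new(false),
    };
    // Call sites (admin.rs in the real change) now have to await resume.
    pool.resume().await;
    assert!(pool.validated.load(Ordering::Acquire));
}
```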

### Part 2: Resume old pools before RELOAD (config.rs)
- Before creating new pools during RELOAD, explicitly resume all
  paused old pools to wake up waiting clients (see the sketch after
  this list)
- This allows clients to:
  1. Wake up from old pool's `wait_paused()`
  2. Reach the `pool = self.get_pool()` refresh line
  3. Obtain reference to NEW pool
  4. Continue normal operation with new pool
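The following self-contained sketch (again with illustrative names, not PgCat's actual registry types) shows why resuming the old pool before publishing the new one lets a parked client wake up, re-read the registry, and continue on the new pool:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::{Notify, RwLock};

// Illustrative stand-ins for the pool and the pool registry.
struct Pool {
    paused: AtomicBool,
    paused_waiter: Notify,
}

impl Pool {
    fn new(paused: bool) -> Arc<Self> {
        Arc::new(Self {
            paused: AtomicBool::new(paused),
            paused_waiter: Notify::new(),
        })
    }
    fn is_paused(&self) -> bool {
        self.paused.load(Ordering::Acquire)
    }
    async fn wait_paused(&self) {
        self.paused_waiter.notified().await;
    }
    async fn resume(&self) {
        self.paused.store(false, Ordering::Release);
        self.paused_waiter.notify_waiters();
    }
}

// The "registry" a client consults each time it (re)fetches a pool.
type Registry = Arc<RwLock<Arc<Pool>>>;

// RELOAD path: resume the paused old pool *first*, then publish the new one,
// so a client parked on the old waiter wakes up and re-reads the registry.
async fn reload(registry: &Registry) {
    let old_pool = Arc::clone(&*registry.read().await);
    if old_pool.is_paused() {
        old_pool.resume().await;
    }
    *registry.write().await = Pool::new(false);
}

// Client path: after waking from `wait_paused()`, re-fetch the pool so the
// reference points at the new pool rather than the pre-RELOAD one.
async fn checkout(registry: &Registry) -> Arc<Pool> {
    loop {
        let pool = Arc::clone(&*registry.read().await);
        if !pool.is_paused() {
            return pool;
        }
        pool.wait_paused().await;
    }
}

#[tokio::main]
async fn main() {
    let registry: Registry = Arc::new(RwLock::new(Pool::new(true))); // PAUSEd

    let reg = Arc::clone(&registry);
    let client = tokio::spawn(async move { checkout(&reg).await });

    tokio::time::sleep(Duration::from_millis(50)).await; // let the client park
    reload(&registry).await; // wakes the client, then swaps in the new pool

    let pool = client.await.unwrap();
    assert!(!pool.is_paused(), "client continues on a usable pool");
}
```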

## Testing

**Unit Tests**: All 38 unit tests + 4 doc tests pass
- ✓ `cargo test` - all tests passing
- ✓ `cargo fmt` - code properly formatted
- ✓ `cargo clippy` - no warnings

**Integration Tests**: Real-world database switchover scenario
- ✓ PAUSE during active write workload (transaction pooling)
- ✓ RELOAD with backend database change (postgres1 → postgres2)
- ✓ RESUME with immediate write recovery
- ✓ No connection hangs
- ✓ No data loss or sequence gaps
- ✓ 47 successful writes within 5 seconds post-RESUME

Test results from production-like switchover scenario:
```
Recent writes (last 5s):
  postgres1: 0
  postgres2: 47
✓ postgres2 is receiving writes, postgres1 is not - SWITCHOVER SUCCESSFUL!
✓ Writer recovered successfully (iteration 200 → 300)
✓ No gaps in iteration sequence
```

## Files Changed

- `src/pool.rs`: Made `resume()` async, added validation logic
- `src/admin.rs`: Updated `resume()` call sites to await
- `src/config.rs`: Added old pool resume before RELOAD, imported `get_all_pools`
- `tests/ruby/pause_new_connections_spec.rb`: Added comprehensive test

## Impact

This fix enables:
- Zero-downtime database migrations using PgCat
- Safe hot switchover between primary/replica or different databases
- Reliable PAUSE/RELOAD/RESUME workflow in production environments
- Compatibility with transaction pooling mode during switchovers

## Related Issues

This addresses the issue described in FIX.md regarding broken
PAUSE/RESUME support where new client connections would hang
indefinitely during PAUSE operations.

🤖 Co-Authored-By: Claude <noreply@anthropic.com>