patronictl reinit does not work #1574

@cbartz

Description

Steps to reproduce

  1. Deploy charmed-postgresql with 3 units (1 leader, 2 replicas)
  2. Allow a replica to fall behind such that a reinit is needed (e.g. due to a large timeline gap)
  3. Run patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml reinit postgresql <member>
  4. Monitor /var/snap/charmed-postgresql/common/var/log/patroni/patroni.log

Expected behavior

If reinit fails, the log should clearly report the failure reason so the operator can act on it.
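To make the expectation concrete, here is a minimal sketch of how an OSError could be turned into a single operator-facing log line. The helper name `explain_reinit_failure` and the message wording are hypothetical and are not part of Patroni; the sketch only assumes the exception carries the errno and paths seen in the log output of this report.

```python
import errno


def explain_reinit_failure(exc: OSError) -> str:
    """Build one readable line from an OSError (hypothetical helper, not Patroni code)."""
    # Symbolic errno name, e.g. 16 -> "EBUSY".
    name = errno.errorcode.get(exc.errno, "UNKNOWN")
    # Rename-style errors carry two paths; show both when present.
    if exc.filename2 is None:
        target = str(exc.filename)
    else:
        target = f"{exc.filename} -> {exc.filename2}"
    hint = ""
    if exc.errno == errno.EBUSY:
        # Assumed hint text; EBUSY on rename commonly means the path is a mount point.
        hint = " (path may be a mount point; it cannot be renamed while mounted)"
    return f"reinit failed: {name} on {target}{hint}"
```

A line of this shape in the current patroni.log would point the operator at the data directory layout instead of the network.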

Actual behavior

Patroni enters a silent retry loop. The only output visible in the current log file is:

WARNING: Retry got exception: connection problems
WARNING: Failed to determine PostgreSQL state from the connection, falling back to cached role
INFO: restarting after failure in progress

Possible root cause

A possible root cause (OSError: [Errno 16] Device or resource busy on the pg_wal bind mount) appears only in rotated patroni.log.N files. The current log contains no clear failure message and no actionable guidance, and the misleading "connection problems" warning sends operators investigating networking rather than the reinit failure itself.

The OSError occurs because /var/snap/charmed-postgresql/common/data/logs is a bind mount used as the pg_wal directory. During reinit, Patroni removes the old data directory (shutil.rmtree()) and, as the logged error shows, tries to rename this path to *.failed. The kernel rejects the rename with EBUSY because the path is itself a mount point. This is snap-specific, but the core problem is that the error is never surfaced in a visible log line.
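One way to confirm that the pg_wal target is a mount point is to look it up in /proc/self/mountinfo. The sketch below is illustrative only: the function names are hypothetical, and it parses the mount table rather than using os.path.ismount() because the Python documentation notes that os.path.ismount() cannot reliably detect bind mounts on the same filesystem.

```python
import os
import re


def mount_points(mountinfo_text: str) -> set[str]:
    """Return the mount points listed in /proc/<pid>/mountinfo-formatted text."""
    points = set()
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # not a valid mountinfo line
        # Field 5 is the mount point; the kernel escapes spaces etc. as octal (\040).
        raw = fields[4]
        points.add(re.sub(r"\\([0-7]{3})", lambda m: chr(int(m.group(1), 8)), raw))
    return points


def is_mount_point(path: str, mountinfo_text: str) -> bool:
    """True if path (normalized) appears as a mount point in the given mount table."""
    return os.path.normpath(path) in mount_points(mountinfo_text)


def read_mountinfo() -> str:
    """Read the live mount table of the current process (Linux only)."""
    with open("/proc/self/mountinfo") as fh:
        return fh.read()
```

On an affected unit, `is_mount_point("/var/snap/charmed-postgresql/common/data/logs", read_mountinfo())` would be expected to return True if the path is bind-mounted as described above.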

Versions

Operating system: Ubuntu 22.04.5 LTS

Juju CLI: 3.6.14

Juju agent: 3.6.14

Charm revision: postgresql 16/stable rev 952

LXD: N/A

Log output

# What the operator sees in the current patroni.log:
WARNING: Retry got exception: connection problems
INFO: restarting after failure in progress   <-- loops indefinitely

# Actual cause, buried in rotated patroni.log.N files:
OSError: [Errno 16] Device or resource busy: '/var/snap/charmed-postgresql/common/data/logs' -> '/var/snap/charmed-postgresql/common/data/logs.failed'
ERROR: Error when fetching backup: pg_basebackup exited with code=1
ERROR: failed to bootstrap from leader 'postgresql-4'
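Because the relevant lines end up in rotated files, a small script can search the current log and every patroni.log.N in one pass. This is a rough sketch under two assumptions: the rotated files are uncompressed plain text, and lines of interest contain either "OSError" or "ERROR". The function name `find_errors` is hypothetical.

```python
import glob
import os
import re

# Lines of interest: Python OSError text or ERROR-level log entries (assumed markers).
PATTERN = re.compile(r"\bOSError\b|\bERROR\b")


def find_errors(log_dir: str, base: str = "patroni.log") -> list[tuple[str, int, str]]:
    """Return (path, line number, line) for matching lines in base and base.N files."""
    hits = []
    # glob.escape guards against special characters in the directory name.
    paths = sorted(glob.glob(os.path.join(glob.escape(log_dir), base + "*")))
    for path in paths:
        with open(path, errors="replace") as fh:
            for lineno, line in enumerate(fh, 1):
                if PATTERN.search(line):
                    hits.append((path, lineno, line.rstrip("\n")))
    return hits
```

For the setup in this report, the directory to pass would be /var/snap/charmed-postgresql/common/var/log/patroni.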

Additional context

Workaround:

  1. Manually stop Patroni.
  2. Clear the data directory.
  3. Run pg_basebackup directly.
  4. Recreate the pg_wal symlink (ln -s /var/snap/charmed-postgresql/common/data/logs /var/snap/charmed-postgresql/common/var/lib/postgresql/pg_wal).
  5. Fix ownership of the tablespace target directory (chown -R _daemon_:_daemon_ /var/snap/charmed-postgresql/common/data/temp).
  6. Restart Patroni.
