fix(cdc): prevent frontier from advancing past uncommitted change-buffer rows (issue #536) by BaardBouvet · Pull Request #593 · grove/pg-trickle

BaardBouvet · 2026-04-19T19:17:19Z

Summary

Adds a design plan for fixing the silent data-loss race in trigger-based CDC
identified in issue #536. The root cause is that the refresh frontier advances
based on LSN ordering alone, while change-buffer reads run under MVCC
visibility — a long-running uncommitted transaction can cause its CDC rows to
be permanently skipped once the frontier moves past their LSN.

Changes

plans/safety/PLAN_FRONTIER_VISIBILITY_HOLDBACK.md — new plan covering:
- Root-cause analysis with verified code references (src/scheduler.rs,
  src/cdc.rs, src/refresh/mod.rs)
- Layer A: per-tick pg_stat_activity xmin probe to hold back the
  frontier (default ON, new GUC pg_trickle.frontier_holdback_mode)
- Layer B: optional xmin xid8 column on change-buffer tables for
  belt-and-suspenders visibility filtering and future CSN compatibility
  (opt-in, does not change current schema)
- Layer C: escape-hatch GUC values (none, lsn:<bytes>) to preserve
  the fast path for benchmarks and clean OLTP workloads
- Metrics (pgtrickle_frontier_holdback_lsn_bytes,
  pgtrickle_frontier_holdback_seconds) and a WARNING when holdback
  stalls beyond a configurable threshold
- Implementation steps with concrete file targets
- Acceptance criteria and required test coverage including a dedicated
  tests/e2e_long_txn_visibility_tests.rs
- Risk / trade-off table and out-of-scope follow-ups (sequence cache
  contention, atomic frontier+buffer commit)

Testing

No code changes in this PR — plan document only.
Implementation tests are specified in the plan:

Unit: classify_holdback pure-logic helper (no DB)
Integration: open-transaction + scheduler tick straddling scenario
E2E: e2e_long_txn_visibility_tests.rs covering READ COMMITTED, REPEATABLE
READ, 2PC, and a regression guard that confirms mode = none reproduces
the data loss

Notes

The WAL / logical-decoding CDC backend is already immune to this race (it
only emits committed changes ordered by commit LSN). The fix here targets the
trigger-based fallback path that must remain correct on clusters where
wal_level = logical is unavailable.
Closes issue [Architecture] Concurrency and Visibility Flaws in Trigger-Based CDC Mode (Data Loss & Sequence Bottleneck) #536 (tracking). Implementation will land in a follow-up PR.
Related: plans/PLAN_OVERALL_ASSESSMENT_2.md (non-atomic
frontier+buffer commit — separate issue).

Implementation (commit `2ddc94a`)

Implemented the holdback fix across 9 files:

src/config.rs — pg_trickle.frontier_holdback_mode GUC (default xmin) + pg_trickle.frontier_holdback_warn_seconds GUC (default 60)
src/shmem.rs — last_tick_oldest_xmin / last_tick_safe_lsn_u64 shared state + two PgAtomic metrics statics
src/version.rs — public lsn_to_u64() / u64_to_lsn() utilities
src/cdc.rs — classify_holdback() (pure, unit-tested) + compute_safe_upper_bound() probing pg_stat_activity + pg_prepared_xacts
src/scheduler.rs — all 6 tick_watermark sites replaced; compute_coordinator_tick_watermark(), compute_worker_tick_watermark(), emit_holdback_warning_if_needed()
src/monitor.rs — pgtrickle_frontier_holdback_lsn_bytes and pgtrickle_frontier_holdback_seconds Prometheus gauges
tests/e2e_long_txn_visibility_tests.rs — 5 new E2E tests (GUC defaults, READ COMMITTED, REPEATABLE READ, 2PC, regression guard)
docs/ARCHITECTURE.md — Frontier Visibility Holdback subsection
docs/TROUBLESHOOTING.md — Section 14 with diagnosis and resolution

All 1856 unit tests pass.

…fer rows (issue #536) Add frontier visibility holdback to close a silent data-loss window: when a long-running transaction inserts into a tracked source table and commits after the scheduler has already ticked, its CDC rows (recorded at the insert LSN) were permanently skipped because the frontier had already advanced past them. Changes: - config.rs: add pg_trickle.frontier_holdback_mode GUC (default 'xmin') and pg_trickle.frontier_holdback_warn_seconds GUC (default 60) - shmem.rs: add last_tick_oldest_xmin / last_tick_safe_lsn_u64 shared state fields and two PgAtomic metrics statics - version.rs: expose lsn_to_u64() / u64_to_lsn() as public utilities - cdc.rs: add classify_holdback() (pure, unit-tested) and compute_safe_upper_bound() which probes pg_stat_activity + pg_prepared_xacts each tick to derive the safe LSN ceiling - scheduler.rs: replace all 6 tick_watermark capture sites with compute_coordinator_tick_watermark() / compute_worker_tick_watermark(); add rate-limited warning via emit_holdback_warning_if_needed() - monitor.rs: add pgtrickle_frontier_holdback_lsn_bytes and pgtrickle_frontier_holdback_seconds Prometheus gauges - tests/e2e_long_txn_visibility_tests.rs: 5 new E2E tests covering GUC defaults, READ COMMITTED long txn, REPEATABLE READ long txn, 2PC prepared transaction, and a regression guard for mode='none' - docs/ARCHITECTURE.md: add Frontier Visibility Holdback subsection - docs/TROUBLESHOOTING.md: add section 14 with diagnosis and resolution

- docs: restore accidentally-deleted Symptoms block from TROUBLESHOOTING Section 13 (Fuse Tripped / Circuit Breaker) - shmem: add set_last_tick_holdback_state(xmin, lsn) helper that updates both fields under a single exclusive lock so dynamic workers never see an inconsistent xmin/LSN pair; use it in scheduler - cdc: add inline comment noting xid is 32-bit (PG18) and the bigint cast is safe; explain why the ::text::bigint path is future-proof - cdc: document the xid wraparound assumption in classify_holdback — linear u64 comparison is valid because wraparound between ticks (100ms-10s) requires ~4 billion commits - plan: update Status from Draft to Implemented - test: replace vague ignore comment with explicit cargo invocation command for the regression guard test

Copilot

Pull request overview

Implements the Issue #536 “frontier visibility holdback” fix to prevent trigger-based CDC from advancing the refresh frontier past uncommitted change-buffer rows, adding configuration knobs, shared-state plumbing, scheduler integration, observability metrics, E2E coverage, and documentation.

Changes:

Add holdback mode + warning-threshold GUCs and implement per-tick safe watermark computation using xmin/2PC probes.
Persist per-tick holdback state/metrics in shared memory and expose new Prometheus gauges.
Add dedicated E2E tests plus architecture/troubleshooting documentation for the holdback behavior.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 8 comments.

Show a summary per file

File	Description
`src/config.rs`	Adds `frontier_holdback_mode` + `frontier_holdback_warn_seconds` GUCs and normalization/tests.
`src/cdc.rs`	Adds `classify_holdback()` and SPI probe to compute safe LSN upper bound.
`src/scheduler.rs`	Wires holdback-aware tick watermarks into coordinator + dynamic workers and adds warning rate-limiting.
`src/shmem.rs`	Stores last-tick xmin + safe LSN and holdback gauge values in shmem.
`src/version.rs`	Adds public LSN string ⇄ `u64` helpers for shmem storage and computations.
`src/monitor.rs`	Emits new Prometheus gauges for current holdback distance/age.
`tests/e2e_long_txn_visibility_tests.rs`	Adds E2E scenarios for READ COMMITTED / REPEATABLE READ / 2PC, plus an ignored “unsafe mode” guard.
`docs/ARCHITECTURE.md`	Documents the root cause and the holdback algorithm/config/metrics.
`docs/TROUBLESHOOTING.md`	Adds operator workflow for diagnosing “stuck behind long transaction” behavior.
`plans/safety/PLAN_FRONTIER_VISIBILITY_HOLDBACK.md`	Design/implementation plan updated to “Implemented” with acceptance criteria and rationale.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-19T20:49:21Z

+async fn configure_fast_scheduler(db: &E2eDb) {
+    db.execute("ALTER SYSTEM SET pg_trickle.scheduler_interval_ms = 100")
+        .await;
+    db.execute("ALTER SYSTEM SET pg_trickle.min_schedule_seconds = 1")
+        .await;
+    db.execute("ALTER SYSTEM SET pg_trickle.auto_backoff = off")
+        .await;
+    db.reload_config_and_wait().await;
+    db.wait_for_setting("pg_trickle.scheduler_interval_ms", "100")


These E2E tests are intended to validate the trigger-based CDC visibility race (Issue #536), but they never force pg_trickle.cdc_mode = 'trigger'. In the E2E image wal_level = logical and cdc_mode defaults to auto, so the system may promote to WAL CDC (which is already immune) and make these tests pass without exercising the buggy trigger path. Set pg_trickle.cdc_mode to 'trigger' as part of the shared setup (and wait for it) so the tests actually cover the intended behavior.

Copilot · 2026-04-19T20:49:21Z

+- `pgtrickle_frontier_holdback_lsn_bytes` — how many WAL bytes behind write_lsn the safe frontier currently is.
+- `pgtrickle_frontier_holdback_seconds` — age (in seconds) of the oldest in-progress transaction.


Docs list the new holdback metrics as pgtrickle_frontier_holdback_*, but the project’s Prometheus docs state all exported metrics are prefixed pg_trickle_ (see docs/integrations/prometheus.md). After renaming the metrics in the exporter output, update these names here as well so operators can find the right series.

Suggested change

- `pgtrickle_frontier_holdback_lsn_bytes` — how many WAL bytes behind write_lsn the safe frontier currently is.

- `pgtrickle_frontier_holdback_seconds` — age (in seconds) of the oldest in-progress transaction.

- `pg_trickle_frontier_holdback_lsn_bytes` — how many WAL bytes behind write_lsn the safe frontier currently is.

- `pg_trickle_frontier_holdback_seconds` — age (in seconds) of the oldest in-progress transaction.

Copilot · 2026-04-19T20:49:22Z

+**Symptoms:**
+- A stream table's `data_timestamp` is not advancing even though the source
+  table is receiving new inserts.
+- The `pgtrickle_frontier_holdback_lsn_bytes` Prometheus gauge is non-zero.


The metric name in Symptoms uses the pgtrickle_... prefix, but the Prometheus integration docs state exported metrics are prefixed pg_trickle_. Keep this consistent with the actual /metrics output by updating it to pg_trickle_frontier_holdback_lsn_bytes (and likewise for any other references in this section).

Suggested change

- The `pgtrickle_frontier_holdback_lsn_bytes` Prometheus gauge is non-zero.

- The `pg_trickle_frontier_holdback_lsn_bytes` Prometheus gauge is non-zero.

Copilot · 2026-04-19T20:49:22Z

+   - `pgtrickle_frontier_holdback_lsn_bytes` (gauge: how far behind write_lsn)
+   - `pgtrickle_frontier_holdback_seconds` (gauge: oldest in-progress txn age)


This plan references the holdback metrics with a pgtrickle_... prefix, but the repository’s Prometheus docs state all exported metrics are prefixed pg_trickle_. Once the metrics are renamed in code, update the metric names here to match so the plan remains accurate for operators.

Suggested change

- `pgtrickle_frontier_holdback_lsn_bytes` (gauge: how far behind write_lsn)

- `pgtrickle_frontier_holdback_seconds` (gauge: oldest in-progress txn age)

- `pg_trickle_frontier_holdback_lsn_bytes` (gauge: how far behind write_lsn)

- `pg_trickle_frontier_holdback_seconds` (gauge: oldest in-progress txn age)

Copilot · 2026-04-19T20:49:22Z

+        "# HELP pgtrickle_frontier_holdback_lsn_bytes \
+         How many WAL bytes behind the write LSN the safe frontier currently is (0 = no holdback)\n",
+    );
+    out.push_str("# TYPE pgtrickle_frontier_holdback_lsn_bytes gauge\n");
+    out.push_str(&format!(
+        "pgtrickle_frontier_holdback_lsn_bytes {holdback_lsn}\n"
+    ));
+    out.push_str(
+        "# HELP pgtrickle_frontier_holdback_seconds \
+         Age in seconds of the oldest in-progress transaction causing a holdback (0 = no holdback)\n",
+    );
+    out.push_str("# TYPE pgtrickle_frontier_holdback_seconds gauge\n");
+    out.push_str(&format!(
+        "pgtrickle_frontier_holdback_seconds {holdback_age}\n"


The new Prometheus metrics are emitted with the pgtrickle_... prefix, but the existing metrics in this endpoint use pg_trickle_... and the Prometheus integration docs state all metrics are prefixed pg_trickle_. To keep metric naming consistent (and avoid surprise dashboards/alerts), rename these to pg_trickle_frontier_holdback_lsn_bytes / pg_trickle_frontier_holdback_seconds (including HELP/TYPE lines).

Suggested change

"# HELP pgtrickle_frontier_holdback_lsn_bytes \

How many WAL bytes behind the write LSN the safe frontier currently is (0 = no holdback)\n",

);

out.push_str("# TYPE pgtrickle_frontier_holdback_lsn_bytes gauge\n");

out.push_str(&format!(

"pgtrickle_frontier_holdback_lsn_bytes {holdback_lsn}\n"

));

out.push_str(

"# HELP pgtrickle_frontier_holdback_seconds \

Age in seconds of the oldest in-progress transaction causing a holdback (0 = no holdback)\n",

);

out.push_str("# TYPE pgtrickle_frontier_holdback_seconds gauge\n");

out.push_str(&format!(

"pgtrickle_frontier_holdback_seconds {holdback_age}\n"

"# HELP pg_trickle_frontier_holdback_lsn_bytes \

How many WAL bytes behind the write LSN the safe frontier currently is (0 = no holdback)\n",

);

out.push_str("# TYPE pg_trickle_frontier_holdback_lsn_bytes gauge\n");

out.push_str(&format!(

"pg_trickle_frontier_holdback_lsn_bytes {holdback_lsn}\n"

));

out.push_str(

"# HELP pg_trickle_frontier_holdback_seconds \

Age in seconds of the oldest in-progress transaction causing a holdback (0 = no holdback)\n",

);

out.push_str("# TYPE pg_trickle_frontier_holdback_seconds gauge\n");

out.push_str(&format!(

"pg_trickle_frontier_holdback_seconds {holdback_age}\n"

Copilot · 2026-04-19T20:49:23Z

+                    // On probe failure, fall back to current write LSN.
+                    log!(
+                        "pg_trickle: holdback probe failed ({}); using raw write LSN",
+                        e
+                    );
+                    let lsn =
+                        Spi::get_one::<String>("SELECT pg_current_wal_lsn()::text").unwrap_or(None);
+                    if let Some(ref l) = lsn {
+                        shmem::set_last_tick_safe_lsn(version::lsn_to_u64(l));
+                    }
+                    shmem::update_holdback_metrics(0, 0);
+                    (lsn, 0, 0)


On holdback probe failure, this falls back to the raw write LSN. That reintroduces the silent data-loss race the holdback is meant to prevent (a transient SPI failure during an in-flight long txn could advance the frontier past invisible rows). Prefer a fail-safe fallback that does not advance past the previous safe watermark (e.g., return prev_watermark_lsn when present, or disable watermark advancement for the tick) and keep the holdback metrics/warning accurate.

Suggested change

// On probe failure, fall back to current write LSN.

log!(

"pg_trickle: holdback probe failed ({}); using raw write LSN",

e

);

let lsn =

Spi::get_one::<String>("SELECT pg_current_wal_lsn()::text").unwrap_or(None);

if let Some(ref l) = lsn {

shmem::set_last_tick_safe_lsn(version::lsn_to_u64(l));

}

shmem::update_holdback_metrics(0, 0);

(lsn, 0, 0)

// On probe failure, fail safe: do not advance beyond the

// previous known-safe watermark. If none exists yet,

// disable watermark advancement for this tick.

if prev_watermark_lsn.is_some() {

log!(

"pg_trickle: holdback probe failed ({}); retaining previous safe watermark",

e

);

} else {

log!(

"pg_trickle: holdback probe failed ({}); disabling watermark advancement for this tick",

e

);

}

(prev_watermark_lsn.clone(), 0, 0)

Copilot · 2026-04-19T20:49:23Z

+    // tick boundary.  Reset to None on scheduler restart.
+    let mut prev_tick_watermark: Option<String> = None;


prev_tick_watermark is reset to None on scheduler restart, which means the first tick after a restart cannot hold back to a known-safe watermark. Combined with compute_safe_upper_bound() (which uses write_lsn when prev_watermark_lsn is None), this can still lose trigger-CDC rows from long-running transactions that are already open when the scheduler starts. Consider seeding prev_tick_watermark from shmem::last_tick_safe_lsn_u64() (when non-zero) so restarts preserve the last safe baseline.

Suggested change

// tick boundary. Reset to None on scheduler restart.

let mut prev_tick_watermark: Option<String> = None;

// tick boundary. Seed from shared memory on scheduler restart so the

// first post-restart tick preserves the last known-safe baseline.

let mut prev_tick_watermark: Option<String> = {

let last_safe_lsn = crate::shmem::last_tick_safe_lsn_u64();

if last_safe_lsn != 0 {

Some(pg_sys::XLogRecPtr::from(last_safe_lsn).to_string())

} else {

None

}

};

Copilot · 2026-04-19T20:49:23Z

+    if !is_shmem_available() {
+        return;
+    }
+    // Update both fields atomically under the same lock.


The comment on this helper says it updates both xmin and LSN “atomically under the same lock”, but the function only writes last_tick_safe_lsn_u64. Either update the comment to match reality or remove this helper and funnel callers through set_last_tick_holdback_state() when atomicity across both fields is required.

Suggested change

// Update both fields atomically under the same lock.

// Update only the cached safe LSN under the shared-memory lock.

- scheduler: on holdback probe failure, hold at prev_watermark_lsn instead of advancing to raw write LSN (which is the unsafe behavior the holdback is meant to prevent); only fall back to write LSN on the very first tick where no previous watermark is known - config: replace FrontierHoldbackMode::as_str() (which returned a static placeholder for LsnBytes) with display_string() that includes the actual byte count (e.g. "lsn:1048576") - config: add FrontierHoldbackMode::InvalidLsn sentinel so that lsn:<invalid> emits pgrx::warning! at the accessor site and falls back to Xmin (previously silently produced LsnBytes(0)) - version: add round-trip unit tests for lsn_to_u64 / u64_to_lsn including known-value and format assertions

…eSQL Add a one-time WARNING when the holdback probe cannot see any other backend processes in pg_stat_activity (visible_other_backends = 0). On managed services like RDS and Cloud SQL, pg_stat_activity is restricted to the current role's own sessions, so the probe silently returns min_xmin = 0 even when long-running transactions exist in other sessions -- the same silent data-loss scenario the holdback was designed to prevent. Detection: extend the probe query with a subquery that counts all pg_stat_activity rows excluding the bg-worker's own pid. A healthy PostgreSQL server always has background processes visible (checkpointer, autovacuum launcher, walsummarizer); 0 rows indicates restricted access. The warning fires once per server process lifetime via an AtomicBool and instructs operators to run: GRANT pg_monitor TO <pg_trickle_service_role>; Also document the managed-service caveat and fix command in TROUBLESHOOTING.md section 14, resolution step 5.

- monitor: rename Prometheus metrics pgtrickle_* -> pg_trickle_* to match the existing metric prefix used by all other gauges in the /metrics endpoint; update matching docs in ARCHITECTURE.md, TROUBLESHOOTING.md, and the plan document - shmem: fix misleading comment on set_last_tick_safe_lsn() -- the function only updates last_tick_safe_lsn_u64, not both fields; point callers needing atomic dual-field update to set_last_tick_holdback_state() instead - scheduler: seed prev_tick_watermark from shmem on startup so the first post-restart tick preserves the last known-safe LSN baseline; prevents a one-tick window where the frontier could advance past a long-running transaction already open before the scheduler restarted - tests: force pg_trickle.cdc_mode = 'trigger' in configure_fast_scheduler so holdback E2E tests actually exercise the trigger-based CDC path; WAL CDC is immune to the race and would make tests pass trivially

docs(safety): add plan for frontier visibility holdback fix (issue #536)

c586549

BaardBouvet requested a review from grove as a code owner April 19, 2026 19:17

BaardBouvet changed the title ~~docs(safety): add frontier visibility holdback plan (issue #536)~~ fix(cdc): prevent frontier from advancing past uncommitted change-buffer rows (issue #536) Apr 19, 2026

BaardBouvet added 2 commits April 19, 2026 20:34

fix(clippy): correct doc-comment indentation (2 spaces not 4)

70fba42

grove requested a review from Copilot April 19, 2026 20:43

Copilot started reviewing on behalf of grove April 19, 2026 20:43 View session

Copilot AI reviewed Apr 19, 2026

View reviewed changes

BaardBouvet added 3 commits April 19, 2026 21:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(cdc): prevent frontier from advancing past uncommitted change-buffer rows (issue #536)#593

fix(cdc): prevent frontier from advancing past uncommitted change-buffer rows (issue #536)#593
BaardBouvet wants to merge 7 commits intomainfrom
docs/frontier-visibility-holdback-plan

BaardBouvet commented Apr 19, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 19, 2026

Uh oh!

Copilot AI Apr 19, 2026

Uh oh!

Copilot AI Apr 19, 2026

Uh oh!

Copilot AI Apr 19, 2026

Uh oh!

Copilot AI Apr 19, 2026

Uh oh!

Copilot AI Apr 19, 2026

Uh oh!

Copilot AI Apr 19, 2026

Uh oh!

Copilot AI Apr 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		- `pgtrickle_frontier_holdback_lsn_bytes` — how many WAL bytes behind write_lsn the safe frontier currently is.
		- `pgtrickle_frontier_holdback_seconds` — age (in seconds) of the oldest in-progress transaction.

	- The `pgtrickle_frontier_holdback_lsn_bytes` Prometheus gauge is non-zero.
	- The `pg_trickle_frontier_holdback_lsn_bytes` Prometheus gauge is non-zero.

		// tick boundary. Reset to None on scheduler restart.
		let mut prev_tick_watermark: Option<String> = None;

-    // tick boundary.  Reset to None on scheduler restart.
-    let mut prev_tick_watermark: Option<String> = None;
+    // tick boundary. Seed from shared memory on scheduler restart so the
+    // first post-restart tick preserves the last known-safe baseline.
+    let mut prev_tick_watermark: Option<String> = {
+        let last_safe_lsn = crate::shmem::last_tick_safe_lsn_u64();
+        if last_safe_lsn != 0 {
+            Some(pg_sys::XLogRecPtr::from(last_safe_lsn).to_string())
+        } else {
+            None
+        }
+    };

	// Update both fields atomically under the same lock.
	// Update only the cached safe LSN under the shared-memory lock.

Conversation

BaardBouvet commented Apr 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Testing

Notes

Implementation (commit 2ddc94a)

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

BaardBouvet commented Apr 19, 2026 •

edited

Loading

Implementation (commit `2ddc94a`)