fix(cdc): prevent frontier from advancing past uncommitted change-buffer rows (issue #536)#593
fix(cdc): prevent frontier from advancing past uncommitted change-buffer rows (issue #536)#593BaardBouvet wants to merge 7 commits intomainfrom
Conversation
…fer rows (issue #536) Add frontier visibility holdback to close a silent data-loss window: when a long-running transaction inserts into a tracked source table and commits after the scheduler has already ticked, its CDC rows (recorded at the insert LSN) were permanently skipped because the frontier had already advanced past them. Changes: - config.rs: add pg_trickle.frontier_holdback_mode GUC (default 'xmin') and pg_trickle.frontier_holdback_warn_seconds GUC (default 60) - shmem.rs: add last_tick_oldest_xmin / last_tick_safe_lsn_u64 shared state fields and two PgAtomic metrics statics - version.rs: expose lsn_to_u64() / u64_to_lsn() as public utilities - cdc.rs: add classify_holdback() (pure, unit-tested) and compute_safe_upper_bound() which probes pg_stat_activity + pg_prepared_xacts each tick to derive the safe LSN ceiling - scheduler.rs: replace all 6 tick_watermark capture sites with compute_coordinator_tick_watermark() / compute_worker_tick_watermark(); add rate-limited warning via emit_holdback_warning_if_needed() - monitor.rs: add pgtrickle_frontier_holdback_lsn_bytes and pgtrickle_frontier_holdback_seconds Prometheus gauges - tests/e2e_long_txn_visibility_tests.rs: 5 new E2E tests covering GUC defaults, READ COMMITTED long txn, REPEATABLE READ long txn, 2PC prepared transaction, and a regression guard for mode='none' - docs/ARCHITECTURE.md: add Frontier Visibility Holdback subsection - docs/TROUBLESHOOTING.md: add section 14 with diagnosis and resolution
- docs: restore accidentally-deleted Symptoms block from TROUBLESHOOTING Section 13 (Fuse Tripped / Circuit Breaker) - shmem: add set_last_tick_holdback_state(xmin, lsn) helper that updates both fields under a single exclusive lock so dynamic workers never see an inconsistent xmin/LSN pair; use it in scheduler - cdc: add inline comment noting xid is 32-bit (PG18) and the bigint cast is safe; explain why the ::text::bigint path is future-proof - cdc: document the xid wraparound assumption in classify_holdback — linear u64 comparison is valid because wraparound between ticks (100ms-10s) requires ~4 billion commits - plan: update Status from Draft to Implemented - test: replace vague ignore comment with explicit cargo invocation command for the regression guard test
There was a problem hiding this comment.
Pull request overview
Implements the Issue #536 “frontier visibility holdback” fix to prevent trigger-based CDC from advancing the refresh frontier past uncommitted change-buffer rows, adding configuration knobs, shared-state plumbing, scheduler integration, observability metrics, E2E coverage, and documentation.
Changes:
- Add holdback mode + warning-threshold GUCs and implement per-tick safe watermark computation using xmin/2PC probes.
- Persist per-tick holdback state/metrics in shared memory and expose new Prometheus gauges.
- Add dedicated E2E tests plus architecture/troubleshooting documentation for the holdback behavior.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
src/config.rs |
Adds frontier_holdback_mode + frontier_holdback_warn_seconds GUCs and normalization/tests. |
src/cdc.rs |
Adds classify_holdback() and SPI probe to compute safe LSN upper bound. |
src/scheduler.rs |
Wires holdback-aware tick watermarks into coordinator + dynamic workers and adds warning rate-limiting. |
src/shmem.rs |
Stores last-tick xmin + safe LSN and holdback gauge values in shmem. |
src/version.rs |
Adds public LSN string ⇄ u64 helpers for shmem storage and computations. |
src/monitor.rs |
Emits new Prometheus gauges for current holdback distance/age. |
tests/e2e_long_txn_visibility_tests.rs |
Adds E2E scenarios for READ COMMITTED / REPEATABLE READ / 2PC, plus an ignored “unsafe mode” guard. |
docs/ARCHITECTURE.md |
Documents the root cause and the holdback algorithm/config/metrics. |
docs/TROUBLESHOOTING.md |
Adds operator workflow for diagnosing “stuck behind long transaction” behavior. |
plans/safety/PLAN_FRONTIER_VISIBILITY_HOLDBACK.md |
Design/implementation plan updated to “Implemented” with acceptance criteria and rationale. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| async fn configure_fast_scheduler(db: &E2eDb) { | ||
| db.execute("ALTER SYSTEM SET pg_trickle.scheduler_interval_ms = 100") | ||
| .await; | ||
| db.execute("ALTER SYSTEM SET pg_trickle.min_schedule_seconds = 1") | ||
| .await; | ||
| db.execute("ALTER SYSTEM SET pg_trickle.auto_backoff = off") | ||
| .await; | ||
| db.reload_config_and_wait().await; | ||
| db.wait_for_setting("pg_trickle.scheduler_interval_ms", "100") |
There was a problem hiding this comment.
These E2E tests are intended to validate the trigger-based CDC visibility race (Issue #536), but they never force pg_trickle.cdc_mode = 'trigger'. In the E2E image wal_level = logical and cdc_mode defaults to auto, so the system may promote to WAL CDC (which is already immune) and make these tests pass without exercising the buggy trigger path. Set pg_trickle.cdc_mode to 'trigger' as part of the shared setup (and wait for it) so the tests actually cover the intended behavior.
| - `pgtrickle_frontier_holdback_lsn_bytes` — how many WAL bytes behind write_lsn the safe frontier currently is. | ||
| - `pgtrickle_frontier_holdback_seconds` — age (in seconds) of the oldest in-progress transaction. |
There was a problem hiding this comment.
Docs list the new holdback metrics as pgtrickle_frontier_holdback_*, but the project’s Prometheus docs state all exported metrics are prefixed pg_trickle_ (see docs/integrations/prometheus.md). After renaming the metrics in the exporter output, update these names here as well so operators can find the right series.
| - `pgtrickle_frontier_holdback_lsn_bytes` — how many WAL bytes behind write_lsn the safe frontier currently is. | |
| - `pgtrickle_frontier_holdback_seconds` — age (in seconds) of the oldest in-progress transaction. | |
| - `pg_trickle_frontier_holdback_lsn_bytes` — how many WAL bytes behind write_lsn the safe frontier currently is. | |
| - `pg_trickle_frontier_holdback_seconds` — age (in seconds) of the oldest in-progress transaction. |
| **Symptoms:** | ||
| - A stream table's `data_timestamp` is not advancing even though the source | ||
| table is receiving new inserts. | ||
| - The `pgtrickle_frontier_holdback_lsn_bytes` Prometheus gauge is non-zero. |
There was a problem hiding this comment.
The metric name in Symptoms uses the pgtrickle_... prefix, but the Prometheus integration docs state exported metrics are prefixed pg_trickle_. Keep this consistent with the actual /metrics output by updating it to pg_trickle_frontier_holdback_lsn_bytes (and likewise for any other references in this section).
| - The `pgtrickle_frontier_holdback_lsn_bytes` Prometheus gauge is non-zero. | |
| - The `pg_trickle_frontier_holdback_lsn_bytes` Prometheus gauge is non-zero. |
| - `pgtrickle_frontier_holdback_lsn_bytes` (gauge: how far behind write_lsn) | ||
| - `pgtrickle_frontier_holdback_seconds` (gauge: oldest in-progress txn age) |
There was a problem hiding this comment.
This plan references the holdback metrics with a pgtrickle_... prefix, but the repository’s Prometheus docs state all exported metrics are prefixed pg_trickle_. Once the metrics are renamed in code, update the metric names here to match so the plan remains accurate for operators.
| - `pgtrickle_frontier_holdback_lsn_bytes` (gauge: how far behind write_lsn) | |
| - `pgtrickle_frontier_holdback_seconds` (gauge: oldest in-progress txn age) | |
| - `pg_trickle_frontier_holdback_lsn_bytes` (gauge: how far behind write_lsn) | |
| - `pg_trickle_frontier_holdback_seconds` (gauge: oldest in-progress txn age) |
| "# HELP pgtrickle_frontier_holdback_lsn_bytes \ | ||
| How many WAL bytes behind the write LSN the safe frontier currently is (0 = no holdback)\n", | ||
| ); | ||
| out.push_str("# TYPE pgtrickle_frontier_holdback_lsn_bytes gauge\n"); | ||
| out.push_str(&format!( | ||
| "pgtrickle_frontier_holdback_lsn_bytes {holdback_lsn}\n" | ||
| )); | ||
| out.push_str( | ||
| "# HELP pgtrickle_frontier_holdback_seconds \ | ||
| Age in seconds of the oldest in-progress transaction causing a holdback (0 = no holdback)\n", | ||
| ); | ||
| out.push_str("# TYPE pgtrickle_frontier_holdback_seconds gauge\n"); | ||
| out.push_str(&format!( | ||
| "pgtrickle_frontier_holdback_seconds {holdback_age}\n" |
There was a problem hiding this comment.
The new Prometheus metrics are emitted with the pgtrickle_... prefix, but the existing metrics in this endpoint use pg_trickle_... and the Prometheus integration docs state all metrics are prefixed pg_trickle_. To keep metric naming consistent (and avoid surprise dashboards/alerts), rename these to pg_trickle_frontier_holdback_lsn_bytes / pg_trickle_frontier_holdback_seconds (including HELP/TYPE lines).
| "# HELP pgtrickle_frontier_holdback_lsn_bytes \ | |
| How many WAL bytes behind the write LSN the safe frontier currently is (0 = no holdback)\n", | |
| ); | |
| out.push_str("# TYPE pgtrickle_frontier_holdback_lsn_bytes gauge\n"); | |
| out.push_str(&format!( | |
| "pgtrickle_frontier_holdback_lsn_bytes {holdback_lsn}\n" | |
| )); | |
| out.push_str( | |
| "# HELP pgtrickle_frontier_holdback_seconds \ | |
| Age in seconds of the oldest in-progress transaction causing a holdback (0 = no holdback)\n", | |
| ); | |
| out.push_str("# TYPE pgtrickle_frontier_holdback_seconds gauge\n"); | |
| out.push_str(&format!( | |
| "pgtrickle_frontier_holdback_seconds {holdback_age}\n" | |
| "# HELP pg_trickle_frontier_holdback_lsn_bytes \ | |
| How many WAL bytes behind the write LSN the safe frontier currently is (0 = no holdback)\n", | |
| ); | |
| out.push_str("# TYPE pg_trickle_frontier_holdback_lsn_bytes gauge\n"); | |
| out.push_str(&format!( | |
| "pg_trickle_frontier_holdback_lsn_bytes {holdback_lsn}\n" | |
| )); | |
| out.push_str( | |
| "# HELP pg_trickle_frontier_holdback_seconds \ | |
| Age in seconds of the oldest in-progress transaction causing a holdback (0 = no holdback)\n", | |
| ); | |
| out.push_str("# TYPE pg_trickle_frontier_holdback_seconds gauge\n"); | |
| out.push_str(&format!( | |
| "pg_trickle_frontier_holdback_seconds {holdback_age}\n" |
| // On probe failure, fall back to current write LSN. | ||
| log!( | ||
| "pg_trickle: holdback probe failed ({}); using raw write LSN", | ||
| e | ||
| ); | ||
| let lsn = | ||
| Spi::get_one::<String>("SELECT pg_current_wal_lsn()::text").unwrap_or(None); | ||
| if let Some(ref l) = lsn { | ||
| shmem::set_last_tick_safe_lsn(version::lsn_to_u64(l)); | ||
| } | ||
| shmem::update_holdback_metrics(0, 0); | ||
| (lsn, 0, 0) |
There was a problem hiding this comment.
On holdback probe failure, this falls back to the raw write LSN. That reintroduces the silent data-loss race the holdback is meant to prevent (a transient SPI failure during an in-flight long txn could advance the frontier past invisible rows). Prefer a fail-safe fallback that does not advance past the previous safe watermark (e.g., return prev_watermark_lsn when present, or disable watermark advancement for the tick) and keep the holdback metrics/warning accurate.
| // On probe failure, fall back to current write LSN. | |
| log!( | |
| "pg_trickle: holdback probe failed ({}); using raw write LSN", | |
| e | |
| ); | |
| let lsn = | |
| Spi::get_one::<String>("SELECT pg_current_wal_lsn()::text").unwrap_or(None); | |
| if let Some(ref l) = lsn { | |
| shmem::set_last_tick_safe_lsn(version::lsn_to_u64(l)); | |
| } | |
| shmem::update_holdback_metrics(0, 0); | |
| (lsn, 0, 0) | |
| // On probe failure, fail safe: do not advance beyond the | |
| // previous known-safe watermark. If none exists yet, | |
| // disable watermark advancement for this tick. | |
| if prev_watermark_lsn.is_some() { | |
| log!( | |
| "pg_trickle: holdback probe failed ({}); retaining previous safe watermark", | |
| e | |
| ); | |
| } else { | |
| log!( | |
| "pg_trickle: holdback probe failed ({}); disabling watermark advancement for this tick", | |
| e | |
| ); | |
| } | |
| (prev_watermark_lsn.clone(), 0, 0) |
| // tick boundary. Reset to None on scheduler restart. | ||
| let mut prev_tick_watermark: Option<String> = None; |
There was a problem hiding this comment.
prev_tick_watermark is reset to None on scheduler restart, which means the first tick after a restart cannot hold back to a known-safe watermark. Combined with compute_safe_upper_bound() (which uses write_lsn when prev_watermark_lsn is None), this can still lose trigger-CDC rows from long-running transactions that are already open when the scheduler starts. Consider seeding prev_tick_watermark from shmem::last_tick_safe_lsn_u64() (when non-zero) so restarts preserve the last safe baseline.
| // tick boundary. Reset to None on scheduler restart. | |
| let mut prev_tick_watermark: Option<String> = None; | |
| // tick boundary. Seed from shared memory on scheduler restart so the | |
| // first post-restart tick preserves the last known-safe baseline. | |
| let mut prev_tick_watermark: Option<String> = { | |
| let last_safe_lsn = crate::shmem::last_tick_safe_lsn_u64(); | |
| if last_safe_lsn != 0 { | |
| Some(pg_sys::XLogRecPtr::from(last_safe_lsn).to_string()) | |
| } else { | |
| None | |
| } | |
| }; |
| if !is_shmem_available() { | ||
| return; | ||
| } | ||
| // Update both fields atomically under the same lock. |
There was a problem hiding this comment.
The comment on this helper says it updates both xmin and LSN “atomically under the same lock”, but the function only writes last_tick_safe_lsn_u64. Either update the comment to match reality or remove this helper and funnel callers through set_last_tick_holdback_state() when atomicity across both fields is required.
| // Update both fields atomically under the same lock. | |
| // Update only the cached safe LSN under the shared-memory lock. |
- scheduler: on holdback probe failure, hold at prev_watermark_lsn instead of advancing to raw write LSN (which is the unsafe behavior the holdback is meant to prevent); only fall back to write LSN on the very first tick where no previous watermark is known - config: replace FrontierHoldbackMode::as_str() (which returned a static placeholder for LsnBytes) with display_string() that includes the actual byte count (e.g. "lsn:1048576") - config: add FrontierHoldbackMode::InvalidLsn sentinel so that lsn:<invalid> emits pgrx::warning! at the accessor site and falls back to Xmin (previously silently produced LsnBytes(0)) - version: add round-trip unit tests for lsn_to_u64 / u64_to_lsn including known-value and format assertions
…eSQL Add a one-time WARNING when the holdback probe cannot see any other backend processes in pg_stat_activity (visible_other_backends = 0). On managed services like RDS and Cloud SQL, pg_stat_activity is restricted to the current role's own sessions, so the probe silently returns min_xmin = 0 even when long-running transactions exist in other sessions -- the same silent data-loss scenario the holdback was designed to prevent. Detection: extend the probe query with a subquery that counts all pg_stat_activity rows excluding the bg-worker's own pid. A healthy PostgreSQL server always has background processes visible (checkpointer, autovacuum launcher, walsummarizer); 0 rows indicates restricted access. The warning fires once per server process lifetime via an AtomicBool and instructs operators to run: GRANT pg_monitor TO <pg_trickle_service_role>; Also document the managed-service caveat and fix command in TROUBLESHOOTING.md section 14, resolution step 5.
- monitor: rename Prometheus metrics pgtrickle_* -> pg_trickle_* to match the existing metric prefix used by all other gauges in the /metrics endpoint; update matching docs in ARCHITECTURE.md, TROUBLESHOOTING.md, and the plan document - shmem: fix misleading comment on set_last_tick_safe_lsn() -- the function only updates last_tick_safe_lsn_u64, not both fields; point callers needing atomic dual-field update to set_last_tick_holdback_state() instead - scheduler: seed prev_tick_watermark from shmem on startup so the first post-restart tick preserves the last known-safe LSN baseline; prevents a one-tick window where the frontier could advance past a long-running transaction already open before the scheduler restarted - tests: force pg_trickle.cdc_mode = 'trigger' in configure_fast_scheduler so holdback E2E tests actually exercise the trigger-based CDC path; WAL CDC is immune to the race and would make tests pass trivially
Summary
Adds a design plan for fixing the silent data-loss race in trigger-based CDC
identified in issue #536. The root cause is that the refresh frontier advances
based on LSN ordering alone, while change-buffer reads run under MVCC
visibility — a long-running uncommitted transaction can cause its CDC rows to
be permanently skipped once the frontier moves past their LSN.
Changes
plans/safety/PLAN_FRONTIER_VISIBILITY_HOLDBACK.md— new plan covering:src/scheduler.rs,src/cdc.rs,src/refresh/mod.rs)pg_stat_activityxmin probe to hold back thefrontier (default ON, new GUC
pg_trickle.frontier_holdback_mode)xmin xid8column on change-buffer tables forbelt-and-suspenders visibility filtering and future CSN compatibility
(opt-in, does not change current schema)
none,lsn:<bytes>) to preservethe fast path for benchmarks and clean OLTP workloads
pgtrickle_frontier_holdback_lsn_bytes,pgtrickle_frontier_holdback_seconds) and aWARNINGwhen holdbackstalls beyond a configurable threshold
tests/e2e_long_txn_visibility_tests.rscontention, atomic frontier+buffer commit)
Testing
No code changes in this PR — plan document only.
Implementation tests are specified in the plan:
classify_holdbackpure-logic helper (no DB)e2e_long_txn_visibility_tests.rscovering READ COMMITTED, REPEATABLEREAD, 2PC, and a regression guard that confirms
mode = nonereproducesthe data loss
Notes
The WAL / logical-decoding CDC backend is already immune to this race (it
only emits committed changes ordered by commit LSN). The fix here targets the
trigger-based fallback path that must remain correct on clusters where
wal_level = logicalis unavailable.Closes issue [Architecture] Concurrency and Visibility Flaws in Trigger-Based CDC Mode (Data Loss & Sequence Bottleneck) #536 (tracking). Implementation will land in a follow-up PR.
Related:
plans/PLAN_OVERALL_ASSESSMENT_2.md(non-atomicfrontier+buffer commit — separate issue).
Implementation (commit 2ddc94a)
Implemented the holdback fix across 9 files:
src/config.rs—pg_trickle.frontier_holdback_modeGUC (defaultxmin) +pg_trickle.frontier_holdback_warn_secondsGUC (default60)src/shmem.rs—last_tick_oldest_xmin/last_tick_safe_lsn_u64shared state + twoPgAtomicmetrics staticssrc/version.rs— publiclsn_to_u64()/u64_to_lsn()utilitiessrc/cdc.rs—classify_holdback()(pure, unit-tested) +compute_safe_upper_bound()probingpg_stat_activity+pg_prepared_xactssrc/scheduler.rs— all 6 tick_watermark sites replaced;compute_coordinator_tick_watermark(),compute_worker_tick_watermark(),emit_holdback_warning_if_needed()src/monitor.rs—pgtrickle_frontier_holdback_lsn_bytesandpgtrickle_frontier_holdback_secondsPrometheus gaugestests/e2e_long_txn_visibility_tests.rs— 5 new E2E tests (GUC defaults, READ COMMITTED, REPEATABLE READ, 2PC, regression guard)docs/ARCHITECTURE.md— Frontier Visibility Holdback subsectiondocs/TROUBLESHOOTING.md— Section 14 with diagnosis and resolutionAll 1856 unit tests pass.