Skip to content

fix(cdc): prevent frontier from advancing past uncommitted change-buffer rows (issue #536)#593

Open
BaardBouvet wants to merge 7 commits intomainfrom
docs/frontier-visibility-holdback-plan
Open

fix(cdc): prevent frontier from advancing past uncommitted change-buffer rows (issue #536)#593
BaardBouvet wants to merge 7 commits intomainfrom
docs/frontier-visibility-holdback-plan

Conversation

@BaardBouvet
Copy link
Copy Markdown
Collaborator

@BaardBouvet BaardBouvet commented Apr 19, 2026

Summary

Adds a design plan for fixing the silent data-loss race in trigger-based CDC
identified in issue #536. The root cause is that the refresh frontier advances
based on LSN ordering alone, while change-buffer reads run under MVCC
visibility — a long-running uncommitted transaction can cause its CDC rows to
be permanently skipped once the frontier moves past their LSN.

Changes

  • plans/safety/PLAN_FRONTIER_VISIBILITY_HOLDBACK.md — new plan covering:
    • Root-cause analysis with verified code references (src/scheduler.rs,
      src/cdc.rs, src/refresh/mod.rs)
    • Layer A: per-tick pg_stat_activity xmin probe to hold back the
      frontier (default ON, new GUC pg_trickle.frontier_holdback_mode)
    • Layer B: optional xmin xid8 column on change-buffer tables for
      belt-and-suspenders visibility filtering and future CSN compatibility
      (opt-in, does not change current schema)
    • Layer C: escape-hatch GUC values (none, lsn:<bytes>) to preserve
      the fast path for benchmarks and clean OLTP workloads
    • Metrics (pgtrickle_frontier_holdback_lsn_bytes,
      pgtrickle_frontier_holdback_seconds) and a WARNING when holdback
      stalls beyond a configurable threshold
    • Implementation steps with concrete file targets
    • Acceptance criteria and required test coverage including a dedicated
      tests/e2e_long_txn_visibility_tests.rs
    • Risk / trade-off table and out-of-scope follow-ups (sequence cache
      contention, atomic frontier+buffer commit)

Testing

No code changes in this PR — plan document only.
Implementation tests are specified in the plan:

  • Unit: classify_holdback pure-logic helper (no DB)
  • Integration: open-transaction + scheduler tick straddling scenario
  • E2E: e2e_long_txn_visibility_tests.rs covering READ COMMITTED, REPEATABLE
    READ, 2PC, and a regression guard that confirms mode = none reproduces
    the data loss

Notes

Implementation (commit 2ddc94a)

Implemented the holdback fix across 9 files:

  • src/config.rspg_trickle.frontier_holdback_mode GUC (default xmin) + pg_trickle.frontier_holdback_warn_seconds GUC (default 60)
  • src/shmem.rslast_tick_oldest_xmin / last_tick_safe_lsn_u64 shared state + two PgAtomic metrics statics
  • src/version.rs — public lsn_to_u64() / u64_to_lsn() utilities
  • src/cdc.rsclassify_holdback() (pure, unit-tested) + compute_safe_upper_bound() probing pg_stat_activity + pg_prepared_xacts
  • src/scheduler.rs — all 6 tick_watermark sites replaced; compute_coordinator_tick_watermark(), compute_worker_tick_watermark(), emit_holdback_warning_if_needed()
  • src/monitor.rspgtrickle_frontier_holdback_lsn_bytes and pgtrickle_frontier_holdback_seconds Prometheus gauges
  • tests/e2e_long_txn_visibility_tests.rs — 5 new E2E tests (GUC defaults, READ COMMITTED, REPEATABLE READ, 2PC, regression guard)
  • docs/ARCHITECTURE.md — Frontier Visibility Holdback subsection
  • docs/TROUBLESHOOTING.md — Section 14 with diagnosis and resolution

All 1856 unit tests pass.

@BaardBouvet BaardBouvet requested a review from grove as a code owner April 19, 2026 19:17
…fer rows (issue #536)

Add frontier visibility holdback to close a silent data-loss window:
when a long-running transaction inserts into a tracked source table and
commits after the scheduler has already ticked, its CDC rows (recorded
at the insert LSN) were permanently skipped because the frontier had
already advanced past them.

Changes:
- config.rs: add pg_trickle.frontier_holdback_mode GUC (default 'xmin')
  and pg_trickle.frontier_holdback_warn_seconds GUC (default 60)
- shmem.rs: add last_tick_oldest_xmin / last_tick_safe_lsn_u64 shared
  state fields and two PgAtomic metrics statics
- version.rs: expose lsn_to_u64() / u64_to_lsn() as public utilities
- cdc.rs: add classify_holdback() (pure, unit-tested) and
  compute_safe_upper_bound() which probes pg_stat_activity +
  pg_prepared_xacts each tick to derive the safe LSN ceiling
- scheduler.rs: replace all 6 tick_watermark capture sites with
  compute_coordinator_tick_watermark() / compute_worker_tick_watermark();
  add rate-limited warning via emit_holdback_warning_if_needed()
- monitor.rs: add pgtrickle_frontier_holdback_lsn_bytes and
  pgtrickle_frontier_holdback_seconds Prometheus gauges
- tests/e2e_long_txn_visibility_tests.rs: 5 new E2E tests covering GUC
  defaults, READ COMMITTED long txn, REPEATABLE READ long txn, 2PC
  prepared transaction, and a regression guard for mode='none'
- docs/ARCHITECTURE.md: add Frontier Visibility Holdback subsection
- docs/TROUBLESHOOTING.md: add section 14 with diagnosis and resolution
@BaardBouvet BaardBouvet changed the title docs(safety): add frontier visibility holdback plan (issue #536) fix(cdc): prevent frontier from advancing past uncommitted change-buffer rows (issue #536) Apr 19, 2026
- docs: restore accidentally-deleted Symptoms block from TROUBLESHOOTING
  Section 13 (Fuse Tripped / Circuit Breaker)
- shmem: add set_last_tick_holdback_state(xmin, lsn) helper that updates
  both fields under a single exclusive lock so dynamic workers never see
  an inconsistent xmin/LSN pair; use it in scheduler
- cdc: add inline comment noting xid is 32-bit (PG18) and the bigint
  cast is safe; explain why the ::text::bigint path is future-proof
- cdc: document the xid wraparound assumption in classify_holdback —
  linear u64 comparison is valid because wraparound between ticks
  (100ms-10s) requires ~4 billion commits
- plan: update Status from Draft to Implemented
- test: replace vague ignore comment with explicit cargo invocation
  command for the regression guard test
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Implements the Issue #536 “frontier visibility holdback” fix to prevent trigger-based CDC from advancing the refresh frontier past uncommitted change-buffer rows, adding configuration knobs, shared-state plumbing, scheduler integration, observability metrics, E2E coverage, and documentation.

Changes:

  • Add holdback mode + warning-threshold GUCs and implement per-tick safe watermark computation using xmin/2PC probes.
  • Persist per-tick holdback state/metrics in shared memory and expose new Prometheus gauges.
  • Add dedicated E2E tests plus architecture/troubleshooting documentation for the holdback behavior.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
src/config.rs Adds frontier_holdback_mode + frontier_holdback_warn_seconds GUCs and normalization/tests.
src/cdc.rs Adds classify_holdback() and SPI probe to compute safe LSN upper bound.
src/scheduler.rs Wires holdback-aware tick watermarks into coordinator + dynamic workers and adds warning rate-limiting.
src/shmem.rs Stores last-tick xmin + safe LSN and holdback gauge values in shmem.
src/version.rs Adds public LSN string ⇄ u64 helpers for shmem storage and computations.
src/monitor.rs Emits new Prometheus gauges for current holdback distance/age.
tests/e2e_long_txn_visibility_tests.rs Adds E2E scenarios for READ COMMITTED / REPEATABLE READ / 2PC, plus an ignored “unsafe mode” guard.
docs/ARCHITECTURE.md Documents the root cause and the holdback algorithm/config/metrics.
docs/TROUBLESHOOTING.md Adds operator workflow for diagnosing “stuck behind long transaction” behavior.
plans/safety/PLAN_FRONTIER_VISIBILITY_HOLDBACK.md Design/implementation plan updated to “Implemented” with acceptance criteria and rationale.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +38 to +46
async fn configure_fast_scheduler(db: &E2eDb) {
db.execute("ALTER SYSTEM SET pg_trickle.scheduler_interval_ms = 100")
.await;
db.execute("ALTER SYSTEM SET pg_trickle.min_schedule_seconds = 1")
.await;
db.execute("ALTER SYSTEM SET pg_trickle.auto_backoff = off")
.await;
db.reload_config_and_wait().await;
db.wait_for_setting("pg_trickle.scheduler_interval_ms", "100")
Copy link

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These E2E tests are intended to validate the trigger-based CDC visibility race (Issue #536), but they never force pg_trickle.cdc_mode = 'trigger'. In the E2E image wal_level = logical and cdc_mode defaults to auto, so the system may promote to WAL CDC (which is already immune) and make these tests pass without exercising the buggy trigger path. Set pg_trickle.cdc_mode to 'trigger' as part of the shared setup (and wait for it) so the tests actually cover the intended behavior.

Copilot uses AI. Check for mistakes.
Comment thread docs/ARCHITECTURE.md Outdated
Comment on lines +222 to +223
- `pgtrickle_frontier_holdback_lsn_bytes` — how many WAL bytes behind write_lsn the safe frontier currently is.
- `pgtrickle_frontier_holdback_seconds` — age (in seconds) of the oldest in-progress transaction.
Copy link

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docs list the new holdback metrics as pgtrickle_frontier_holdback_*, but the project’s Prometheus docs state all exported metrics are prefixed pg_trickle_ (see docs/integrations/prometheus.md). After renaming the metrics in the exporter output, update these names here as well so operators can find the right series.

Suggested change
- `pgtrickle_frontier_holdback_lsn_bytes` — how many WAL bytes behind write_lsn the safe frontier currently is.
- `pgtrickle_frontier_holdback_seconds` — age (in seconds) of the oldest in-progress transaction.
- `pg_trickle_frontier_holdback_lsn_bytes` — how many WAL bytes behind write_lsn the safe frontier currently is.
- `pg_trickle_frontier_holdback_seconds` — age (in seconds) of the oldest in-progress transaction.

Copilot uses AI. Check for mistakes.
Comment thread docs/TROUBLESHOOTING.md Outdated
**Symptoms:**
- A stream table's `data_timestamp` is not advancing even though the source
table is receiving new inserts.
- The `pgtrickle_frontier_holdback_lsn_bytes` Prometheus gauge is non-zero.
Copy link

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The metric name in Symptoms uses the pgtrickle_... prefix, but the Prometheus integration docs state exported metrics are prefixed pg_trickle_. Keep this consistent with the actual /metrics output by updating it to pg_trickle_frontier_holdback_lsn_bytes (and likewise for any other references in this section).

Suggested change
- The `pgtrickle_frontier_holdback_lsn_bytes` Prometheus gauge is non-zero.
- The `pg_trickle_frontier_holdback_lsn_bytes` Prometheus gauge is non-zero.

Copilot uses AI. Check for mistakes.
Comment on lines +165 to +166
- `pgtrickle_frontier_holdback_lsn_bytes` (gauge: how far behind write_lsn)
- `pgtrickle_frontier_holdback_seconds` (gauge: oldest in-progress txn age)
Copy link

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This plan references the holdback metrics with a pgtrickle_... prefix, but the repository’s Prometheus docs state all exported metrics are prefixed pg_trickle_. Once the metrics are renamed in code, update the metric names here to match so the plan remains accurate for operators.

Suggested change
- `pgtrickle_frontier_holdback_lsn_bytes` (gauge: how far behind write_lsn)
- `pgtrickle_frontier_holdback_seconds` (gauge: oldest in-progress txn age)
- `pg_trickle_frontier_holdback_lsn_bytes` (gauge: how far behind write_lsn)
- `pg_trickle_frontier_holdback_seconds` (gauge: oldest in-progress txn age)

Copilot uses AI. Check for mistakes.
Comment thread src/monitor.rs Outdated
Comment on lines +702 to +715
"# HELP pgtrickle_frontier_holdback_lsn_bytes \
How many WAL bytes behind the write LSN the safe frontier currently is (0 = no holdback)\n",
);
out.push_str("# TYPE pgtrickle_frontier_holdback_lsn_bytes gauge\n");
out.push_str(&format!(
"pgtrickle_frontier_holdback_lsn_bytes {holdback_lsn}\n"
));
out.push_str(
"# HELP pgtrickle_frontier_holdback_seconds \
Age in seconds of the oldest in-progress transaction causing a holdback (0 = no holdback)\n",
);
out.push_str("# TYPE pgtrickle_frontier_holdback_seconds gauge\n");
out.push_str(&format!(
"pgtrickle_frontier_holdback_seconds {holdback_age}\n"
Copy link

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new Prometheus metrics are emitted with the pgtrickle_... prefix, but the existing metrics in this endpoint use pg_trickle_... and the Prometheus integration docs state all metrics are prefixed pg_trickle_. To keep metric naming consistent (and avoid surprise dashboards/alerts), rename these to pg_trickle_frontier_holdback_lsn_bytes / pg_trickle_frontier_holdback_seconds (including HELP/TYPE lines).

Suggested change
"# HELP pgtrickle_frontier_holdback_lsn_bytes \
How many WAL bytes behind the write LSN the safe frontier currently is (0 = no holdback)\n",
);
out.push_str("# TYPE pgtrickle_frontier_holdback_lsn_bytes gauge\n");
out.push_str(&format!(
"pgtrickle_frontier_holdback_lsn_bytes {holdback_lsn}\n"
));
out.push_str(
"# HELP pgtrickle_frontier_holdback_seconds \
Age in seconds of the oldest in-progress transaction causing a holdback (0 = no holdback)\n",
);
out.push_str("# TYPE pgtrickle_frontier_holdback_seconds gauge\n");
out.push_str(&format!(
"pgtrickle_frontier_holdback_seconds {holdback_age}\n"
"# HELP pg_trickle_frontier_holdback_lsn_bytes \
How many WAL bytes behind the write LSN the safe frontier currently is (0 = no holdback)\n",
);
out.push_str("# TYPE pg_trickle_frontier_holdback_lsn_bytes gauge\n");
out.push_str(&format!(
"pg_trickle_frontier_holdback_lsn_bytes {holdback_lsn}\n"
));
out.push_str(
"# HELP pg_trickle_frontier_holdback_seconds \
Age in seconds of the oldest in-progress transaction causing a holdback (0 = no holdback)\n",
);
out.push_str("# TYPE pg_trickle_frontier_holdback_seconds gauge\n");
out.push_str(&format!(
"pg_trickle_frontier_holdback_seconds {holdback_age}\n"

Copilot uses AI. Check for mistakes.
Comment thread src/scheduler.rs Outdated
Comment on lines +798 to +809
// On probe failure, fall back to current write LSN.
log!(
"pg_trickle: holdback probe failed ({}); using raw write LSN",
e
);
let lsn =
Spi::get_one::<String>("SELECT pg_current_wal_lsn()::text").unwrap_or(None);
if let Some(ref l) = lsn {
shmem::set_last_tick_safe_lsn(version::lsn_to_u64(l));
}
shmem::update_holdback_metrics(0, 0);
(lsn, 0, 0)
Copy link

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On holdback probe failure, this falls back to the raw write LSN. That reintroduces the silent data-loss race the holdback is meant to prevent (a transient SPI failure during an in-flight long txn could advance the frontier past invisible rows). Prefer a fail-safe fallback that does not advance past the previous safe watermark (e.g., return prev_watermark_lsn when present, or disable watermark advancement for the tick) and keep the holdback metrics/warning accurate.

Suggested change
// On probe failure, fall back to current write LSN.
log!(
"pg_trickle: holdback probe failed ({}); using raw write LSN",
e
);
let lsn =
Spi::get_one::<String>("SELECT pg_current_wal_lsn()::text").unwrap_or(None);
if let Some(ref l) = lsn {
shmem::set_last_tick_safe_lsn(version::lsn_to_u64(l));
}
shmem::update_holdback_metrics(0, 0);
(lsn, 0, 0)
// On probe failure, fail safe: do not advance beyond the
// previous known-safe watermark. If none exists yet,
// disable watermark advancement for this tick.
if prev_watermark_lsn.is_some() {
log!(
"pg_trickle: holdback probe failed ({}); retaining previous safe watermark",
e
);
} else {
log!(
"pg_trickle: holdback probe failed ({}); disabling watermark advancement for this tick",
e
);
}
(prev_watermark_lsn.clone(), 0, 0)

Copilot uses AI. Check for mistakes.
Comment thread src/scheduler.rs Outdated
Comment on lines +2321 to +2322
// tick boundary. Reset to None on scheduler restart.
let mut prev_tick_watermark: Option<String> = None;
Copy link

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prev_tick_watermark is reset to None on scheduler restart, which means the first tick after a restart cannot hold back to a known-safe watermark. Combined with compute_safe_upper_bound() (which uses write_lsn when prev_watermark_lsn is None), this can still lose trigger-CDC rows from long-running transactions that are already open when the scheduler starts. Consider seeding prev_tick_watermark from shmem::last_tick_safe_lsn_u64() (when non-zero) so restarts preserve the last safe baseline.

Suggested change
// tick boundary. Reset to None on scheduler restart.
let mut prev_tick_watermark: Option<String> = None;
// tick boundary. Seed from shared memory on scheduler restart so the
// first post-restart tick preserves the last known-safe baseline.
let mut prev_tick_watermark: Option<String> = {
let last_safe_lsn = crate::shmem::last_tick_safe_lsn_u64();
if last_safe_lsn != 0 {
Some(pg_sys::XLogRecPtr::from(last_safe_lsn).to_string())
} else {
None
}
};

Copilot uses AI. Check for mistakes.
Comment thread src/shmem.rs Outdated
if !is_shmem_available() {
return;
}
// Update both fields atomically under the same lock.
Copy link

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment on this helper says it updates both xmin and LSN “atomically under the same lock”, but the function only writes last_tick_safe_lsn_u64. Either update the comment to match reality or remove this helper and funnel callers through set_last_tick_holdback_state() when atomicity across both fields is required.

Suggested change
// Update both fields atomically under the same lock.
// Update only the cached safe LSN under the shared-memory lock.

Copilot uses AI. Check for mistakes.
- scheduler: on holdback probe failure, hold at prev_watermark_lsn
  instead of advancing to raw write LSN (which is the unsafe behavior
  the holdback is meant to prevent); only fall back to write LSN on
  the very first tick where no previous watermark is known
- config: replace FrontierHoldbackMode::as_str() (which returned a
  static placeholder for LsnBytes) with display_string() that includes
  the actual byte count (e.g. "lsn:1048576")
- config: add FrontierHoldbackMode::InvalidLsn sentinel so that
  lsn:<invalid> emits pgrx::warning! at the accessor site and falls
  back to Xmin (previously silently produced LsnBytes(0))
- version: add round-trip unit tests for lsn_to_u64 / u64_to_lsn
  including known-value and format assertions
…eSQL

Add a one-time WARNING when the holdback probe cannot see any other
backend processes in pg_stat_activity (visible_other_backends = 0).
On managed services like RDS and Cloud SQL, pg_stat_activity is
restricted to the current role's own sessions, so the probe silently
returns min_xmin = 0 even when long-running transactions exist in
other sessions -- the same silent data-loss scenario the holdback was
designed to prevent.

Detection: extend the probe query with a subquery that counts all
pg_stat_activity rows excluding the bg-worker's own pid. A healthy
PostgreSQL server always has background processes visible
(checkpointer, autovacuum launcher, walsummarizer); 0 rows indicates
restricted access. The warning fires once per server process lifetime
via an AtomicBool and instructs operators to run:
  GRANT pg_monitor TO <pg_trickle_service_role>;

Also document the managed-service caveat and fix command in
TROUBLESHOOTING.md section 14, resolution step 5.
- monitor: rename Prometheus metrics pgtrickle_* -> pg_trickle_* to
  match the existing metric prefix used by all other gauges in the
  /metrics endpoint; update matching docs in ARCHITECTURE.md,
  TROUBLESHOOTING.md, and the plan document
- shmem: fix misleading comment on set_last_tick_safe_lsn() -- the
  function only updates last_tick_safe_lsn_u64, not both fields;
  point callers needing atomic dual-field update to
  set_last_tick_holdback_state() instead
- scheduler: seed prev_tick_watermark from shmem on startup so the
  first post-restart tick preserves the last known-safe LSN baseline;
  prevents a one-tick window where the frontier could advance past a
  long-running transaction already open before the scheduler restarted
- tests: force pg_trickle.cdc_mode = 'trigger' in configure_fast_scheduler
  so holdback E2E tests actually exercise the trigger-based CDC path;
  WAL CDC is immune to the race and would make tests pass trivially
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants