Description
Hi pg-trickle team! Thank you for this amazing project. We've been deeply inspired by the Differential Dataflow architecture.
However, when evaluating the trigger-based CDC logic (src/cdc.rs) under extreme high-concurrency OLTP workloads, we identified two critical architectural flaws caused by the inherent limitations of PostgreSQL's extension/trigger mechanisms. We'd like to share these findings and our kernel-level (Executor Hook) approach for discussion.
Issue 1: The "Long Transaction LSN Black Hole" (Silent Data Loss)
In trigger mode, the buffer table relies on pg_current_wal_insert_lsn(), and the scheduler filters incremental data purely by LSN boundaries (lsn > prev_lsn AND lsn <= tick_watermark), bypassing XID snapshot visibility checks for performance.
The Anomaly:
- A transaction updates a row at $T_1$, acquiring an early LSN=100, but remains uncommitted (Long Transaction).
- The scheduler wakes up at $T_2$, setting the upper boundary to LSN=500. It skips the LSN=100 row because it's invisible under MVCC. The scheduler then advances the lower safety bound (prev_lsn) to 500.
- The long transaction finally commits at $T_3$.
- In the next refresh, the scheduler scans WHERE lsn > 500. The now-committed row (LSN=100) falls below the lower bound and is permanently skipped, resulting in silent delta data loss.
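The timeline above can be reproduced with a minimal Python simulation (the `Row`/`refresh` names are illustrative, not pg-trickle's actual types): the scan window sees only MVCC-visible rows, but the lower bound advances unconditionally, so the late-committing row is stranded below prev_lsn.

```python
# Minimal simulation of the "Long Transaction LSN Black Hole".
# Row/refresh are illustrative names, not pg-trickle's API.

class Row:
    def __init__(self, lsn, commit_time):
        self.lsn = lsn
        self.commit_time = commit_time  # None = still uncommitted

def visible(row, now):
    # MVCC: a row is visible only once its transaction has committed.
    return row.commit_time is not None and row.commit_time <= now

def refresh(buffer, prev_lsn, watermark, now):
    # LSN-window scan, as in trigger mode: lsn > prev_lsn AND lsn <= watermark
    return [r.lsn for r in buffer
            if prev_lsn < r.lsn <= watermark and visible(r, now)]

buffer = [Row(lsn=100, commit_time=3),   # long transaction: commits at T3
          Row(lsn=200, commit_time=1)]   # short transaction: already committed

# T2: scheduler tick with watermark 500; LSN=100 is invisible and skipped.
seen_t2 = refresh(buffer, prev_lsn=0, watermark=500, now=2)
# prev_lsn now advances to 500 regardless.

# T4: next tick, after the long transaction committed at T3.
seen_t4 = refresh(buffer, prev_lsn=500, watermark=900, now=4)

# LSN=100 is committed but sits below prev_lsn: silently lost forever.
lost = [r.lsn for r in buffer if visible(r, 4) and r.lsn not in seen_t2 + seen_t4]
print(seen_t2, seen_t4, lost)  # [200] [] [100]
```

The simulation makes the failure mode explicit: no single tick misbehaves, yet the row at LSN=100 is never emitted.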
Issue 2: Sequence Cache Inversion vs. Global Lock Bottleneck
Because statement-level triggers can fetch identical LSNs for every row of a bulk update, pg-trickle relies on a change_id BIGSERIAL column to break ties and micro-order rows within a statement (e.g., selecting rn_asc=1 and rn_desc=1 during compaction).
The Anomaly: To prevent the notorious Sequence Cache Inversion (where Tx2 commits before Tx1, but Tx1 had previously cached a smaller sequence number, so stale state overwrites newer state during compaction), BIGSERIAL must rely on PostgreSQL's default CACHE 1 behavior. Consequently, any high-concurrency DML on base tables is severely throttled by synchronous contention on the global sequence_rel LWLock during MLOG insertions, making a TP performance bottleneck inevitable.
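A small simulation of the inversion, assuming CACHE 32 (the `CachedSequence` class is our illustrative stand-in for a per-backend sequence cache, not a pg-trickle type): a backend that pre-cached a block hands out a small change_id for a change that commits last, so compaction by max(change_id) keeps stale state.

```python
# Sketch of Sequence Cache Inversion with a cached BIGSERIAL (CACHE 32).
import itertools

class CachedSequence:
    """Per-backend cached sequence, like BIGSERIAL with CACHE > 1."""
    def __init__(self, cache=32):
        self._global = itertools.count(1)
        self._cache = cache
        self._blocks = {}  # backend -> iterator over its cached block

    def nextval(self, backend):
        block = self._blocks.get(backend)
        if block is None:
            start = next(self._global)
            for _ in range(self._cache - 1):  # reserve the rest of the block
                next(self._global)
            block = iter(range(start, start + self._cache))
            self._blocks[backend] = block
        return next(block)

seq = CachedSequence(cache=32)
a1 = seq.nextval("A")   # 1  -- backend A caches block 1..32
b1 = seq.nextval("B")   # 33 -- backend B caches block 33..64
a2 = seq.nextval("A")   # 2  -- A reuses its cached block for a LATER change

# Commit order: "v1" (a1), then "v2" (b1), then "v3" (a2, the true latest).
mlog = [(a1, "v1"), (b1, "v2"), (a2, "v3")]
latest = max(mlog)[1]
print(latest)  # "v2" -- stale: a2 == 2 < b1 == 33, even though "v3"
               # committed last. CACHE 1 would have handed A a fresh,
               # larger value and preserved chronological order.
```

With CACHE 1 every nextval hits the shared sequence, which restores ordering but serializes all concurrent MLOG inserts on that lock, which is exactly the throughput trade-off described above.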
Our Proposed Solution (Kernel Executor Hook Approach)
In our internal implementation build (where we have the privilege to modify the PG kernel), we completely abandoned the trigger + BIGSERIAL pattern to sidestep both problems:
- Abandoning Statement Triggers for Executor Hooks: Instead of SQL-level triggers, we intercept data at the Executor node (ExecInsert/ExecUpdate). We capture the precise, unique physical LSN of the generated WAL record for the specific HeapTuple.
- Abandoning Sequences for Physical LSN Ordering: Since the physical tuple WAL LSN guarantees absolute, monotonically increasing uniqueness even within the same bulk statement, we use it natively to order row-level mutations, safely abandoning BIGSERIAL and its CACHE 1 lock contention.
- Reintroducing XID for Visibility: To patch the LSN black hole, we log the txid and rely strictly on MVCC snapshots (with plans to migrate to a global CSN mapping in the future) to filter uncommitted data, ensuring no delta sets leak past the scheduler's threshold.
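The combination can be sketched in a few lines of Python. This is a hedged model, not our actual kernel code: `safe_watermark`, `refresh`, and the in-progress-xid map are illustrative names, and the assumption that the watermark is held below the oldest LSN of any in-progress transaction is our reading of how the xid filter must interact with the lower bound to avoid the black hole.

```python
# Illustrative model of the Executor-Hook design: each change carries its
# physical WAL LSN (unique, monotonic -- no BIGSERIAL needed) plus its xid.
# The watermark never advances past the first LSN written by an
# in-progress transaction, so a late commit can never fall below prev_lsn.

def safe_watermark(current_lsn, in_progress):
    # in_progress: {xid: first LSN written by that xid}
    return min(in_progress.values(), default=current_lsn + 1) - 1

def refresh(changes, committed, prev_lsn, watermark):
    # Order strictly by physical LSN; filter uncommitted rows via the
    # xid snapshot instead of silently dropping them.
    return sorted((c["lsn"], c["xid"]) for c in changes
                  if prev_lsn < c["lsn"] <= watermark and c["xid"] in committed)

changes = [{"lsn": 100, "xid": 7},   # long transaction, commits late
           {"lsn": 200, "xid": 8}]   # already committed

# Tick 1: xid 7 is still in progress, so the watermark stops below LSN 100.
wm1 = safe_watermark(current_lsn=500, in_progress={7: 100})   # 99
batch1 = refresh(changes, committed={8}, prev_lsn=0, watermark=wm1)   # []

# Tick 2: xid 7 has committed; nothing is in progress any more.
wm2 = safe_watermark(current_lsn=501, in_progress={})         # 501
batch2 = refresh(changes, committed={7, 8}, prev_lsn=wm1, watermark=wm2)
print(batch2)  # [(100, 7), (200, 8)] -- delayed, never lost
```

Unlike the trigger-mode scenario in Issue 1, the long transaction's delta is merely deferred by one tick rather than stranded below prev_lsn.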
Finally, we completely agree that promoting WAL logical decoding to the primary production CDC backend is the right endgame for resolving these synchronization paradoxes.
Would love to hear your thoughts on these edge cases and whether there are any interim safeguards planned for the trigger mode!