fix: coalesce ledger WAL flushes and extend to non-authority nodes #1218

Open
skylar-simoncelli wants to merge 3 commits into main from skylar/improve-ledger-wal-flush

@skylar-simoncelli
Contributor

Overview

Note: This PR is based on feat/ledger_enact_parity_db_logs and should be reviewed alongside that branch. It adds three improvements to the ledger parity-db WAL flush task.

Changes

1. Flush coalescing (prevents task queue buildup)

The current code in feat/ledger_enact_parity_db_logs spawns a new spawn_blocking task for every BlockOrigin::Own notification. If a flush takes longer than the 6-second block interval, tasks accumulate without bound.

Added an AtomicBool flag (flush_in_progress) — if a flush is already running when the next notification arrives, we skip rather than spawn another task. This guarantees at most one flush is running at any time.
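
The coalescing described above can be sketched roughly as follows. This is a minimal, std-only illustration and not the code in this PR: `try_spawn_flush` and `FlushGuard` are hypothetical names, and `std::thread::spawn` stands in for `spawn_blocking` so the example has no external dependencies. Only the `flush_in_progress` flag name comes from the description.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};

/// Hypothetical helper: clears the flag when dropped, so the flag is
/// released even if the flush closure panics.
struct FlushGuard(Arc<AtomicBool>);

impl Drop for FlushGuard {
    fn drop(&mut self) {
        self.0.store(false, Ordering::Release);
    }
}

/// Hypothetical helper: starts a flush unless one is already running.
/// Returns `None` when the notification is coalesced (skipped).
fn try_spawn_flush<F>(flush_in_progress: &Arc<AtomicBool>, flush: F) -> Option<JoinHandle<()>>
where
    F: FnOnce() + Send + 'static,
{
    // Atomically flip false -> true. This fails if another flush holds the flag.
    if flush_in_progress
        .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
        .is_err()
    {
        return None;
    }

    let guard = FlushGuard(Arc::clone(flush_in_progress));
    // Assumption: a plain thread stands in for the blocking task pool.
    Some(thread::spawn(move || {
        let _guard = guard; // dropped when the flush returns or panics
        flush();
    }))
}
```

Claiming the flag with a single `compare_exchange`, rather than a separate load and store, means two notifications arriving close together cannot both start a flush. A skipped notification is not queued; the next notification after the running flush finishes starts a new one.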

2. Non-authority node coverage

The current code only flushes on BlockOrigin::Own, which means non-authority nodes (RPCs, bootnodes, bridges, semi-trusted RPCs) never trigger a flush. Their ledger WAL grows until the 64 MB threshold causes a synchronous stall, which can cause:

  • RPC request timeouts
  • Peer disconnections when the node appears unresponsive

Non-authority nodes now flush every 50 imported blocks — frequent enough to prevent WAL buildup, infrequent enough to avoid excessive I/O.
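
As a rough sketch of the interval logic, assuming a simple counter over import notifications: the `Origin` enum, `FlushPolicy` type, and constant name below are hypothetical stand-ins, not identifiers from this PR. The description does not say how an authority node treats blocks it imports from peers, so counting them here is an assumption.

```rust
/// Hypothetical stand-in for the origin carried by a block import notification.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Origin {
    /// Block authored by this node.
    Own,
    /// Block received from the network.
    Imported,
}

/// Interval taken from the PR description: flush every 50 imported blocks.
const IMPORTED_BLOCKS_PER_FLUSH: u64 = 50;

/// Hypothetical policy object deciding whether a notification triggers a flush.
struct FlushPolicy {
    imported_since_flush: u64,
}

impl FlushPolicy {
    fn new() -> Self {
        Self { imported_since_flush: 0 }
    }

    /// Returns true when this notification should trigger a WAL flush.
    fn should_flush(&mut self, origin: Origin) -> bool {
        match origin {
            // Authored blocks keep the existing behavior: flush on every one.
            Origin::Own => true,
            // Assumption: every non-own import counts toward the interval.
            Origin::Imported => {
                self.imported_since_flush += 1;
                if self.imported_since_flush >= IMPORTED_BLOCKS_PER_FLUSH {
                    self.imported_since_flush = 0;
                    true
                } else {
                    false
                }
            }
        }
    }
}
```

With the 6-second block interval mentioned above, 50 blocks works out to one flush roughly every 5 minutes on a node that only imports.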

3. Task moved outside authority gate

The flush task is no longer inside if role.is_authority(), so it runs on all node types.

Relationship to other PRs

Together these two fixes address the full parity-db WAL problem:

Context

Discovered during investigation of chain-state truncation after unclean shutdown (#1140). While testing on guardnet, we measured ~9,000-10,000 blocks of metadata held only in the WAL on every node at any given time, confirming the WAL accumulation problem.

📌 Submission Checklist

  • Changes are backward-compatible (or flagged if breaking)
  • Pull request description explains why the change is needed
  • Self-reviewed the diff
  • I have included a change file, or skipped for this reason: improvement to unreleased feature branch
  • If the changes introduce a new feature, I have bumped the node minor version
  • No new todos introduced

🧪 Testing Evidence

Logic-only change to an unreleased feature. The coalescing and interval flush are safe additive behaviors on top of the existing flush mechanism.

🔱 Fork Strategy

  • Node Client Update

Klapeyron and others added 2 commits March 30, 2026 10:04
Three improvements to the ledger parity-db WAL flush task:

1. Add flush coalescing via AtomicBool — if a flush is already in
   progress when the next block notification arrives, skip rather
   than spawning another spawn_blocking task. Prevents unbounded
   task queue buildup when flush duration exceeds the 6-second
   block interval.

2. Extend WAL flushing to non-authority nodes (RPCs, bootnodes,
   bridges). These nodes never author blocks so BlockOrigin::Own
   never matches, leaving their WAL to grow until the 64 MB
   threshold causes a synchronous stall. Non-authority nodes now
   flush every 50 imported blocks.

3. Move the flush task outside the `if role.is_authority()` block
   so it runs for all node types.