
fix: parachain warm-restart hang #3210

Closed
replghost wants to merge 12 commits into main from fix/parachain-warm-start

Conversation

@replghost (Contributor) commented Apr 21, 2026

Summary

Fix parachain warm-restart hang on fetch_parachain_head_from_relay, without regressing cold start, by moving the warp-sync wait into runtime_service and emitting a synthetic Finalized event on each new subscription.

Problem

fetch_parachain_head_from_relay() ignored the finalized header already returned by subscribe_all() and awaited a new Finalized notification via subscription.new_blocks.next().await. On warm restart the relay is already synced, so nothing arrives until the next GRANDPA round (~6s on Polkadot, longer with slow peers), and parachain sync hangs at "Waiting for relay chain to finalize a block..."
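
The pre-fix wait, sketched (a simplification; notification variants other than Finalized and the resubscribe path are elided):

let mut subscription = relay_chain_sync
    .subscribe_all(32, NonZero::<usize>::new(usize::MAX).unwrap())
    .await;
// subscription.finalized_block_scale_encoded_header is already usable here,
// but was ignored; instead the task parked below until a brand-new
// Finalized notification arrived.
let _finalized_hash = loop {
    if let Some(Notification::Finalized { hash, .. }) =
        subscription.new_blocks.next().await
    {
        break hash;
    }
};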

Fix

In runtime_service:

  1. Gate subscribe_all on warp-sync completion. Every subscription it returns has a post-warp finalized header.
  2. On each new subscription, push a synthetic Notification::Finalized for the current finalized block as the first stream event.

Consumers waiting on Notification::Finalized (e.g. the parachain task) now receive one immediately.
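
In sketch form, with hypothetical simplified types standing in for the real runtime_service plumbing (Subscription, Notification, and the accessors here are assumptions, not the actual API):

use futures::channel::mpsc;

pub async fn subscribe_all(&self, buffer_size: usize) -> Subscription {
    // (1) Gate: no subscription is handed out until warp sync completes,
    //     so finalized_block_scale_encoded_header is always post-warp.
    self.sync_service.wait_warp_sync_finished().await;

    let (mut tx, rx) = mpsc::channel(buffer_size);
    let finalized = self.current_finalized_block(); // hypothetical accessor

    // (2) Synthetic first event: consumers waiting on Notification::Finalized
    //     (e.g. the parachain task) wake immediately instead of idling until
    //     the next GRANDPA round.
    let _ = tx.try_send(Notification::Finalized { hash: finalized.hash });

    Subscription {
        finalized_block_scale_encoded_header: finalized.scale_encoded_header,
        new_blocks: rx,
    }
}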

API change

New public method on SyncService:

pub async fn wait_warp_sync_finished(&self);
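
Callers await it before subscribing; simplified from the parachain call site quoted in the review threads below:

relay_chain_sync.wait_warp_sync_finished().await;
let mut subscription = relay_chain_sync
    .subscribe_all(32, NonZero::<usize>::new(usize::MAX).unwrap())
    .await;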

Smoke test

e2e-tests/tests/smoke.rs (added in #3234) didn't exercise the warp-sync path. On a freshly spawned zombienet, peers report a finalized height below smoldot's warp_sync_minimum_gap = 32, so all-forks catches up via normal GRANDPA finality before warp sync can engage, and wait_warp_sync_finished() never resolves.

Updated the test to wait for the relay to finalize 50 blocks past baseline before launching smoldot, so warp sync has a real target and WarpSyncFinished fires.
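
The gate amounts to a sketch like this (finalized_metric is a hypothetical helper that reads FINALIZED_METRIC from validator-0's Prometheus endpoint; the real test wires this through zombienet-sdk):

let baseline = finalized_metric(&validator_0).await?;
let target = baseline + 50; // comfortably above warp_sync_minimum_gap = 32
while finalized_metric(&validator_0).await? < target {
    tokio::time::sleep(std::time::Duration::from_secs(3)).await;
}
// Only now launch smoldot, so warp sync has a real gap to close.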

Credits

  • @replghost — original investigation, warm-restart fix, refactor (try_fetch_parachain_head / wait_for_finalized_hash)
  • @lrubasze — cold-start staleness diagnosis, wait_warp_sync_finished() gating

On warm restart from databaseContent, the relay chain may already be
synced. fetch_parachain_head_from_relay() was waiting for a NEW
Notification::Finalized event from subscribe_all(), which might not
arrive for seconds (or indefinitely if the relay sync stalls).

The fix: try the already-finalized block from subscribe_all immediately
before waiting for new notifications. This is the block that's already
available in subscription.finalized_block_scale_encoded_header.

Before: parachain warm restart NEVER initialized (>5min timeout)
After:  parachain warm restart initializes in ~3s

The runtime hint verification in bootstrap_parachain_consensus already
handles reusing the cached runtime from databaseContent — it verifies
the merkle value and skips the ~2MB download when it matches.
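
Schematically (helper names here are assumptions; only the merkle-value comparison is from the description above):

// Reuse the cached runtime from databaseContent only when its :code merkle
// value still matches the finalized head; otherwise fall back to the full
// download.
let live = fetch_code_merkle_value(&finalized_head).await?;
let runtime = if cached_runtime.code_merkle_value == live {
    cached_runtime // warm restart: the ~2MB download is skipped
} else {
    download_runtime(&finalized_head).await?
};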

Fixes #3204.
log!(
    platform,
    Info,
    log_target,
    "Waiting for relay chain to finalize a block..."
);
Contributor:

This should be moved to the else branch, where subscription.next is polled.

@replghost (Contributor, Author):

Done — the log now only fires in wait_for_finalized_hash(), which is only called when the initial attempt (using the already-available finalized block) fails. The function was also restructured: try_fetch_parachain_head() attempts the fetch and returns an Option, the main loop tries the initial finalized block first, and only falls through to wait_for_finalized_hash() on failure.
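
Roughly, as a sketch (the signatures of the two helpers named above are assumed):

let finalized_hash = header::hash_from_scale_encoded_header(
    &subscription.finalized_block_scale_encoded_header,
);
// Fast path: the post-warp finalized block is already in hand.
if let Some(head) = try_fetch_parachain_head(finalized_hash).await {
    return head;
}
// Slow path: log "Waiting for relay chain to finalize a block..." and wait
// for a fresh Finalized notification.
let hash = wait_for_finalized_hash(&mut subscription).await;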

Comment thread light-base/src/sync_service/parachain.rs Outdated
Copilot AI left a comment:

Pull request overview

Fixes parachain warm-restart hangs by using the already-known finalized relay-chain block returned by subscribe_all() to fetch the parachain head immediately, instead of always waiting for a new Finalized notification (which may not arrive promptly when the relay chain is already synced).

Changes:

  • Attempt fetching the parachain head using subscription.finalized_block_scale_encoded_header first.
  • Fall back to waiting for new Finalized notifications as before.


Comment thread light-base/src/sync_service/parachain.rs
@replghost changed the title from "fix(sync-service): use initial finalized block for parachain head fetch" to "fix(sync-service): try initial finalized block before waiting for new notifications" on Apr 22, 2026
@replghost (Contributor, Author):

@lexnv friendly ping — I've addressed your review comments, would you mind taking another look when you get a chance?

…d block

On cold start, `try_initial_finalized` could resolve with the chain-spec
checkpoint as "finalized" before warp sync promoted it to the warp target.
When the checkpoint lags the head (e.g. Paseo, ~70k relay blocks, as of now), the
parachain bootstraps from a months-stale head and then chases ~200k
para-blocks of finality, exceeding the 120s benchmark timeout.

Gate on a new wait_warp_sync_finished() signal; warm restart keeps its
zero-wait path.
@lrubasze (Contributor) commented Apr 28, 2026

Benchmark results

Paseo + Asset Hub Paseo, time-to-finalized, n=10 iterations.

Mode | Baseline v3.1.1 (median / p95 / max) | PR #3210 (median / p95 / max) | Speedup (median / p95)
Cold | 16.5s / 50.0s / 70.0s                | 5.5s / 11.2s / 12.5s          | 3.0× / 4.5×
Warm | 17.3s / 43.5s / 50.3s                | 4.2s / 4.7s / 4.8s            | 4.1× / 9.3×

Highlights:

  • Warm-start variance collapses: stddev 13.4s → 0.7s. The hang-until-next-GrandPa-round behavior is gone — every run is now fast.
  • Cold start is 3× faster median, 4.5× at p95, no runs near the 120s timeout (worst 12.5s vs baseline 70s).

benchmarked with: #3233

`client.terminate()` waits on the WASM executor-shutdown event, which
can park for several minutes on a multi-chain client. Skip it and exit
with the assertion outcome — there's nothing to clean up beyond the
process lifetime in CI.
Smoldot's parachain sync gates on the relay chain's warp-sync phase
finishing. On a fresh zombienet from genesis, warp sync only engages
once peers report finalized > minimum_gap (32). Wait for the relay to
finalize 50 blocks past baseline (above the gap with headroom) on
validator-0 before launching smoldot, so warp sync has a real target
to chase. Without this gating, warp sync never starts and the
parachain's chainHead subscription never delivers blocks.

Also add log::info! step markers (network up, alice blocks, relay
finalized target/baseline, JS test launch) to make progress and
parameter values visible in test output.
Move ensure_smoldot_built() + ensure_js_deps_installed() to right after
network.detach() — zombienet runs as detached subprocesses so the build
overlaps with relay block production, shaving ~55 s off wall time.

Reorder the alice ≥ REQUIRED_BLOCKS check to run after the relay-finalized
wait. By that point the relay has finalized 50 blocks past baseline so
alice has had plenty of time to produce; the check is now a fast sanity
probe with a 60 s timeout instead of 300 s.
- add BEST_METRIC const for parity with FINALIZED_METRIC
- bind relay_spec_str / para_spec_str once instead of repeating
  to_str().expect("UTF-8 path")
- pass &base_dir_str directly to with_base_dir
- drop redundant "Sanity check" comment that the log line above
  already conveys
`is_warp_sync_finished()` now also returns true once all-forks
finalizes past the chain-spec starting block, so the gate doesn't
deadlock on networks where warp sync never engages (local zombienet
from genesis, warm restart at head). Drain pending waiters in the
`NewFinalized` arm so already-queued waits actually wake. Paseo
cold-start path is unchanged.
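
The drain amounts to something like this (a sketch assuming oneshot-based waiters; names are assumptions and the real bookkeeping lives in the sync_service background task):

// Hypothetical helper called from the `NewFinalized` arm of the event loop.
// `waiters` holds oneshot senders queued by wait_warp_sync_finished().
fn wake_warp_sync_waiters(
    waiters: &mut Vec<futures::channel::oneshot::Sender<()>>,
    warp_sync_finished: bool,
) {
    if warp_sync_finished {
        // Without this drain, waits queued before finality passed the
        // starting block would never resolve.
        for waiter in waiters.drain(..) {
            let _ = waiter.send(());
        }
    }
}
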
Comment thread lib/src/sync/all.rs
Comment on lines +306 to +310
if self.warp_sync.is_none() {
    true
} else {
    self.finalized_block_number() > self.shared.starting_block_number
}
Member:

This is clearly a bug. Even if we did not use warp_sync for whatever reason, it should be set to None. We should not need this weird check here.

Contributor:

This one we should not do for now; last time this caused a lot of problems. I propose not touching warp_sync for now.

Contributor:

This is to address edge cases for warm start or a short local zombienet, where warp-sync proofs don't arrive and the warp-sync machine never reaches WarpSyncFinished.
After recent struggles (e.g. panics) from touching warp_sync, I was not very keen to touch it again.

Comment on lines +1130 to +1140
relay_chain_sync.wait_warp_sync_finished().await;
log!(
    platform,
    Debug,
    log_target,
    "Relay chain warp sync finished."
);

let mut subscription = relay_chain_sync
    .subscribe_all(32, NonZero::<usize>::new(usize::MAX).unwrap())
    .await;
Member:

If we just ensured that a finalized event is sent after warp sync finishes, we wouldn't need any of the changes here?

Contributor:

Yeah this sounds like a good idea.

Contributor:

Yes, that would basically mean that subscribe_all itself ensures warp sync is finished before returning.
Let me work on this.

Contributor:

runtime_service has been refactored so that subscribe_all() returns only after warp sync is finished.
Applied some changes here too, to make sure that subscription.finalized_block_scale_encoded_header is used first, so we don't need to wait for Finalized, which can take some time.

@lrubasze (Contributor) commented May 4, 2026:

To shed more light on the refactor here.
runtime_service tracks every block received in an internal tree. To emit a runtime_service::Notification::Finalized event for a block, it looks the block hash up in that tree. The blocks are provided by sync_service via sync_service::Notification::Finalized events, which does not happen for warp-synced blocks. If sync_service sent a Finalized event for a warp-synced block, runtime_service wouldn't find the block in the tree and would, by design, panic (lines 2613, 2621):

WakeUpReason::Notification(sync_service::Notification::Finalized {
    hash,
    best_block_hash_if_changed,
    ..
}) => {
    // Sync service has reported a finalized block.
    log!(
        &background.platform,
        Trace,
        &background.log_target,
        "input-chain-finalized",
        block_hash = HashDisplay(&hash),
        best_block_hash = if let Some(best_block_hash) = best_block_hash_if_changed {
            Cow::Owned(HashDisplay(&best_block_hash).to_string())
        } else {
            Cow::Borrowed("<unchanged>")
        }
    );

    if let Some(best_block_hash) = best_block_hash_if_changed {
        match &mut background.tree {
            Tree::FinalizedBlockRuntimeKnown { tree, .. } => {
                let new_best_block = tree
                    .input_output_iter_unordered()
                    .find(|block| block.user_data.hash == best_block_hash)
                    .unwrap()
                    .id;
                tree.input_set_best_block(Some(new_best_block));
            }
            Tree::FinalizedBlockRuntimeUnknown { tree, .. } => {
                let new_best_block = tree
                    .input_output_iter_unordered()
                    .find(|block| block.user_data.hash == best_block_hash)
                    .unwrap()
                    .id;
                tree.input_set_best_block(Some(new_best_block));
            }
        }
    }

    match &mut background.tree {
        Tree::FinalizedBlockRuntimeKnown {
            tree,
            finalized_block,
            ..
        } => {
            debug_assert_ne!(finalized_block.hash, hash);
            let node_to_finalize = tree
                .input_output_iter_unordered()
                .find(|block| block.user_data.hash == hash)
                .unwrap()
                .id;
            tree.input_finalize(node_to_finalize);
        }
        Tree::FinalizedBlockRuntimeUnknown { tree, .. } => {
            let node_to_finalize = tree
                .input_output_iter_unordered()
                .find(|block| block.user_data.hash == hash)
                .unwrap()
                .id;
            tree.input_finalize(node_to_finalize);
        }
    }
}

Instead of sending the event, sync_service kills the subscription after warp sync finishes. Subscribers reconnect and get a fresh tree starting from the post-warp head (subscription.finalized_block_scale_encoded_header), and they receive finalized blocks as they are synced.

We assumed that faking a Finalized event for the warp-sync block would be less clean than simply using subscription.finalized_block_scale_encoded_header, especially once warp sync is guaranteed to be finished before subscribe_all() returns.
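
From the subscriber's side the pattern is roughly (a sketch with a simplified API; handling of the individual notification variants is elided):

loop {
    let mut sub = service
        .subscribe_all(32, NonZero::<usize>::new(usize::MAX).unwrap())
        .await;
    // Each fresh subscription starts a new tree at the post-warp head.
    let _head = &sub.finalized_block_scale_encoded_header;
    while let Some(notification) = sub.new_blocks.next().await {
        // ...handle Finalized / new-block notifications...
        let _ = notification;
    }
    // Stream ended: the service killed the subscription (e.g. at warp-sync
    // completion); loop around and resubscribe.
}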

Member:

IMO this is still too complicated and I think something like this should be enough to achieve the same.

Contributor:

Refactored it in the suggested manner, by sending a "synthetic" Finalized upon each new subscription.

It was not possible to do this in sync_service via sync_service::Notification::Finalized (through light-base/src/sync_service/substrate_compat.rs) as suggested here, because it caused the panic described above.

Member:

Can we not just announce the block, directly followed by the finalized notification? That should be possible?

Contributor:

Check #3246

) => {
    // Paraheads doesn't run a warp-sync phase of its own; delegate to the relay's
    // sync service.
    self.relay_chain_sync.wait_warp_sync_finished().await;
Member:

You are blocking the entire sync service here.

Reply:

Why so? The function sync_service::parachain::start_parachain performs multiple steps; the first is to call fetch_parachain_head_from_relay, where the "real" relay_chain_sync.wait_warp_sync_finished() await happens, and only after that are all the paraheads spawned. So no real lock should happen: as long as the relay-chain task keeps working, this call should return immediately, because warp sync has already finished.

The relay task doesn't seem to depend on any parachain/parahead logic, so no deadlock seems possible.

This seems to be more of an adaptation to the new ToBackground variant, which is only used by the relay-chain task.

Contributor:

Yes, this is dead code, just there to support the new ToBackground variant; Paraheads is started only after warp sync has already finished. I should have left a comment.
But I agree this is a code smell, so given the above maybe we could omit this await call entirely:

Suggested change:

-    self.relay_chain_sync.wait_warp_sync_finished().await;
+    // Paraheads is spawned only after the relay's warp sync has finished,
+    // so this is always already done.

@lrubasze changed the title from "fix(sync-service): try initial finalized block before waiting for new notifications" to "fix: parachain warm-restart hang" on May 5, 2026
@lrubasze (Contributor) commented May 7, 2026

Closing in favor of #3246

@lrubasze lrubasze closed this May 7, 2026
