
fix: parachain warm-restart hang #3210

Closed
replghost wants to merge 12 commits into main from fix/parachain-warm-start

Conversation

@replghost (Contributor) commented Apr 21, 2026

Summary

Fix parachain warm-restart hang on fetch_parachain_head_from_relay, without regressing cold start, by moving the warp-sync wait into runtime_service and emitting a synthetic Finalized event on each new subscription.

Problem

fetch_parachain_head_from_relay() ignored the finalized header already returned by subscribe_all() and awaited a new Finalized notification via subscription.new_blocks.next().await. On warm restart the relay is already synced, so nothing arrives until the next GRANDPA round (~6s on Polkadot, longer with slow peers), and parachain sync hangs at "Waiting for relay chain to finalize a block..."
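
The pre-fix wait, sketched (a simplification; notification variants other than Finalized and the resubscribe path are elided):

let mut subscription = relay_chain_sync
    .subscribe_all(32, NonZero::<usize>::new(usize::MAX).unwrap())
    .await;
// subscription.finalized_block_scale_encoded_header is already usable here,
// but was ignored; instead the task parked below until a brand-new
// Finalized notification arrived.
let _finalized_hash = loop {
    if let Some(Notification::Finalized { hash, .. }) =
        subscription.new_blocks.next().await
    {
        break hash;
    }
};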

Fix

In runtime_service:

  1. Gate subscribe_all on warp-sync completion. Every subscription it returns has a post-warp finalized header.
  2. On each new subscription, push a synthetic Notification::Finalized for the current finalized block as the first stream event.

Consumers waiting on Notification::Finalized (e.g. the parachain task) now receive one immediately.
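
In sketch form, with hypothetical simplified types standing in for the real runtime_service plumbing (Subscription, Notification, and the accessors here are assumptions, not the actual API):

use futures::channel::mpsc;

pub async fn subscribe_all(&self, buffer_size: usize) -> Subscription {
    // (1) Gate: no subscription is handed out until warp sync completes,
    //     so finalized_block_scale_encoded_header is always post-warp.
    self.sync_service.wait_warp_sync_finished().await;

    let (mut tx, rx) = mpsc::channel(buffer_size);
    let finalized = self.current_finalized_block(); // hypothetical accessor

    // (2) Synthetic first event: consumers waiting on Notification::Finalized
    //     (e.g. the parachain task) wake immediately instead of idling until
    //     the next GRANDPA round.
    let _ = tx.try_send(Notification::Finalized { hash: finalized.hash });

    Subscription {
        finalized_block_scale_encoded_header: finalized.scale_encoded_header,
        new_blocks: rx,
    }
}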

API change

New public method on SyncService:

pub async fn wait_warp_sync_finished(&self);
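
Callers await it before subscribing; simplified from the parachain call site quoted in the review threads below:

relay_chain_sync.wait_warp_sync_finished().await;
let mut subscription = relay_chain_sync
    .subscribe_all(32, NonZero::<usize>::new(usize::MAX).unwrap())
    .await;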

Smoke test

e2e-tests/tests/smoke.rs (added in #3234) didn't exercise the warp-sync path. On a freshly spawned zombienet, peers report a finalized height below smoldot's warp_sync_minimum_gap = 32, so all-forks catches up via normal GRANDPA finality before warp sync can engage, and wait_warp_sync_finished() never resolves.

Updated the test to wait for the relay to finalize 50 blocks past baseline before launching smoldot, so warp sync has a real target and WarpSyncFinished fires.
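
The gate amounts to a sketch like this (finalized_metric is a hypothetical helper that reads FINALIZED_METRIC from validator-0's Prometheus endpoint; the real test wires this through zombienet-sdk):

let baseline = finalized_metric(&validator_0).await?;
let target = baseline + 50; // comfortably above warp_sync_minimum_gap = 32
while finalized_metric(&validator_0).await? < target {
    tokio::time::sleep(std::time::Duration::from_secs(3)).await;
}
// Only now launch smoldot, so warp sync has a real gap to close.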

Credits

  • @replghost — original investigation, warm-restart fix, refactor (try_fetch_parachain_head / wait_for_finalized_hash)
  • @lrubasze — cold-start staleness diagnosis, wait_warp_sync_finished() gating

On warm restart from databaseContent, the relay chain may already be
synced. fetch_parachain_head_from_relay() was waiting for a NEW
Notification::Finalized event from subscribe_all(), which might not
arrive for seconds (or indefinitely if the relay sync stalls).

The fix: try the already-finalized block from subscribe_all immediately
before waiting for new notifications. This is the block that's already
available in subscription.finalized_block_scale_encoded_header.

Before: parachain warm restart NEVER initialized (>5min timeout)
After:  parachain warm restart initializes in ~3s

The runtime hint verification in bootstrap_parachain_consensus already
handles reusing the cached runtime from databaseContent — it verifies
the merkle value and skips the ~2MB download when it matches.
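
Schematically (helper names here are assumptions; only the merkle-value comparison is from the description above):

// Reuse the cached runtime from databaseContent only when its :code merkle
// value still matches the finalized head; otherwise fall back to the full
// download.
let live = fetch_code_merkle_value(&finalized_head).await?;
let runtime = if cached_runtime.code_merkle_value == live {
    cached_runtime // warm restart: the ~2MB download is skipped
} else {
    download_runtime(&finalized_head).await?
};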

Fixes #3204.
log!(
    platform,
    Info,
    log_target,
    "Waiting for relay chain to finalize a block..."
);
Contributor:

This should be moved to the else branch, where subscription.next is polled.

@replghost (Contributor, Author):

Done — the log now only fires in wait_for_finalized_hash(), which is only called when the initial attempt (using the already-available finalized block) fails. The function was also restructured: try_fetch_parachain_head() attempts the fetch and returns an Option, the main loop tries the initial finalized block first, and only falls through to wait_for_finalized_hash() on failure.
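
Roughly, as a sketch (the signatures of the two helpers named above are assumed):

let finalized_hash = header::hash_from_scale_encoded_header(
    &subscription.finalized_block_scale_encoded_header,
);
// Fast path: the post-warp finalized block is already in hand.
if let Some(head) = try_fetch_parachain_head(finalized_hash).await {
    return head;
}
// Slow path: log "Waiting for relay chain to finalize a block..." and wait
// for a fresh Finalized notification.
let hash = wait_for_finalized_hash(&mut subscription).await;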

Comment thread light-base/src/sync_service/parachain.rs Outdated
Copilot AI left a comment:

Pull request overview

Fixes parachain warm-restart hangs by using the already-known finalized relay-chain block returned by subscribe_all() to fetch the parachain head immediately, instead of always waiting for a new Finalized notification (which may not arrive promptly when the relay chain is already synced).

Changes:

  • Attempt fetching the parachain head using subscription.finalized_block_scale_encoded_header first.
  • Fall back to waiting for new Finalized notifications as before.


Comment thread light-base/src/sync_service/parachain.rs
@replghost changed the title from "fix(sync-service): use initial finalized block for parachain head fetch" to "fix(sync-service): try initial finalized block before waiting for new notifications" on Apr 22, 2026
@replghost (Contributor, Author):

@lexnv friendly ping — I've addressed your review comments, would you mind taking another look when you get a chance?

…d block

On cold start, `try_initial_finalized` could resolve with the chain-spec
checkpoint as "finalized" before warp sync promoted it to the warp target.
When the checkpoint lags the head (e.g. Paseo, ~70k relay blocks, as of now), the
parachain bootstraps from a months-stale head and then chases ~200k
para-blocks of finality, exceeding the 120s benchmark timeout.

Gate on a new wait_warp_sync_finished() signal; warm restart keeps its
zero-wait path.
@lrubasze (Contributor) commented Apr 28, 2026

Benchmark results

Paseo + Asset Hub Paseo, time-to-finalized, n=10 iterations.

Mode | Baseline v3.1.1 (median / p95 / max) | PR #3210 (median / p95 / max) | Speedup (median / p95)
Cold | 16.5s / 50.0s / 70.0s                | 5.5s / 11.2s / 12.5s          | 3.0× / 4.5×
Warm | 17.3s / 43.5s / 50.3s                | 4.2s / 4.7s / 4.8s            | 4.1× / 9.3×

Highlights:

  • Warm-start variance collapses: stddev 13.4s → 0.7s. The hang-until-next-GrandPa-round behavior is gone — every run is now fast.
  • Cold start is 3× faster median, 4.5× at p95, no runs near the 120s timeout (worst 12.5s vs baseline 70s).

benchmarked with: #3233

`client.terminate()` waits on the WASM executor-shutdown event, which
can park for several minutes on a multi-chain client. Skip it and exit
with the assertion outcome — there's nothing to clean up beyond the
process lifetime in CI.
Smoldot's parachain sync gates on the relay chain's warp-sync phase
finishing. On a fresh zombienet from genesis, warp sync only engages
once peers report finalized > minimum_gap (32). Wait for the relay to
finalize 50 blocks past baseline (above the gap with headroom) on
validator-0 before launching smoldot, so warp sync has a real target
to chase. Without this gating, warp sync never starts and the
parachain's chainHead subscription never delivers blocks.

Also add log::info! step markers (network up, alice blocks, relay
finalized target/baseline, JS test launch) to make progress and
parameter values visible in test output.
Move ensure_smoldot_built() + ensure_js_deps_installed() to right after
network.detach() — zombienet runs as detached subprocesses so the build
overlaps with relay block production, shaving ~55 s off wall time.

Reorder the alice ≥ REQUIRED_BLOCKS check to run after the relay-finalized
wait. By that point the relay has finalized 50 blocks past baseline so
alice has had plenty of time to produce; the check is now a fast sanity
probe with a 60 s timeout instead of 300 s.
- add BEST_METRIC const for parity with FINALIZED_METRIC
- bind relay_spec_str / para_spec_str once instead of repeating
  to_str().expect("UTF-8 path")
- pass &base_dir_str directly to with_base_dir
- drop redundant "Sanity check" comment that the log line above
  already conveys
`is_warp_sync_finished()` now also returns true once all-forks
finalizes past the chain-spec starting block, so the gate doesn't
deadlock on networks where warp sync never engages (local zombienet
from genesis, warm restart at head). Drain pending waiters in the
`NewFinalized` arm so already-queued waits actually wake. Paseo
cold-start path is unchanged.
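
The drain amounts to something like this (a sketch assuming oneshot-based waiters; names are assumptions and the real bookkeeping lives in the sync_service background task):

// Hypothetical helper called from the `NewFinalized` arm of the event loop.
// `waiters` holds oneshot senders queued by wait_warp_sync_finished().
fn wake_warp_sync_waiters(
    waiters: &mut Vec<futures::channel::oneshot::Sender<()>>,
    warp_sync_finished: bool,
) {
    if warp_sync_finished {
        // Without this drain, waits queued before finality passed the
        // starting block would never resolve.
        for waiter in waiters.drain(..) {
            let _ = waiter.send(());
        }
    }
}
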
Comment thread lib/src/sync/all.rs
Comment on lines +306 to +310
if self.warp_sync.is_none() {
    true
} else {
    self.finalized_block_number() > self.shared.starting_block_number
}
Member:

This is clearly a bug. Even if we did not use warp_sync for whatever reason, it should be set to None. We should not need this weird check here.

Contributor:

This one we should not do for now; last time this caused a lot of problems. I propose not touching warp_sync for now.

Contributor:

This is to address edge cases for warm start or a short local zombienet, where warp-sync proofs don't arrive and the warp-sync machine never reaches WarpSyncFinished.
After recent struggles (e.g. panics) from touching warp_sync, I was not very keen to touch it again.

Comment on lines +1130 to +1140
relay_chain_sync.wait_warp_sync_finished().await;
log!(
    platform,
    Debug,
    log_target,
    "Relay chain warp sync finished."
);

let mut subscription = relay_chain_sync
    .subscribe_all(32, NonZero::<usize>::new(usize::MAX).unwrap())
    .await;
Member:

If we just ensured that a finalized event is sent after warp sync finishes, we wouldn't need any of the changes here?

Contributor:

Yeah this sounds like a good idea.

Contributor:

Yes, that would basically mean that subscribe_all itself ensures warp sync is finished before returning.
Let me work on this.

Contributor:

runtime_service has been refactored so that subscribe_all() returns only after warp sync is finished.
Applied some changes here too, to make sure that subscription.finalized_block_scale_encoded_header is used first, so we don't need to wait for Finalized, which can take some time.

@lrubasze (Contributor) commented May 4, 2026:

To shed more light on the refactor here.
runtime_service tracks every block received in an internal tree. To emit a runtime_service::Notification::Finalized event for a block, it looks the block hash up in that tree. The blocks are provided by sync_service via sync_service::Notification::Finalized events, which does not happen for warp-synced blocks. If sync_service sent a Finalized event for a warp-synced block, runtime_service wouldn't find the block in the tree and would, by design, panic (lines 2613, 2621):

WakeUpReason::Notification(sync_service::Notification::Finalized {
    hash,
    best_block_hash_if_changed,
    ..
}) => {
    // Sync service has reported a finalized block.
    log!(
        &background.platform,
        Trace,
        &background.log_target,
        "input-chain-finalized",
        block_hash = HashDisplay(&hash),
        best_block_hash = if let Some(best_block_hash) = best_block_hash_if_changed {
            Cow::Owned(HashDisplay(&best_block_hash).to_string())
        } else {
            Cow::Borrowed("<unchanged>")
        }
    );

    if let Some(best_block_hash) = best_block_hash_if_changed {
        match &mut background.tree {
            Tree::FinalizedBlockRuntimeKnown { tree, .. } => {
                let new_best_block = tree
                    .input_output_iter_unordered()
                    .find(|block| block.user_data.hash == best_block_hash)
                    .unwrap()
                    .id;
                tree.input_set_best_block(Some(new_best_block));
            }
            Tree::FinalizedBlockRuntimeUnknown { tree, .. } => {
                let new_best_block = tree
                    .input_output_iter_unordered()
                    .find(|block| block.user_data.hash == best_block_hash)
                    .unwrap()
                    .id;
                tree.input_set_best_block(Some(new_best_block));
            }
        }
    }

    match &mut background.tree {
        Tree::FinalizedBlockRuntimeKnown {
            tree,
            finalized_block,
            ..
        } => {
            debug_assert_ne!(finalized_block.hash, hash);
            let node_to_finalize = tree
                .input_output_iter_unordered()
                .find(|block| block.user_data.hash == hash)
                .unwrap()
                .id;
            tree.input_finalize(node_to_finalize);
        }
        Tree::FinalizedBlockRuntimeUnknown { tree, .. } => {
            let node_to_finalize = tree
                .input_output_iter_unordered()
                .find(|block| block.user_data.hash == hash)
                .unwrap()
                .id;
            tree.input_finalize(node_to_finalize);
        }
    }
}

Instead of sending the event, sync_service kills the subscription after warp sync finishes. Subscribers reconnect and get a fresh tree starting from the post-warp head (subscription.finalized_block_scale_encoded_header), and they receive finalized blocks as they are synced.

We assumed that faking a Finalized event for the warp-sync block would be less clean than simply using subscription.finalized_block_scale_encoded_header, especially once warp sync is guaranteed to be finished before subscribe_all() returns.
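
From the subscriber's side the pattern is roughly (a sketch with a simplified API; handling of the individual notification variants is elided):

loop {
    let mut sub = service
        .subscribe_all(32, NonZero::<usize>::new(usize::MAX).unwrap())
        .await;
    // Each fresh subscription starts a new tree at the post-warp head.
    let _head = &sub.finalized_block_scale_encoded_header;
    while let Some(notification) = sub.new_blocks.next().await {
        // ...handle Finalized / new-block notifications...
        let _ = notification;
    }
    // Stream ended: the service killed the subscription (e.g. at warp-sync
    // completion); loop around and resubscribe.
}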

Member:

IMO this is still too complicated and I think something like this should be enough to achieve the same.

Contributor:

Refactored it in the suggested manner, by sending a "synthetic" Finalized upon each new subscription.

It was not possible to do this in sync_service via sync_service::Notification::Finalized (through light-base/src/sync_service/substrate_compat.rs) as suggested here, because it caused the panic described above.

Member:

Can we not just announce the block, directly followed by the finalized notification? That should be possible?

Contributor:

Check #3246

) => {
    // Paraheads doesn't run a warp-sync phase of its own; delegate to the relay's
    // sync service.
    self.relay_chain_sync.wait_warp_sync_finished().await;
Member:

You are blocking the entire sync service here.

Reply:

Why so? The function sync_service::parachain::start_parachain performs multiple steps; the first is to call fetch_parachain_head_from_relay, where the "real" relay_chain_sync.wait_warp_sync_finished() await happens, and only after that are all the paraheads spawned. So no real lock should happen: as long as the relay-chain task keeps working, this call should return immediately, because warp sync has already finished.

The relay task doesn't seem to depend on any parachain/parahead logic, so no deadlock seems possible.

This seems to be more of an adaptation to the new ToBackground variant, which is only used by the relay-chain task.

Contributor:

Yes, this is dead code, just there to support the new ToBackground variant; Paraheads is started only after warp sync has already finished. I should have left a comment.
But I agree this is a code smell, so given the above maybe we could omit this await call entirely:

Suggested change:

-    self.relay_chain_sync.wait_warp_sync_finished().await;
+    // Paraheads is spawned only after the relay's warp sync has finished,
+    // so this is always already done.

@lrubasze changed the title from "fix(sync-service): try initial finalized block before waiting for new notifications" to "fix: parachain warm-restart hang" on May 5, 2026
@lrubasze (Contributor) commented May 7, 2026

Closing in favor of #3246

@lrubasze lrubasze closed this May 7, 2026
