Conversation
On warm restart from databaseContent, the relay chain may already be synced. fetch_parachain_head_from_relay() was waiting for a NEW Notification::Finalized event from subscribe_all(), which might not arrive for seconds (or indefinitely if the relay sync stalls). The fix: try the already-finalized block from subscribe_all immediately, before waiting for new notifications. This is the block that's already available in subscription.finalized_block_scale_encoded_header.

Before: parachain warm restart NEVER initialized (>5min timeout)
After: parachain warm restart initializes in ~3s

The runtime hint verification in bootstrap_parachain_consensus already handles reusing the cached runtime from databaseContent: it verifies the merkle value and skips the ~2MB download when it matches.

Fixes #3204.
    platform,
    Info,
    log_target,
    "Waiting for relay chain to finalize a block..."
This should be moved on the else branch where the subscription.next is polled
Done: the log now only fires in wait_for_finalized_hash(), which is only called when the initial attempt (using the already-available finalized block) fails. The function was also restructured: try_fetch_parachain_head() attempts the fetch and returns an Option, the main loop tries the initial finalized block first, and only falls through to wait_for_finalized_hash() on failure.
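A minimal sketch of that restructured flow, with simplified stand-in types (the helper names come from this thread; the bodies, signatures, and types are assumptions, not smoldot's actual code):

    // Sketch only: Subscription, ParaHead, and the helper bodies are stand-ins.
    struct Subscription {
        finalized_block_scale_encoded_header: Vec<u8>,
    }
    struct ParaHead(Vec<u8>);

    fn header_hash(scale_encoded_header: &[u8]) -> [u8; 32] {
        unimplemented!("hash of the SCALE-encoded header")
    }

    // Attempts the parachain-head fetch against one relay block; None on failure.
    async fn try_fetch_parachain_head(relay_block_hash: [u8; 32]) -> Option<ParaHead> {
        unimplemented!()
    }

    // Waits for the next Finalized notification; this is where the
    // "Waiting for relay chain to finalize a block..." log now lives.
    async fn wait_for_finalized_hash(subscription: &mut Subscription) -> [u8; 32] {
        unimplemented!()
    }

    async fn fetch_parachain_head_from_relay(mut subscription: Subscription) -> ParaHead {
        // First attempt: the finalized block subscribe_all() already returned.
        let initial = header_hash(&subscription.finalized_block_scale_encoded_header);
        if let Some(head) = try_fetch_parachain_head(initial).await {
            return head; // warm restart: resolves immediately, no new notification needed
        }
        // Fallback: wait for new Finalized notifications, as before.
        loop {
            let hash = wait_for_finalized_hash(&mut subscription).await;
            if let Some(head) = try_fetch_parachain_head(hash).await {
                return head;
            }
        }
    }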
Pull request overview
Fixes parachain warm-restart hangs by using the already-known finalized relay-chain block returned by subscribe_all() to fetch the parachain head immediately, instead of always waiting for a new Finalized notification (which may not arrive promptly when the relay chain is already synced).
Changes:
- Attempt fetching the parachain head using subscription.finalized_block_scale_encoded_header first.
- Fall back to waiting for new Finalized notifications as before.
@lexnv friendly ping: I've addressed your review comments, would you mind taking another look when you get a chance?
…d block

On cold start, `try_initial_finalized` could resolve with the chain-spec checkpoint as "finalized" before warp sync promoted it to the warp target. When the checkpoint lags the head (e.g. Paseo, ~70k relay blocks behind as of now), the parachain bootstraps from a months-stale head and then chases ~200k para-blocks of finality, exceeding the 120s benchmark timeout. Gate on a new wait_warp_sync_finished() signal; warm restart keeps its zero-wait path.
Benchmark results

Paseo + Asset Hub Paseo, time-to-finalized, n=10 iterations.
Highlights:
benchmarked with: #3233
`client.terminate()` waits on the WASM executor-shutdown event, which can park for several minutes on a multi-chain client. Skip it and exit with the assertion outcome — there's nothing to clean up beyond the process lifetime in CI.
Smoldot's parachain sync gates on the relay chain's warp-sync phase finishing. On a fresh zombienet from genesis, warp sync only engages once peers report finalized > minimum_gap (32). Wait for the relay to finalize 50 blocks past baseline (above the gap with headroom) on validator-0 before launching smoldot, so warp sync has a real target to chase. Without this gating, warp sync never starts and the parachain's chainHead subscription never delivers blocks. Also add log::info! step markers (network up, alice blocks, relay finalized target/baseline, JS test launch) to make progress and parameter values visible in test output.
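A minimal sketch of that gating step, assuming a hypothetical read_metric accessor for validator-0's finalized-height metric (the constant names echo the test; the exact metric string and API are assumptions):

    use std::time::{Duration, Instant};

    const FINALIZED_METRIC: &str = "block_height{status=\"finalized\"}"; // assumed name
    const GATE_BLOCKS: u64 = 50; // past-baseline headroom above warp_sync_minimum_gap (32)

    // Blocks until validator-0 reports finalized >= baseline + GATE_BLOCKS.
    async fn wait_relay_finalized_past_baseline(read_metric: impl Fn(&str) -> u64) {
        let baseline = read_metric(FINALIZED_METRIC);
        let target = baseline + GATE_BLOCKS;
        log::info!("relay finalized baseline={baseline}, waiting for target={target}");
        let deadline = Instant::now() + Duration::from_secs(600);
        while read_metric(FINALIZED_METRIC) < target {
            assert!(Instant::now() < deadline, "relay never finalized past {target}");
            tokio::time::sleep(Duration::from_secs(2)).await;
        }
    }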
Move ensure_smoldot_built() + ensure_js_deps_installed() to right after network.detach(); zombienet runs as detached subprocesses, so the build overlaps with relay block production, shaving ~55 s off wall time. Reorder the alice ≥ REQUIRED_BLOCKS check to run after the relay-finalized wait. By that point the relay has finalized 50 blocks past baseline, so alice has had plenty of time to produce; the check is now a fast sanity probe with a 60 s timeout instead of 300 s.
- add BEST_METRIC const for parity with FINALIZED_METRIC
- bind relay_spec_str / para_spec_str once instead of repeating to_str().expect("UTF-8 path")
- pass &base_dir_str directly to with_base_dir
- drop redundant "Sanity check" comment that the log line above already conveys
`is_warp_sync_finished()` now also returns true once all-forks finalizes past the chain-spec starting block, so the gate doesn't deadlock on networks where warp sync never engages (local zombienet from genesis, warm restart at head). Drain pending waiters in the `NewFinalized` arm so already-queued waits actually wake. Paseo cold-start path is unchanged.
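A minimal sketch of that drain-pending-waiters pattern, assuming a oneshot-channel based gate (the type, field, and function names are illustrative, not smoldot's actual internals):

    use std::sync::Mutex;
    use futures_channel::oneshot;

    struct WarpSyncGate {
        finished: bool,
        waiters: Vec<oneshot::Sender<()>>,
    }

    impl WarpSyncGate {
        // What wait_warp_sync_finished() boils down to: return immediately if
        // already finished, otherwise park a oneshot sender until woken.
        async fn wait_finished(gate: &Mutex<WarpSyncGate>) {
            let rx = {
                let mut g = gate.lock().unwrap();
                if g.finished {
                    return;
                }
                let (tx, rx) = oneshot::channel();
                g.waiters.push(tx);
                rx
                // guard drops here, before the await
            };
            let _ = rx.await;
        }

        // Called from the event loop's NewFinalized arm.
        fn on_new_finalized(gate: &Mutex<WarpSyncGate>, past_starting_block: bool) {
            let mut g = gate.lock().unwrap();
            if past_starting_block && !g.finished {
                g.finished = true;
                // Drain every already-queued waiter so earlier waits wake too.
                for tx in g.waiters.drain(..) {
                    let _ = tx.send(());
                }
            }
        }
    }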
    if self.warp_sync.is_none() {
        true
    } else {
        self.finalized_block_number() > self.shared.starting_block_number
    }
This is clearly a bug. Even if we did not use warp_sync for whatever reason, it should be set to None. We should not need this weird check here.
This one we should not do for now; last time this caused a lot of problems. I propose not to touch warp_sync for now.
This is to address edge cases for warm start or a short local zombienet, where warp sync proofs don't arrive and the warp sync machine never reaches WarpSyncFinished.
After recent struggles (e.g. panics) from touching warp_sync, I was not very keen to touch it again.
    relay_chain_sync.wait_warp_sync_finished().await;
    log!(
        platform,
        Debug,
        log_target,
        "Relay chain warp sync finished."
    );

    let mut subscription = relay_chain_sync
        .subscribe_all(32, NonZero::<usize>::new(usize::MAX).unwrap())
        .await;
If we just ensured that a finalized event is sent after warp sync finishes, we would not need any of the changes here?
Yeah this sounds like a good idea.
Yes, that would basically mean that subscribe_all itself ensures warp sync is finished before returning.
Let me work on this.
runtime_service refactored, so subscribe_all() returns only after warp sync is finished.
Applied some changes here too, just to make sure that subscription.finalized_block_scale_encoded_header is used first, so we don't need to wait for Finalized, which can take some time.
To shed more light on the refactor here.
runtime_service tracks every block it receives in an internal tree. To emit a runtime_service::Notification::Finalized event for a block, it looks up the block hash in that tree. The blocks are provided by sync_service via sync_service::Notification::Finalized events, which is not the case for warp-synced blocks. If sync_service sent a Finalized event for a warp-synced block, runtime_service wouldn't find that block in the tree and would, by design, panic (lines 2613, 2621):
smoldot/light-base/src/runtime_service.rs, lines 2562 to 2626 at 1905021
Instead of sending the event, sync_service kills the subscription after warp sync is finished. Subscribers reconnect and get a fresh tree starting from the post-warp head (subscription.finalized_block_scale_encoded_header), and they receive finalized blocks as they are synced.
I assumed that faking a Finalized event for the warp-synced block would be less clean than simply using subscription.finalized_block_scale_encoded_header, especially after ensuring that warp sync is finished before subscribe_all() returns.
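A minimal sketch of that consumer-side pattern, with simplified stand-in types (the subscribe_all arguments are taken from the diff above; the service type and handler names are hypothetical):

    // Sketch only: these types and handlers are stand-ins, not smoldot's API.
    use core::num::NonZero;
    use futures_util::StreamExt;

    enum Notification { Finalized, Block /* ..., BestBlockChanged, etc. */ }
    struct Subscription {
        finalized_block_scale_encoded_header: Vec<u8>,
        new_blocks: futures_channel::mpsc::Receiver<Notification>,
    }
    struct Service;
    impl Service {
        async fn subscribe_all(&self, _max: usize, _buf: NonZero<usize>) -> Subscription {
            unimplemented!()
        }
    }
    fn process_finalized(_scale_encoded_header: &[u8]) {}
    fn handle(_n: Notification) {}

    async fn follow_relay_chain(relay_chain_sync: Service) {
        loop {
            let mut subscription = relay_chain_sync
                .subscribe_all(32, NonZero::<usize>::new(usize::MAX).unwrap())
                .await;
            // Seed from the fresh tree's root: the post-warp finalized block is
            // already in the subscription; no Finalized notification is needed.
            process_finalized(&subscription.finalized_block_scale_encoded_header);
            // Consume notifications until the subscription is killed (e.g. when
            // warp sync finishes), then loop around and re-subscribe.
            while let Some(notification) = subscription.new_blocks.next().await {
                handle(notification);
            }
        }
    }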
IMO this is still too complicated and I think something like this should be enough to achieve the same.
Can we not just announce the block, directly followed by the finalized notification? This should be possible?
    ) => {
        // Paraheads doesn't run a warp-sync phase of its own; delegate to the relay's
        // sync service.
        self.relay_chain_sync.wait_warp_sync_finished().await;
You are blocking the entire sync service here.
Why so? The function sync_service::parachain::start_parachain performs multiple steps; the first is to call fetch_parachain_head_from_relay, where the 'real' relay_chain_sync.wait_warp_sync_finished happens, and only after that are all the paraheads spawned. So no real blocking should occur: as long as the relay_chain task keeps working, the wait resolves immediately because warp sync has already finished.
The relay task doesn't seem to depend on any parachain/parahead logic, so no deadlock seems possible.
This seems to be more of an adaptation to the new ToBackground variant, which is only used by the relay chain task.
Yes, this is dead code, just to support the new ToBackground variant. Paraheads is started after warp sync has already finished. I should have left a comment.
But I agree this is a code smell, so given the above maybe we could omit this await call entirely:
    - self.relay_chain_sync.wait_warp_sync_finished().await;
    + // Paraheads is spawned only after the relay's warp sync has finished,
    + // so this is always already done.
Move the warp sync wait into runtime_service so subscribe_all only returns after warp sync is finished.
Closing in favor of #3246
Summary
Fix parachain warm-restart hang on fetch_parachain_head_from_relay, without regressing cold start, by moving the warp-sync wait into runtime_service and emitting a synthetic Finalized event on each new subscription.

Problem

fetch_parachain_head_from_relay() ignored the finalized header already returned by subscribe_all() and awaited a new Finalized notification via subscription.new_blocks.next().await. On warm restart the relay is already synced, so nothing arrives until the next GRANDPA round (~6s on Polkadot, longer with slow peers), and parachain sync hangs at "Waiting for relay chain to finalize a block..."
Fix

In runtime_service:
- Gate subscribe_all on warp-sync completion. Every subscription it returns has a post-warp finalized header.
- Emit a synthetic Notification::Finalized for the current finalized block as the first stream event.

Consumers waiting on Notification::Finalized (e.g. the parachain task) now receive one immediately.
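A minimal sketch of the synthetic-first-event idea, with simplified stand-in types (the variant fields, channel choice, and struct names are assumptions, not runtime_service's actual internals):

    // Sketch only: simplified stand-ins for runtime_service internals.
    // Assumes the async-channel crate; any async channel works the same way.
    enum Notification {
        Finalized { hash: [u8; 32] },
        // ... Block, BestBlockChanged, etc.
    }

    struct Subscription {
        new_blocks: async_channel::Receiver<Notification>,
        finalized_block_scale_encoded_header: Vec<u8>,
    }

    struct RuntimeService {
        finalized_hash: [u8; 32],
        finalized_header: Vec<u8>,
        subscribers: Vec<async_channel::Sender<Notification>>,
    }

    impl RuntimeService {
        // Assumes warp sync has already finished by the time this runs,
        // per the gating described above.
        fn new_subscription(&mut self, buffer: usize) -> Subscription {
            let (tx, rx) = async_channel::bounded(buffer);
            // Synthetic first event: Finalized for the block the returned header
            // already describes, so a consumer awaiting Notification::Finalized
            // wakes immediately instead of waiting for the next GRANDPA round.
            let _ = tx.try_send(Notification::Finalized { hash: self.finalized_hash });
            self.subscribers.push(tx);
            Subscription {
                new_blocks: rx,
                finalized_block_scale_encoded_header: self.finalized_header.clone(),
            }
        }
    }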
API change

New public method on SyncService:
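(The code block was lost in extraction. Inferred from the call sites in this thread, relay_chain_sync.wait_warp_sync_finished().await, it is presumably along these lines; the exact signature may differ in the final diff:)

    // Resolves once the relay chain's warp-sync phase has finished.
    pub async fn wait_warp_sync_finished(&self);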
Smoke test

e2e-tests/tests/smoke.rs (added in #3234) didn't exercise the warp-sync path. On a freshly-spawned zombienet, peers report finalized below smoldot's warp_sync_minimum_gap = 32, all-forks catches up via normal GRANDPA finality before warp sync can engage, and wait_warp_sync_finished() never resolves.

Updated the test to wait for the relay to finalize 50 blocks past baseline before launching smoldot, so warp sync has a real target and WarpSyncFinished fires.

Credits
… (try_fetch_parachain_head / wait_for_finalized_hash)
… wait_warp_sync_finished() gating