
perf(sync-service): skip parachain runtime download on warm start #3214

Open

replghost wants to merge 12 commits into main from perf/parachain-warm-skip-download

Conversation


replghost (Contributor) commented Apr 22, 2026

Problem

On every warm restart, smoldot downloads the full parachain runtime (~2 MiB) again, even though the same bytes were saved to the DB last session. If the peer it picks is slow, the download can exceed the 16 s timeout, fall back to cold start, and the user waits tens of seconds.

Fix

Save runtimeCode in databaseContent. Split the parachain bootstrap into two paths:

  • Warm path (warm_bootstrap): DB has cached runtime bytes. Check them against the chain, compile locally, fetch only the Aura call proofs. On any failure, return Err and let the caller try cold.
  • Cold path (cold_bootstrap_loop, the original logic moved into a retry loop): no cached bytes, download :code + :heappages and retry on failure.

start_parachain runs warm if cached bytes are present, otherwise cold. If warm fails, cold takes over.
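In outline, the dispatch looks like this (a synchronous sketch with stub types; the real functions are async and take smoldot's full config, but `start_parachain`, `warm_bootstrap`, `cold_bootstrap_loop`, and `saved_runtime_code` are the names this PR uses):

```rust
// Stub types for illustration only.
struct Config {
    saved_runtime_code: Option<Vec<u8>>,
}
struct Bootstrapped;

// Warm path: verify the cached bytes against the chain, compile
// locally, fetch only the Aura call proofs. Any failure is an Err.
fn warm_bootstrap(_cfg: &Config, _cached_code: Vec<u8>) -> Result<Bootstrapped, ()> {
    Err(())
}

// Cold path: the original logic, downloading :code + :heappages and
// retrying on failure.
fn cold_bootstrap_loop(_cfg: &Config) -> Bootstrapped {
    Bootstrapped
}

fn start_parachain(mut cfg: Config) -> Bootstrapped {
    if let Some(cached_code) = cfg.saved_runtime_code.take() {
        if let Ok(bootstrapped) = warm_bootstrap(&cfg, cached_code) {
            return bootstrapped;
        }
        // Warm failed (stale cache, bad peer, ...): fall through to cold.
    }
    cold_bootstrap_loop(&cfg)
}
```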

How we check the cached bytes without downloading the runtime

We can't ask the peer for :code directly: Substrate's prover always bundles the 2 MiB value with that proof, so we'd lose the saving.

Instead we ask for :code\0, a key that doesn't exist on chain. To prove it doesn't exist, the peer has to walk the trie down to :code's leaf. The leaf's value isn't read, so it isn't bundled. We get the leaf's blake2_256 hash in ~883 bytes, hash our cached bytes, and compare.
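A minimal sketch of the comparison step, using the `blake2` crate (smoldot's own trie code does the proof decoding, which is omitted here; `leaf_value_hash` stands in for the 32-byte hash recovered from the absence proof). `CODE_ANCHOR_PROBE_KEY` is the constant this PR introduces; everything else is illustrative:

```rust
use blake2::{digest::consts::U32, Blake2b, Digest};

/// `:code` followed by a 0x00 byte: a strict descendant of `:code`
/// that no runtime ever writes, so its absence proof must walk
/// through `:code`'s leaf.
const CODE_ANCHOR_PROBE_KEY: &[u8] = b":code\0";

/// Compare blake2_256 of the cached runtime bytes against the leaf
/// hash exposed by the absence proof.
fn cached_code_matches(cached_code: &[u8], leaf_value_hash: &[u8; 32]) -> bool {
    let hashed: [u8; 32] = Blake2b::<U32>::digest(cached_code).into();
    &hashed == leaf_value_hash
}
```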

Benchmark — Paseo AH, parachain bootstrap step (10 warm restarts)

| metric | before PR | after PR | Δ |
| --- | --- | --- | --- |
| median | 1.43 s | 0.38 s | −74% |
| max | 2.62 s | 0.56 s | −79% |
| mean | 1.42 s | 0.38 s | −73% |
| range | 0.64–2.62 s | 0.27–0.56 s | tighter |

This is the parachain bootstrap step only, which is what the PR changes. End-to-end startup time (relay warp sync + paraheads + para bootstrap) swings ±10 s between runs because of relay-side network conditions, not anything in this PR.

Database size

Adds ~2.0–2.5 MB (base64) to the DB JSON. The existing shrink logic drops the field if the DB exceeds max_size, so callers with size limits just fall back to cold like today.
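For illustration, the shrink step amounts to something like the following (a sketch using `serde_json`; database.rs has its own encoder, and only the `runtimeCode` field name and the size cap come from this PR):

```rust
// Hypothetical sketch, not the actual serializer in database.rs.
fn encode_with_size_cap(db: &mut serde_json::Value, max_size: usize) -> String {
    let encoded = db.to_string();
    if encoded.len() <= max_size {
        return encoded;
    }
    // Over the cap: drop the cached runtime and re-encode. A later
    // warm start then finds no cached bytes and falls back to cold.
    if let Some(obj) = db.as_object_mut() {
        obj.remove("runtimeCode");
    }
    db.to_string()
}
```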

Stale cache

If the chain runs a runtime upgrade between sessions, the cached bytes won't match. The hash check fails, warm returns Err, and cold takes over. So whether warm fires in practice depends on how recent the user's DB is.

On warm restart from databaseContent, the relay chain may already be
synced. fetch_parachain_head_from_relay() was waiting for a NEW
Notification::Finalized event from subscribe_all(), which might not
arrive for seconds (or indefinitely if the relay sync stalls).

The fix: try the already-finalized block from subscribe_all immediately
before waiting for new notifications. This is the block that's already
available in subscription.finalized_block_scale_encoded_header.

Before: parachain warm restart NEVER initialized (>5min timeout)
After:  parachain warm restart initializes in ~3s
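A minimal sketch of the ordering change (types reduced to plain values; the real fetch_parachain_head_from_relay() is async and works on smoldot's subscribe_all() output):

```rust
// `already_finalized` stands in for
// subscription.finalized_block_scale_encoded_header; `notifications`
// stands in for the stream of Notification::Finalized events.
fn first_finalized_header(
    already_finalized: Option<Vec<u8>>,
    notifications: impl IntoIterator<Item = Vec<u8>>,
) -> Option<Vec<u8>> {
    // 1. Use the finalized block that subscribe_all() already reported.
    if let Some(header) = already_finalized {
        return Some(header);
    }
    // 2. Only then wait for new notifications; this is where the old
    //    code could block for seconds, or forever if relay sync stalls.
    notifications.into_iter().next()
}
```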

The runtime hint verification in bootstrap_parachain_consensus already
handles reusing the cached runtime from databaseContent — it verifies
the merkle value and skips the ~2MB download when it matches.

Fixes #3204.

When databaseContent includes the runtime code (runtimeCode in the JSON),
compile it locally instead of downloading ~2 MiB from a P2P peer.

The warm path still fetches two lightweight Aura call proofs (~few KB each)
to verify the cached runtime works against the current block. If compilation
or verification fails, falls back to the full P2P download.

Changes:
- database.rs: persist code_storage_value (was intentionally discarded);
  decode it back as runtime_code in DatabaseContent
- sync_service.rs: add saved_runtime_code to ConfigParachain
- parachain.rs: add try_warm_start_from_cached_code() that compiles cached
  code and verifies via AuraApi call proofs; extract cold_bootstrap_loop()
- lib.rs: thread saved_runtime_code from database through to ConfigParachain

Tested on Paseo, Polkadot, Kusama Asset Hubs:
- Paseo: warm para 1.1s vs cold 2.1s (no :code download)
- Polkadot: warm para 4.1s vs cold 2.2s (call proof latency)
- Kusama: warm para 5.5s vs cold 5.9s (call proof latency)
- All three: runtimeCode saved to DB (2.0-2.5 MB), no download on warm

Builds on #3210 (correctness fix for the warm hang).

…t fallback

Database tests:
- decode_database_without_runtime_code: no runtime_code field → None
- decode_database_with_runtime_code_only: runtimeCode without merkle hint
- decode_database_with_full_hint_populates_both: all three fields present
- decode_database_invalid_base64_runtime_code_returns_error: bad input
- encode_shrink_drops_runtime_code_when_too_large: size cap drops code

Warm-start fallback tests:
- invalid_cached_runtime_fails_compilation: garbage bytes → Err
- empty_cached_runtime_fails_compilation: empty bytes → Err
- wasm_without_memory_fails_gracefully: truncated WASM → Err (not panic)
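For shape, the compilation-failure tests reduce to something like this (the entry point name is from the changelist above, but the signature here is a simplified stand-in for the real one in parachain.rs):

```rust
// Simplified stand-in: the real function also takes network/config
// handles and runs the Aura call-proof checks after compiling.
fn try_warm_start_from_cached_code(code: &[u8]) -> Result<(), &'static str> {
    // WASM modules start with the magic bytes "\0asm".
    if code.len() < 8 || &code[..4] != b"\0asm" {
        return Err("compilation failed: not a wasm module");
    }
    Ok(())
}

#[test]
fn empty_cached_runtime_fails_compilation() {
    assert!(try_warm_start_from_cached_code(&[]).is_err());
}

#[test]
fn invalid_cached_runtime_fails_compilation() {
    assert!(try_warm_start_from_cached_code(&[0xde, 0xad, 0xbe, 0xef]).is_err());
}
```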
The warm path was trusting saved Aura params and heap pages from the
database without any network verification. This could silently use a
stale runtime if it was upgraded between sessions, and compile with
wrong heap pages if the chain uses custom :heappages.

Fix:
- Warm path now fetches :heappages + both Aura call proofs from the
  network (~few KB), verifying the cached code works against current
  state. Only the ~2 MiB :code download is skipped.
- Extract shared helpers (wait_for_peer, fetch_call_proof, run_aura_calls,
  build_bootstrapped_parachain) used by both cold and warm paths,
  eliminating the code duplication.
- Remove SavedParachainState struct — just pass Option<Vec<u8>> for
  the cached runtime code. Aura params are always verified from network.
- Remove aura_slot_duration/aura_authorities from DatabaseContent and
  the Aura JSON parsing in decode_database.
- Fix double-decode in decode_database: base64 is decoded once, shared
  between runtime_code_hint and runtime_code.
- Remove tests that only tested HostVmPrototype::new (the WASM compiler),
  not the warm-start logic.

Fetch :code alongside :heappages in the warm-start storage proof and
verify cached runtime bytes (or their blake2_256 hash, for state v1
value-stripped proofs) against the on-chain trie node. On mismatch
the warm path fails and falls back to cold bootstrap, preventing a
stale or substituted runtime from passing the Aura-output sanity
check unnoticed.

Add a debug-level log of which match arm verified the cached :code, and of the proof byte size, to diagnose whether peers strip the value.

Request the non-existent strict descendant `:code\0` so the absence
proof traverses through `:code`'s leaf without loading its 2 MiB
value. For state v1 chains the leaf encoding then carries only
`Hashed(blake2_256(value))`, which we already verify against the
cached bytes via blake2. Falls back to byte-equality check if the
peer bundles the value anyway.
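
Put together with the blake2 check from the PR description, the verification reduces to two match arms (the enum is illustrative; smoldot's decoded trie-node types differ):

```rust
use blake2::{digest::consts::U32, Blake2b, Digest};

// Illustrative: what the `:code` leaf exposes in the absence proof.
enum CodeLeafValue<'a> {
    /// State v1: the proof carries only blake2_256(value).
    Hashed([u8; 32]),
    /// The peer bundled the full :code value anyway.
    Inline(&'a [u8]),
}

fn verify_cached_code(cached: &[u8], leaf: CodeLeafValue<'_>) -> bool {
    match leaf {
        CodeLeafValue::Hashed(expected) => {
            let got: [u8; 32] = Blake2b::<U32>::digest(cached).into();
            got == expected
        }
        // Byte-equality fallback when the value is present.
        CodeLeafValue::Inline(value) => value == cached,
    }
}
```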
Drop the "DB decode result" Warn log left over from benching, downgrade
the parachain warm-start availability log to Debug, and tighten the
warm_bootstrap comments. Hoist the `:code\0` probe key into a named
constant `CODE_ANCHOR_PROBE_KEY` and simplify the storage-proof error
messages to drop the escape-noisy key list.

lrubasze requested a review from skunert on May 5, 2026