perf(runtime-service): use short retry when no peers available#3213
perf(runtime-service): use short retry when no peers available#3213replghost wants to merge 2 commits intoparitytech:mainfrom
Conversation
55208ed to
d1d5446
Compare
| tree.async_op_failure(async_op_id, &background.platform.now()); | ||
| if error.is_no_peers() { | ||
| // No peers available yet — use a short retry (200ms) instead of | ||
| // the full 4s cooldown. Peers typically connect within milliseconds. |
There was a problem hiding this comment.
Is this true? I believe smoldot is not driving kademlia discovery under the hood.
There was a problem hiding this comment.
You're right — smoldot doesn't drive Kademlia. Removed the comment entirely. The backoff schedule (200ms → 400ms → 800ms → 4s) speaks for itself.
| if error.is_no_peers() { | ||
| // No peers available yet — use a short retry (200ms) instead of | ||
| // the full 4s cooldown. Peers typically connect within milliseconds. | ||
| let short_retry = background.platform.now() + Duration::from_millis(200); |
There was a problem hiding this comment.
This effectively turns the 4s cooldown into a busy loop every 200ms if the peers are genuinely not reachable. Could probably consider an exponential backoff
There was a problem hiding this comment.
Implemented. Exponential backoff: 200ms → 400ms → 800ms, then fall through to the normal 4s cooldown after 3 fast retries. Counter (no_peers_retry_count) lives in the background task, resets on successful download. See 66866ae.
|
|
||
| impl StorageQueryError { | ||
| /// Returns `true` if no peers were available to query. | ||
| pub fn is_no_peers(&self) -> bool { |
There was a problem hiding this comment.
Do we have a "no peers" path for parahead fetch?
There was a problem hiding this comment.
No, and it doesn't need one. fetch_parachain_head_from_relay (after #3210) waits on relay chain subscribe_all notifications, not directly on peers. The relay chain runtime service handles peer connectivity — the parachain fetch blocks on relay chain events, so there's no peer-level retry to optimize there.
There was a problem hiding this comment.
Pull request overview
Improves warm-start performance for the runtime service by avoiding the full retry cooldown when runtime download fails solely due to having no connected peers yet.
Changes:
- Added
StorageQueryError::is_no_peers()to detect “no peers were queried” (empty error list). - Runtime download failures now use a short 200ms retry when
is_no_peers(), otherwise keep the existing retry behavior. - Added
AsyncTree::async_op_failure_retry_atto support scheduling retries at a specific time.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| light-base/src/sync_service.rs | Adds is_no_peers() helper on StorageQueryError. |
| light-base/src/runtime_service.rs | Uses short retry (200ms) for runtime download failures caused by no peers. |
| lib/src/chain/async_tree.rs | Introduces async_op_failure_retry_at and refactors async_op_failure to use it. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| } | ||
|
|
||
| /// Similar to [`AsyncTree::async_op_failure`], but retries at the given time | ||
| /// instead of `now + retry_after_failed`. |
There was a problem hiding this comment.
async_op_failure_retry_at is a new public API but its docs omit the same # Panic contract as async_op_failure (it will also panic if AsyncOpId is invalid due to the internal unwrap()). Please document the panic conditions (and any expectations such as whether retry_after may be in the past) to keep the API contract consistent.
| /// instead of `now + retry_after_failed`. | |
| /// instead of `now + retry_after_failed`. | |
| /// | |
| /// `retry_after` may be in the past, in which case the operation can become immediately | |
| /// necessary again. | |
| /// | |
| /// # Panic | |
| /// | |
| /// Panics if the [`AsyncOpId`] is invalid. |
The runtime service tries to download the finalized block runtime
immediately at startup, before peer connections are established.
This always fails with StorageQueryError { errors: [] } (no peers
to query). Previously, this triggered the full 4s retry_after_failed
cooldown, making warm start consistently ~5-7s.
Now, "no peers" errors use a 200ms retry instead of 4s. Peers
typically connect within a few hundred milliseconds, so the retry
succeeds quickly. Other errors (peer misbehavior, decode failures)
still use the full 4s cooldown.
Benchmark on Polkadot: warm start drops from ~5.5s to ~600ms.
d1d5446 to
a4fe507
Compare
Replace the flat 200ms retry with exponential backoff (200ms, 400ms, 800ms) before falling through to the normal 4s cooldown. Prevents a busy loop when peers are genuinely unreachable while still giving a fast path for the common warm-start case. Track no_peers_retry_count in the background task. Reset on success. After 3 fast retries, fall through to the normal cooldown. Remove misleading comment about peer connection timing.
|
@lexnv friendly ping — I've addressed your review comments, would you mind taking another look when you get a chance? |
Summary
On warm start, the runtime service immediately tries to download the finalized block's runtime. No peers are connected yet. The download fails with
StorageQueryError { errors: [] }(no peers queried), triggering a 4s generic cooldown. Every warm start wastes 4 seconds.Fix
Distinguish "no peers available" from "peers returned bad data":
StorageQueryError::is_no_peers()— empty error list means nobody was queriedis_no_peers(): exponential backoff (200ms → 400ms → 800ms), then fall through to the normal 4s cooldown after 3 fast retriesAdds
async_op_failure_retry_at()toAsyncTreefor custom retry timing.Benchmark data
Relay chain warm start (5 runs each)
Parachain end-to-end (cold → save DB → warm)
Reproduction
cd smolbench SMOLDOT_DIST=path/to/dist/mjs/index-nodejs.js \ CHAIN_SPEC=chain-specs/polkadot.json WARM=1 RUNS=5 \ node bench/runtime-download-count.mjsTest plan
cargo check -p smoldot -p smoldot-lightcleancargo fmtclean