perf(runtime-service): use short retry when no peers available#3213

Open
replghost wants to merge 2 commits into paritytech:main from replghost:fix/warm-start-no-peers-retry

Conversation

Contributor

@replghost replghost commented Apr 21, 2026

Summary

On warm start, the runtime service immediately tries to download the finalized block's runtime, before any peers have connected. The download fails with StorageQueryError { errors: [] } (no peers were queried), triggering the generic 4s cooldown. Every warm start therefore wastes 4 seconds.

Fix

Distinguish "no peers available" from "peers returned bad data":

  • StorageQueryError::is_no_peers() — empty error list means nobody was queried
  • On is_no_peers(): exponential backoff (200ms → 400ms → 800ms), then fall through to the normal 4s cooldown after 3 fast retries
  • All other errors: full 4s cooldown (unchanged)
  • Counter resets on successful download

Adds async_op_failure_retry_at() to AsyncTree for custom retry timing.
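The detection helper is small enough to sketch in full. This is a simplified stand-in, assuming only what the summary states (an empty error list means nobody was queried); the real error type in light-base/src/sync_service.rs is richer than plain strings:

```rust
/// Simplified stand-in for the type in light-base/src/sync_service.rs;
/// the real per-peer error variants are not plain strings.
#[derive(Debug)]
pub struct StorageQueryError {
    /// One entry per peer that was queried and failed.
    pub errors: Vec<String>,
}

impl StorageQueryError {
    /// Returns `true` if no peers were available to query: an empty
    /// error list means the request never reached any peer at all.
    pub fn is_no_peers(&self) -> bool {
        self.errors.is_empty()
    }
}
```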

Benchmark data

Relay chain warm start (5 runs each)

| Network | Before (median) | After (median) |
| --- | --- | --- |
| Polkadot | ~5-7s | 673ms |
| Kusama | ~5-7s | 628ms |
| Paseo | ~5-7s | 1405ms |

Parachain end-to-end (cold → save DB → warm)

| Network | Cold | Warm (with fix) |
| --- | --- | --- |
| Polkadot Asset Hub | 13.7s | 7.0s |
| Paseo Asset Hub | 19.2s | 7.7s |

Reproduction

cd smolbench
SMOLDOT_DIST=path/to/dist/mjs/index-nodejs.js \
  CHAIN_SPEC=chain-specs/polkadot.json WARM=1 RUNS=5 \
  node bench/runtime-download-count.mjs

Test plan

  • cargo check -p smoldot -p smoldot-light clean
  • cargo fmt clean
  • Relay chain benchmarked on 3 networks, 5 runs each
  • Parachain end-to-end benchmarked on 3 networks
  • Exponential backoff prevents busy loop when genuinely offline
  • After 3 fast retries, falls through to normal 4s cooldown

@replghost replghost force-pushed the fix/warm-start-no-peers-retry branch from 55208ed to d1d5446 on April 21, 2026 at 23:51
Comment thread on light-base/src/runtime_service.rs (Outdated)
tree.async_op_failure(async_op_id, &background.platform.now());
if error.is_no_peers() {
// No peers available yet — use a short retry (200ms) instead of
// the full 4s cooldown. Peers typically connect within milliseconds.
Contributor


Is this true? I believe smoldot is not driving kademlia discovery under the hood.

Contributor Author


You're right — smoldot doesn't drive Kademlia. Removed the comment entirely. The backoff schedule (200ms → 400ms → 800ms → 4s) speaks for itself.

Comment thread on light-base/src/runtime_service.rs (Outdated)
if error.is_no_peers() {
// No peers available yet — use a short retry (200ms) instead of
// the full 4s cooldown. Peers typically connect within milliseconds.
let short_retry = background.platform.now() + Duration::from_millis(200);
Contributor


This effectively turns the 4s cooldown into a busy loop every 200ms if the peers are genuinely not reachable. Could we consider an exponential backoff instead?

Contributor Author


Implemented. Exponential backoff: 200ms → 400ms → 800ms, then fall through to the normal 4s cooldown after 3 fast retries. Counter (no_peers_retry_count) lives in the background task, resets on successful download. See 66866ae.
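The schedule described above can be sketched as a pure function. This is illustrative only; in the PR the counter lives in the background task and the resulting time is fed to the new retry-at entry point, and the function name here is a stand-in:

```rust
use std::time::Duration;

/// Retry delay after the `count`-th consecutive "no peers" failure:
/// 200ms, 400ms, 800ms, then the normal 4s cooldown.
fn no_peers_retry_delay(count: u32) -> Duration {
    const FAST_RETRIES: u32 = 3;
    if count < FAST_RETRIES {
        // Doubles on each fast retry: 200 << 0 = 200, << 1 = 400, << 2 = 800.
        Duration::from_millis(200u64 << count)
    } else {
        // After three fast retries, fall through to the generic cooldown.
        Duration::from_secs(4)
    }
}
```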


impl StorageQueryError {
/// Returns `true` if no peers were available to query.
pub fn is_no_peers(&self) -> bool {
Contributor


Do we have a "no peers" path for parahead fetch?

Contributor Author


No, and it doesn't need one. fetch_parachain_head_from_relay (after #3210) waits on relay chain subscribe_all notifications, not directly on peers. The relay chain runtime service handles peer connectivity — the parachain fetch blocks on relay chain events, so there's no peer-level retry to optimize there.


Copilot AI left a comment


Pull request overview

Improves warm-start performance for the runtime service by avoiding the full retry cooldown when runtime download fails solely due to having no connected peers yet.

Changes:

  • Added StorageQueryError::is_no_peers() to detect “no peers were queried” (empty error list).
  • Runtime download failures now use a short 200ms retry when is_no_peers(), otherwise keep the existing retry behavior.
  • Added AsyncTree::async_op_failure_retry_at to support scheduling retries at a specific time.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| light-base/src/sync_service.rs | Adds is_no_peers() helper on StorageQueryError. |
| light-base/src/runtime_service.rs | Uses short retry (200ms) for runtime download failures caused by no peers. |
| lib/src/chain/async_tree.rs | Introduces async_op_failure_retry_at and refactors async_op_failure to use it. |


}

/// Similar to [`AsyncTree::async_op_failure`], but retries at the given time
/// instead of `now + retry_after_failed`.

Copilot AI Apr 22, 2026


async_op_failure_retry_at is a new public API but its docs omit the same # Panic contract as async_op_failure (it will also panic if AsyncOpId is invalid due to the internal unwrap()). Please document the panic conditions (and any expectations such as whether retry_after may be in the past) to keep the API contract consistent.

Suggested change:

/// instead of `now + retry_after_failed`.
///
/// `retry_after` may be in the past, in which case the operation can become immediately
/// necessary again.
///
/// # Panic
///
/// Panics if the [`AsyncOpId`] is invalid.
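The relationship between the two entry points can be illustrated with stand-in types. The real AsyncTree in lib/src/chain/async_tree.rs tracks a whole tree of operations and panics on an invalid AsyncOpId; this sketch keeps only the scheduling arithmetic, and all names below are simplified stand-ins:

```rust
use std::time::{Duration, Instant};

/// Stand-in for a single pending async operation.
struct PendingOp {
    retry_at: Option<Instant>,
}

impl PendingOp {
    /// Existing behavior: always retry after a fixed cooldown,
    /// expressed in terms of the retry-at variant.
    fn op_failure(&mut self, now: Instant, retry_after_failed: Duration) {
        self.op_failure_retry_at(now + retry_after_failed);
    }

    /// New variant: retry at an explicit time. A time in the past
    /// means the operation becomes immediately ready to retry.
    fn op_failure_retry_at(&mut self, retry_at: Instant) {
        self.retry_at = Some(retry_at);
    }
}
```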

The runtime service tries to download the finalized block runtime
immediately at startup, before peer connections are established.
This always fails with StorageQueryError { errors: [] } (no peers
to query). Previously, this triggered the full 4s retry_after_failed
cooldown, making warm start consistently ~5-7s.

Now, "no peers" errors use a 200ms retry instead of 4s. Peers
typically connect within a few hundred milliseconds, so the retry
succeeds quickly. Other errors (peer misbehavior, decode failures)
still use the full 4s cooldown.

Benchmark on Polkadot: warm start drops from ~5.5s to ~600ms.
@replghost replghost force-pushed the fix/warm-start-no-peers-retry branch from d1d5446 to a4fe507 on April 22, 2026 at 18:43
Replace the flat 200ms retry with exponential backoff (200ms, 400ms,
800ms) before falling through to the normal 4s cooldown. Prevents a
busy loop when peers are genuinely unreachable while still giving a
fast path for the common warm-start case.

Track no_peers_retry_count in the background task. Reset on success.
After 3 fast retries, fall through to the normal cooldown.

Remove misleading comment about peer connection timing.
@replghost
Contributor Author

@lexnv friendly ping — I've addressed your review comments, would you mind taking another look when you get a chance?
