feat: async payload builder #438

Open
julio4 wants to merge 7 commits into main from reapply-async-payload-builder

Conversation

@julio4
Member

@julio4 julio4 commented Mar 10, 2026

See (#394, #397, #398)

structured cancellation, select! state machine, tracing spans:

  • PayloadJobCancellation replaces single CancellationToken with new_fcu/resolved/deadline/any tokens for deterministic cancellation
  • Flashblock loop refactored to biased select! state machine
  • Resolved gate prevents publishing flashblocks after getPayload
  • Tracing spans: build_fallback, build_flashblock (with index/tx_count/gas_used), execute_pool_txs, seal_block, state_root
  • New metrics: payload_job_cancellation_{resolved,new_fcu,deadline,complete,error}, flashblock_publish_suppressed_total
  • CancellationReason enum with typed reason values
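A minimal sketch of the "first reason wins" idea behind typed cancellation reasons. It uses only the standard library as a stand-in for real async cancellation tokens, and the field name `state`, the numeric encoding, and the `cancel`/`is_cancelled` methods are assumptions of this example rather than the PR's actual API:

```rust
use std::sync::atomic::{AtomicU8, Ordering};
use std::sync::Arc;

/// Why a payload job stopped. The variants follow the token names in the
/// description above; the numeric values are an assumption of this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CancellationReason {
    NewFcu = 1,
    Resolved = 2,
    Deadline = 3,
}

/// Std-only stand-in for a group of cancellation tokens. The first reason
/// recorded wins, so a later trigger cannot overwrite it.
#[derive(Debug, Clone, Default)]
pub struct PayloadJobCancellation {
    state: Arc<AtomicU8>, // 0 means "not cancelled"
}

impl PayloadJobCancellation {
    pub fn new() -> Self {
        Self::default()
    }

    /// Records `reason` if nothing was recorded yet.
    /// Returns true only for the call that actually cancelled the job.
    pub fn cancel(&self, reason: CancellationReason) -> bool {
        self.state
            .compare_exchange(0, reason as u8, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
    }

    /// Rough equivalent of the "any" token: true once any reason has fired.
    pub fn is_cancelled(&self) -> bool {
        self.state.load(Ordering::Acquire) != 0
    }

    /// The reason that fired first, if any.
    pub fn reason(&self) -> Option<CancellationReason> {
        match self.state.load(Ordering::Acquire) {
            1 => Some(CancellationReason::NewFcu),
            2 => Some(CancellationReason::Resolved),
            3 => Some(CancellationReason::Deadline),
            _ => None,
        }
    }
}
```

Because clones share the same atomic, every holder observes the same reason, which is what makes the outcome deterministic when several triggers fire close together.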

@julio4 julio4 marked this pull request as draft March 10, 2026 03:16
@julio4 julio4 force-pushed the reapply-async-payload-builder branch from 0ec6a44 to f5cef26 Compare March 13, 2026 15:25
@julio4 julio4 marked this pull request as ready for review March 17, 2026 08:05
julio4 added 3 commits March 19, 2026 14:47
@julio4 julio4 force-pushed the reapply-async-payload-builder branch from 4ae8d72 to 628c35b Compare March 19, 2026 06:49
@avalonche
Collaborator

mostly non-blocking comments, LGTM

@julio4 julio4 force-pushed the reapply-async-payload-builder branch from da45de0 to c13537b Compare March 20, 2026 08:23
Contributor

@akundaz akundaz left a comment

Looks pretty good! Left some comments, but nothing major

}

#[derive(Debug)]
pub(super) struct OpPayloadBuilderInner<Pool, Client, BuilderTx, Tasks> {
Contributor

Why did you encapsulate these into an inner struct? They're meant to be cheap to clone already, now you have Arcs nested inside each other

Member Author

Hmm, they're cheap but not O(1) cheap: 3 Arc clones, 2 Sender clones with atomics, Pool/Client/Tasks/BuilderTx clones, and a BuilderConfig deep clone.
Versus one Arc clone with a single atomic increment in the hot path of cloning the builder to send it to blocking tasks (no clones were needed before, as everything was sync).
The nested Arc only adds one dereference, which is negligible compared to the extra allocations and Clone trait bounds.

It's true that ws_pub is only owned by BuilderInner, so we can just remove the inner Arc.
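The trade-off described in this reply can be sketched as follows. The `Builder` and `BuilderInner` types and their fields are hypothetical stand-ins (not the real `OpPayloadBuilderInner` fields), and `std::thread::spawn` stands in for handing work to a blocking task:

```rust
use std::sync::Arc;
use std::thread;

/// Hypothetical inner state; `config` stands in for something that would be
/// expensive to deep-clone on every job.
#[derive(Debug)]
struct BuilderInner {
    name: String,
    config: Vec<u8>,
}

/// Cloning the outer handle is a single atomic increment on the Arc,
/// regardless of how many fields the inner struct has.
#[derive(Debug, Clone)]
pub struct Builder {
    inner: Arc<BuilderInner>,
}

impl Builder {
    pub fn new(name: &str, config: Vec<u8>) -> Self {
        Self {
            inner: Arc::new(BuilderInner {
                name: name.to_string(),
                config,
            }),
        }
    }

    /// Number of live handles sharing the same inner state.
    pub fn handle_count(&self) -> usize {
        Arc::strong_count(&self.inner)
    }

    /// Moves a cheap clone of the handle into another thread and waits for it.
    pub fn run_blocking_job(&self) -> String {
        let this = self.clone(); // one atomic increment, no deep clone
        thread::spawn(move || format!("{}:{}", this.inner.name, this.inner.config.len()))
            .join()
            .expect("job thread panicked")
    }
}
```

The cost of this layout is one extra pointer dereference when reading fields through `inner`, which is the overhead the reply calls negligible.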


// If main token got canceled in here that means we received get_payload and we should drop everything and now update best_payload
// To ensure that we will return same blocks as rollup-boost (to leverage caches)
// Block canceled (new FCU, getPayload resolved, or deadline).
Contributor

Since we're doing async now, can we race this function against the cancel token instead of needing to check the token in this function?

Member Author

@julio4 julio4 Mar 24, 2026

We already race it in phase 2, but the whole build_next_flashblock runs in a blocking task, so it is not async and can't yield back. Here we synchronously poll the cancellation token to catch a cancellation triggered between tx execution and publishing, so we can be sure we don't publish an outdated flashblock. This will be greatly improved with continuous building, which separates building from publishing so that we can properly race it in an async context.
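A minimal sketch of that synchronous check, assuming a plain `AtomicBool` in place of the real cancellation token. The function signature, the `BuildOutcome` type, and the `published` vector are hypothetical and only illustrate where the check sits relative to execution and publishing:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical result of one flashblock build step.
#[derive(Debug, PartialEq, Eq)]
pub enum BuildOutcome {
    Published { tx_count: usize },
    Suppressed,
}

/// Runs entirely synchronously, as it would inside a blocking task.
/// `execute` stands in for transaction execution; `published` stands in
/// for the publishing sink and records the tx count of each flashblock.
pub fn build_next_flashblock(
    txs: &[u64],
    cancelled: &AtomicBool,
    published: &mut Vec<usize>,
    mut execute: impl FnMut(u64),
) -> BuildOutcome {
    // Step 1: execute transactions. This code cannot yield to an async
    // runtime, so a cancellation may arrive while it is running.
    let mut tx_count = 0;
    for &tx in txs {
        execute(tx);
        tx_count += 1;
    }

    // Step 2: poll the flag once more right before publishing, so that a
    // cancellation raised during execution suppresses the stale flashblock.
    if cancelled.load(Ordering::Acquire) {
        return BuildOutcome::Suppressed;
    }

    // Step 3: publish.
    published.push(tx_count);
    BuildOutcome::Published { tx_count }
}
```

A small window remains between the check and the publish call. Closing it fully requires racing the publish step itself in an async context, which is what the reply expects continuous building to enable.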

cancellation: &super::cancellation::PayloadJobCancellation,
span: &tracing::Span,
) {
let reason_str = match cancellation.reason() {
Contributor

nit: would be nice to have a Display impl for CancellationReason and to use labels for the payload job cancellation metric instead of defining separate metrics

Member Author

I initially tried that but didn't find a way to define a single metric field in OpRBuilderMetrics with a custom label.

Contributor

It's very awkward you can't put it in the struct. See https://github.com/flashbots/op-rbuilder/pull/388/changes for an example
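To illustrate the suggestion in this thread, here is a sketch of a `Display` impl plus a single counter keyed by a reason label. The variant set follows the metric suffixes listed in the PR description, which is an assumption about the real enum, and `LabeledCounter` is an in-memory stand-in for a metrics recorder, not the approach used in the linked PR:

```rust
use std::collections::HashMap;
use std::fmt;

/// Assumed variant set, derived from the metric name suffixes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CancellationReason {
    Resolved,
    NewFcu,
    Deadline,
    Complete,
    Error,
}

impl CancellationReason {
    /// Stable label value, usable both in logs and as a metric label.
    pub const fn as_str(self) -> &'static str {
        match self {
            Self::Resolved => "resolved",
            Self::NewFcu => "new_fcu",
            Self::Deadline => "deadline",
            Self::Complete => "complete",
            Self::Error => "error",
        }
    }
}

impl fmt::Display for CancellationReason {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(self.as_str())
    }
}

/// In-memory stand-in for one counter with a `reason` label.
#[derive(Debug, Default)]
pub struct LabeledCounter {
    counts: HashMap<&'static str, u64>,
}

impl LabeledCounter {
    pub fn increment(&mut self, reason: CancellationReason) {
        *self.counts.entry(reason.as_str()).or_insert(0) += 1;
    }

    pub fn get(&self, reason: CancellationReason) -> u64 {
        self.counts.get(reason.as_str()).copied().unwrap_or(0)
    }
}
```

With this shape, adding a new reason means adding one enum variant and one match arm, rather than declaring another metric field.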

@akundaz
Contributor

akundaz commented Mar 23, 2026

Also, make sure the logs look nice in staging with the new spans. rollup-boost spans are a bit difficult to go through and we should avoid that friction

@julio4
Member Author

julio4 commented Mar 24, 2026

> Also, make sure the logs look nice in staging with the new spans. rollup-boost spans are a bit difficult to go through and we should avoid that friction

Spans are gated behind the telemetry flag, so they're mostly for local testing for now. There's a plan to enable them in production once we have a proper trace collector, which should ensure they don't clutter the logs and are actually useful.


3 participants