
Conversation

Contributor

@mattac21 mattac21 commented Nov 27, 2025

Description

Blocks EVMExperimentalMempool.Select until legacypool.Reset (the reorg loop) has been run for the height that Select is being called on. We do this to ensure that all txs in the pending pool are valid at Select height H. If we know that the reorg loop has been run at height H, then we know that the anteHandlers for all txs in the pending pool have passed at height H, so they can be safely selected for a proposal.
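
Roughly, the gating added to Select looks like the sketch below. This is illustrative only: the exact signature may differ from this PR, and selectFromPending is a hypothetical helper; only the WaitForReorgHeight call is what this PR introduces.

    // Sketch of the Select-side gating (simplified; selectFromPending is a
    // hypothetical helper, not code from this PR).
    func (m *EVMExperimentalMempool) Select(ctx sdk.Context, _ [][]byte) mempool.Iterator {
        // Block until legacypool.Reset (the reorg loop) has run for this height.
        // After that, every tx in the pending pool has passed its anteHandler at
        // height H and can safely be included in a proposal.
        m.legacyTxPool.WaitForReorgHeight(ctx, ctx.BlockHeight())
        return m.selectFromPending(ctx)
    }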

Closes: #XXXX


Author Checklist

All items are required. Please add a note to an item if it is not applicable, and
add links to any relevant follow-up issues.

I have...

  • tackled an existing issue or discussed with a team member
  • left instructions on how to review the changes
  • targeted the main branch

@mattac21 mattac21 changed the title feat(mempool): Block EVM mempool Select on legacypool reorg at height feat(mempool): Block EVM mempool Select on legacypool reorg Nov 27, 2025
@mattac21 mattac21 changed the base branch from main to feat/krakatoa November 28, 2025 13:31
}
queuedEvents[addr].Put(tx)

case <-pool.reorgSubscriptionCh:
Contributor Author

Added the ability for arbitrary subscribers to be notified when nextDone closes (i.e. when the "next" run of the reorg loop completes). This works by a user pushing a request to subscribe onto this channel; they must then immediately listen on pool.ReorgDoneCh. This loop will push nextDone onto that channel. The user can then wait on the closure of nextDone, which essentially broadcasts to all holders of this channel (subscribers to the next run of the reorg loop) that it has completed.
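
For reference, the loop-side handling amounts to roughly the following. This is a heavily simplified sketch of geth's scheduleReorgLoop showing only the new subscription case and the pieces needed to see when nextDone closes; the real loop also handles reset/promote requests and queued tx events.

    // Heavily simplified sketch; not the full scheduleReorgLoop.
    var (
        launchNextRun bool
        nextDone      = make(chan struct{})
        curDone       chan struct{} // non-nil while a runReorg is in flight
    )
    for {
        // Launch the next background reorg if one was requested and none is running.
        if curDone == nil && launchNextRun {
            go pool.runReorg(nextDone /* , ... */) // runReorg closes its done channel when it finishes
            curDone, nextDone = nextDone, make(chan struct{})
            launchNextRun = false
        }
        select {
        case <-pool.reorgSubscriptionCh:
            // A caller wants to observe the next reorg completion. Hand it the
            // current nextDone; note we do NOT set launchNextRun, so subscribing
            // never triggers an extra run by itself.
            pool.reorgDoneCh <- nextDone
        case <-curDone:
            curDone = nil
        case <-pool.reorgShutdownCh:
            return
        }
    }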

Comment on lines +1994 to +1995
case <-ctx.Done():
return
Contributor Author

There is likely a better way to handle this, but mempool.Select doesn't return an error, so if the context is cancelled during this call then we potentially allow invalid txs to be selected, which will just be invalidated later. If this happens we likely hit timeout_propose. I'm not 100% sure what it means for the context to be cancelled here (is the app shutting down?); depending on what's actually happening, panicking may be better. This is likely a follow-up, probably not important right now.

for pool.latestReorgHeight.Load() < height {
// reorg loop has not run at the target height, subscribe to the
// outcome of the next reorg loop iteration to know when to check again
sub, err := pool.SubscribeToNextReorg()
Contributor Author

If we have not run the reorg loop for the target height yet, we wait for the outcome of the next iteration of the loop. We are explicitly not telling the reorg loop to run here, since that would simply run it again without advancing it to a new height (we would need to pass it the latest headers for that to happen). Also, if we kick off a new run here without incrementing latestReorgHeight, we will continuously kick off new reorg loops until the txpool sees a new block and the reorg runs on it, essentially doing lots of wasted work.
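
Put together, the wait amounts to roughly this sketch, assuming latestReorgHeight is an atomic int64 that the reorg loop bumps after each Reset completes (details may differ from the PR):

    // Sketch of the full wait loop used by Select (simplified).
    func (pool *LegacyPool) WaitForReorgHeight(ctx context.Context, height int64) {
        for pool.latestReorgHeight.Load() < height {
            // The reorg loop has not yet run at the target height; wait for its
            // next iteration to finish, then re-check.
            sub, err := pool.SubscribeToNextReorg()
            if err != nil {
                // The pool is shutting down; nothing left to wait for.
                return
            }
            select {
            case <-sub:
                // Next reorg iteration completed; loop and re-check the height.
            case <-ctx.Done():
                // Select cannot return an error, so give up waiting here (see
                // the discussion on the ctx.Done() case above).
                return
            }
        }
    }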


// SubscribeToNextReorg returns a channel that will close when the next reorg
// loop completes. An error is returned if the loop is shutting down.
func (pool *LegacyPool) SubscribeToNextReorg() (chan struct{}, error) {
Member

I find this logic a bit confusing because we operate with three different channels here, but I assume this is due to the code being ported from geth

Contributor Author

Yeah, this is geth code I am adding onto, and I agree the channel/subscription logic of the reorg loop is super confusing; this just adds another path to it, which I don't love. After staring at the pattern for a bit I actually don't mind it though, despite the initial learning curve.

For my own and others' future reference, since this wasn't immediately obvious to me... At a high level, they are creating a way to broadcast to multiple subscribers that the reorg loop finished. You can't easily do this by sending a message on a channel saying that something finished, since only a single subscriber will pull the message off. You could count the number of subscribers and send that many messages, but if one of those subscribers dies then you have a hanging message in the channel. To get around this, when someone subscribes the reorg loop immediately sends the subscriber a channel on the reorgDoneCh (assuming the subscriber then immediately waits to receive that channel); this received channel is now their subscription. The reorg loop also keeps a reference to this channel and will close it when its next iteration finishes. Closing a channel causes all readers of it to receive the zero value, so all subscribers treat that as "something finished".
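
As a standalone illustration of that close-to-broadcast pattern (not code from this PR; the reorg loop funnels subscriptions through a channel instead of using a mutex, but the broadcast mechanism is the same):

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    // broadcaster hands every subscriber the same "done" channel and closes it
    // once when the event fires; the close wakes all readers at the same time.
    type broadcaster struct {
        mu   sync.Mutex
        done chan struct{}
    }

    func newBroadcaster() *broadcaster {
        return &broadcaster{done: make(chan struct{})}
    }

    // subscribe returns the channel that will be closed on the next finish.
    func (b *broadcaster) subscribe() <-chan struct{} {
        b.mu.Lock()
        defer b.mu.Unlock()
        return b.done
    }

    // finish closes the current channel (waking every subscriber) and arms a new one.
    func (b *broadcaster) finish() {
        b.mu.Lock()
        defer b.mu.Unlock()
        close(b.done)
        b.done = make(chan struct{})
    }

    func main() {
        b := newBroadcaster()
        var wg sync.WaitGroup
        for i := 0; i < 3; i++ {
            wg.Add(1)
            sub := b.subscribe()
            go func(id int) {
                defer wg.Done()
                <-sub // returns immediately once the channel is closed
                fmt.Printf("subscriber %d: reorg finished\n", id)
            }(i)
        }
        time.Sleep(50 * time.Millisecond) // let subscribers start waiting
        b.finish()                        // a single close wakes all three subscribers
        wg.Wait()
    }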

Contributor Author

Previously, the only way to get this channel from reorgDoneCh was to request that the reorg loop be run, by passing it a new block header or telling it to promote txs. Here we don't actually want to do that; we simply want to wait on the result of the next run, whenever that happens (since we need to wait for a run to happen for a new block, and we can't simply pass it a new block here). So I created reorgSubscriptionCh, which users can push an empty struct onto to tell the reorg loop that we want to subscribe to the next run's result but don't actually want to start a new run.
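
The entry point then looks roughly like the sketch below (simplified; the error value is a placeholder and the real error name in the PR may differ):

    // Sketch of SubscribeToNextReorg. Pushing onto reorgSubscriptionCh does not
    // start a reorg run; it only asks the loop for the channel it will close
    // when the next run finishes.
    func (pool *LegacyPool) SubscribeToNextReorg() (chan struct{}, error) {
        select {
        case pool.reorgSubscriptionCh <- struct{}{}:
            // The loop replies on reorgDoneCh with its current nextDone channel.
            return <-pool.reorgDoneCh, nil
        case <-pool.reorgShutdownCh:
            return nil, errors.New("reorg loop is shutting down") // placeholder error
        }
    }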


// Wait for the legacypool to Reset at >= blockHeight (this may have
// already happened), to ensure all txs in pending pool are valid.
m.legacyTxPool.WaitForReorgHeight(ctx, ctx.BlockHeight())
Contributor

Is the idea to put a timeout into the PrepareProposalHandler such that an issue with the mempool won't block consensus? Or did you have some other way you anticipated handling long waits here?

Contributor Author

I wasn't planning on adding any app-side timeout to prepare proposal, or on having a specific way of handling a stall here, for now. My thought was just that if we have some issue here and the mempool isn't advancing, then with our new design the mempool needs to block consensus, since we can't be sure we won't propose a bad block if this condition isn't met. Instead we should just let Comet's timeout_propose run out and have a new proposer try.

We could maintain some kind of cursor as we validate txs in the pending pool via ante handlers, plus a timeout: if we are in the middle of validating when the timeout fires, we only allow the txs that have already been validated to be returned via Select.

@swift1337 brought up a good idea during our sync this morning: we should add a limit to how many txs we validate during the reorg loop, since if we have a massive backlog in the pending/queued pool, we may reach a point where we cannot validate fast enough within a single block time. We thought about adding a timeout here, but if we add a timeout without some way of then clearing txs out of the mempool, we will just continue to time out whenever we try to select. So I think the better solution is some sort of maxBytesValidated or maxGasValidated, similar to creating blocks (though we may also need the cursor kind of solution to know where validation stopped).
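
Purely to illustrate the idea (hypothetical names; none of this is in the PR), a gas-bounded validation pass with a resumable cursor might look something like:

    // Hypothetical sketch only: maxGasValidated, the cursor helpers and
    // validateWithAnteHandler do not exist in this PR.
    func (pool *LegacyPool) validatePendingBounded(maxGasValidated uint64) {
        var gasUsed uint64
        for _, tx := range pool.pendingFromCursor() { // resume where the last pass stopped
            if gasUsed+tx.Gas() > maxGasValidated {
                pool.saveCursor(tx) // remember the stop point for the next reorg iteration
                return
            }
            gasUsed += tx.Gas()
            pool.validateWithAnteHandler(tx) // drop the tx on failure (omitted)
        }
        pool.clearCursor() // validated everything; start from the top next time
    }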

Either way, the solution is non-trivial IMO and we should probably just address it in a new ticket later (I'm assuming this won't be a blocker for getting some initial numbers on how this performs).

Contributor

Yeah, makes sense. The failure case is probably the same as what we currently have, where if max bytes/gas is sufficiently high you'll sit in Select for minutes even when the chain has already timed out.

Member

It makes sense to benchmark some worst-case scenarios for the reorg loop, based on the number of transactions and how many need to be moved/reordered, and recommend some mempool parameters based on that. Do we know, for example, how long validating 20k transactions would take?

