
fix: catchup requeueing events that are completed #744

Merged
ChaoticTempest merged 2 commits into develop from phuong/fix/backlog-remove-after-catchup
Apr 9, 2026

Conversation

@ChaoticTempest
Contributor

The previous fix #733 didn't entirely fix the catchup problem. This should now truly fix it: Respond events during catchup will remove events from the recovered backlog entries. The previous fix took a slice/view of the recovered entries but never removed entries when catchup completed.

We should be able to merge #697 after this one lands, with some verification. But we probably still need to bump the storage version one more time, since the backlog entries on devnet are stale.


Copilot AI left a comment


Pull request overview

This PR addresses a remaining catchup/recovery bug where recovered backlog entries could be requeued even after they were completed or replaced during catchup. It changes recovery to track recovered SignIds inside Backlog and only requeue those still eligible at the time catchup completes.

Changes:

  • Replace snapshot-based recovered-entry requeueing with Backlog-tracked recovered SignId sets that are unmarked on insert/remove.
  • Update stream recovery plumbing to return only RecoveryRequeueMode and requeue via Backlog::take_requeueable_requests.
  • Add a regression test ensuring a recovered Ethereum entry replaced during catchup is not requeued after catchup completes.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

  • chain-signatures/node/src/stream/ops.rs: Simplifies the recovery return value and requeues recovered requests via Backlog::take_requeueable_requests.
  • chain-signatures/node/src/stream/mod.rs: Uses the new recovery API and requeues after catchup without relying on a recovered snapshot.
  • chain-signatures/node/src/backlog/mod.rs: Introduces recovered_requests tracking and exposes take_requeueable_requests to safely requeue only still-valid recovered entries.


Comment on lines 683 to +689
let requests = self.requests.read().await;
let mut recovered = HashMap::new();
for &chain in chains {
    if let Some(pending) = requests.get(&chain) {
        let requeue_mode = recovered_modes.get(&chain).copied().unwrap_or_default();
        recovered.insert(
            chain,
            RecoveredChainRequests {
                pending: pending
                    .requests
                    .iter()
                    .map(|(id, entry)| (*id, entry.clone()))
                    .collect(),
                requeue_mode,
            },
        );
        let sign_ids: HashSet<_> = pending.requests.keys().copied().collect();
        if !sign_ids.is_empty() {
            self.mark_recovered_requests(chain, sign_ids).await;
        }

Copilot AI Apr 9, 2026


Backlog::recover holds self.requests.read().await across an .await when calling mark_recovered_requests(...). Awaiting while holding a Tokio RwLock guard can cause unnecessary contention (blocking writers like insert/remove/advance) and can contribute to deadlock scenarios if other code later introduces different lock ordering. Consider collecting the recovered sign_ids for each chain while holding the read lock, then dropping the guard before awaiting (or acquiring recovered_requests once and updating it without further awaits).

Contributor

@jakmeier jakmeier left a comment


Respond events during catchup will remove events from the recovered backlog entries

I see, this makes sense. We definitely need that functionality.

But the recovered_requests with mark/unmark logic feels a bit like an unnatural layer tacked on top of the existing flow.

I wonder, wouldn't it be simpler to have an order like:

  1. Read checkpoint from persistent storage and use it to recover a Backlog. (Without queuing, yet)
  2. Catch up and remove requests for Respond events from the Backlog as usual
  3. Once catch up has finished, requeue all requests still in the Backlog
  4. Start normal work:
    • Start processing new blocks
    • Start participating in signing protocols

At least that sounds more straightforward and maintainable to me. But there's a good chance I'm missing something. As always, I'm open to change my mind.

@ChaoticTempest
Contributor Author

But the recovered_requests with mark/unmark logic feels a bit like an unnatural layer tacked on top of the existing flow.

I wonder, wouldn't it be simpler to have an order like:

  1. Read checkpoint from persistent storage and use it to recover a Backlog. (Without queuing, yet)
  2. Catch up and remove requests for Respond events from the Backlog as usual
  3. Once catch up has finished, requeue all requests still in the Backlog
  4. Start normal work:
    • Start processing new blocks
    • Start participating in signing protocols

At least that sounds more straightforward and maintainable to me. But there's a good chance I'm missing something. As always, I'm open to change my mind.

@jakmeier maybe I'm not understanding you correctly, but 1, 2, and 4 are what we do here. All this PR is doing is marking whichever requests were recovered in step 1. The catchup runs in parallel to the main chain event loop in run_stream, where we're enqueueing requests, so it eventually removes items that were recovered.

For 3, "Once catch up has finished, requeue all requests still in the Backlog": we can't exactly do this, because catchup adds SignRequests to the backlog too, and those requests already get enqueued in the main chain event loop. So we need to mark which requests were recovered in order to re-enqueue only those. I think for consistency, re-enqueueing should eventually emit to the main chain event loop to deduplicate logic, but that's not an issue for now.

@ChaoticTempest
Copy link
Copy Markdown
Contributor Author

Or we could make 3 work, but that would require waiting for catchup to complete, enqueueing all events to our channel (which would need buffering), and only then starting the regular chain stream event loop.

@jakmeier
Contributor

jakmeier commented Apr 9, 2026

The catchup is in parallel to the main chain event loop in run_stream where we're enqueueing requests so it eventually removes items that are recovered.

Hm yes. My suggestion is to not have these two things in parallel. Finish catch up first, only then start normal operation.

For 3, Once catch up has finished, requeue all requests still in the Backlog. We can't exactly do this, because catchup adds SignRequests to the backlog too. That request already gets enqueued in the main chain event loop. So we need mark which requests were recovered to re-enqueue. I think for consistency eventually, we should have re-enqueueing emit to the main chain event loop to deduplicate logic but not an issue for now.

I think it would be better not to enqueue anything during catchup. Chances are we find responses later that cancel out requests.

or we could make 3 work, but then that would require waiting for catchup to complete, enqueue all events to our channel (would need to buffer), then actually start the regular chain stream event loop

Yeah that's exactly what I had in mind.

@ChaoticTempest
Contributor Author

Looking at it again, we can't do 3 until the linearization of ethereum lands (with #743). Not easily, at least, since it would introduce more duplicate code than I'd like, because the chain event loop does a lot of heavy lifting with backlog manipulations. So it's best to keep it all in one place.

Why duplicate the event loop? Because ethereum doesn't send events linearly with this PR, live events can interleave into the chain loop, where we also process catchup-related chain events. Avoiding that interleaving would require a separate chain event loop just for catchup.

It would be a lot simpler once #743 lands, but I feel like that PR might take some time, and I'd like to get this one in first to see if it fixes some things. @jakmeier what do you think? It's pretty easy to remove all this mark-related stuff afterwards since it's just code cruft.

@jakmeier
Contributor

jakmeier commented Apr 9, 2026

I see, thanks for explaining!

I'd like to get this one in first to see if it fixes some things. @jakmeier what do you think? It's pretty easy to remove all this mark-related stuff afterwards since it's just code cruft

True, it would be easy to remove again. I don't love the idea that we need to merge things to develop to test whether they fix a given problem, but I suppose that is a more general problem.

But yeah as far as I'm concerned we can move ahead with this before #743 and consider cleaning up the architecture afterwards. I don't want to delay things or force you to duplicate the event loop.

@ChaoticTempest force-pushed the phuong/fix/backlog-remove-after-catchup branch from e86ce81 to 8632167 on April 9, 2026 at 15:18
@ChaoticTempest
Contributor Author

alright should be good to go now, just need an approval

@ChaoticTempest ChaoticTempest merged commit ac29c62 into develop Apr 9, 2026
3 checks passed
@ChaoticTempest ChaoticTempest deleted the phuong/fix/backlog-remove-after-catchup branch April 9, 2026 17:34
