fix: catchup requeueing events that are completed #744
ChaoticTempest merged 2 commits into develop from
Conversation
Pull request overview
This PR addresses a remaining catchup/recovery bug where recovered backlog entries could be requeued even after they were completed or replaced during catchup. It changes recovery to track recovered SignIds inside Backlog and only requeue those still eligible at the time catchup completes.
Changes:
- Replace snapshot-based recovered-entry requeueing with Backlog-tracked recovered SignId sets that are unmarked on insert/remove.
- Update stream recovery plumbing to return only RecoveryRequeueMode and requeue via Backlog::take_requeueable_requests.
- Add a regression test ensuring a recovered Ethereum entry replaced during catchup is not requeued after catchup completes.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| chain-signatures/node/src/stream/ops.rs | Simplifies the recovery return value and requeues recovered requests via Backlog::take_requeueable_requests. |
| chain-signatures/node/src/stream/mod.rs | Uses the new recovery API and requeues after catchup without relying on a recovered snapshot. |
| chain-signatures/node/src/backlog/mod.rs | Introduces recovered_requests tracking and exposes take_requeueable_requests to safely requeue only still-valid recovered entries. |
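The tracking described above can be sketched minimally as follows. Only the names `Backlog`, `SignId`, and `take_requeueable_requests` come from the PR; the `Request` struct, the field layout, and the `recover` helper are illustrative assumptions, not the node's real types:

```rust
use std::collections::{HashMap, HashSet};

type SignId = u64;

// Hypothetical request payload; the real node carries richer data.
#[derive(Clone, Debug)]
struct Request {
    id: SignId,
}

#[derive(Default)]
struct Backlog {
    requests: HashMap<SignId, Request>,
    // SignIds restored from the checkpoint that are still candidates
    // for requeueing once catchup finishes.
    recovered: HashSet<SignId>,
}

impl Backlog {
    /// Restore entries from the persisted checkpoint, marking each id
    /// as recovered. Nothing is queued yet.
    fn recover(&mut self, reqs: impl IntoIterator<Item = Request>) {
        for req in reqs {
            self.recovered.insert(req.id);
            self.requests.insert(req.id, req);
        }
    }

    /// A live insert supersedes the recovered copy, so the id is
    /// unmarked and will not be requeued later.
    fn insert(&mut self, req: Request) {
        self.recovered.remove(&req.id);
        self.requests.insert(req.id, req);
    }

    /// Remove a request (e.g. a Respond event seen during catchup)
    /// and unmark it so it is never requeued.
    fn remove(&mut self, id: &SignId) -> Option<Request> {
        self.recovered.remove(id);
        self.requests.remove(id)
    }

    /// Called once catchup completes: drain only the recovered entries
    /// that were neither completed nor replaced in the meantime.
    fn take_requeueable_requests(&mut self) -> Vec<Request> {
        let ids: Vec<SignId> = self.recovered.drain().collect();
        ids.iter()
            .filter_map(|id| self.requests.remove(id))
            .collect()
    }
}
```

The key property is that `insert` and `remove` both unmark, so by the time `take_requeueable_requests` runs, only still-eligible recovered entries remain.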
```rust
let requests = self.requests.read().await;
let mut recovered = HashMap::new();
for &chain in chains {
    if let Some(pending) = requests.get(&chain) {
        let requeue_mode = recovered_modes.get(&chain).copied().unwrap_or_default();
        recovered.insert(
            chain,
            RecoveredChainRequests {
                pending: pending
                    .requests
                    .iter()
                    .map(|(id, entry)| (*id, entry.clone()))
                    .collect(),
                requeue_mode,
            },
        );
        let sign_ids: HashSet<_> = pending.requests.keys().copied().collect();
        if !sign_ids.is_empty() {
            self.mark_recovered_requests(chain, sign_ids).await;
        }
    }
}
```
Backlog::recover holds self.requests.read().await across an .await when calling mark_recovered_requests(...). Awaiting while holding a Tokio RwLock guard can cause unnecessary contention (blocking writers like insert/remove/advance) and can contribute to deadlock scenarios if other code later introduces different lock ordering. Consider collecting the recovered sign_ids for each chain while holding the read lock, then dropping the guard before awaiting (or acquiring recovered_requests once and updating it without further awaits).
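The suggested fix is a two-phase pattern: clone what you need under the read guard, drop the guard, then do the awaiting update. A sketch with simplified placeholder types (in the node these are `tokio::sync::RwLock`s and phase 2 awaits; holding the read guard across that await is what blocks writers like insert/remove/advance):

```rust
use std::collections::{HashMap, HashSet};
use std::sync::RwLock;

type Chain = u8;
type SignId = u64;

// Illustrative stand-in for the node's Backlog, not its real layout.
struct Backlog {
    requests: RwLock<HashMap<Chain, HashSet<SignId>>>,
    recovered: RwLock<HashMap<Chain, HashSet<SignId>>>,
}

impl Backlog {
    fn recover(&self, chains: &[Chain]) {
        // Phase 1: clone the needed sign ids while holding only the
        // read lock on `requests`.
        let collected: Vec<(Chain, HashSet<SignId>)> = {
            let requests = self.requests.read().unwrap();
            chains
                .iter()
                .filter_map(|c| requests.get(c).map(|ids| (*c, ids.clone())))
                .filter(|(_, ids)| !ids.is_empty())
                .collect()
        }; // guard dropped here -- in the async version, before any .await

        // Phase 2: update the recovered set with no `requests` lock held.
        let mut recovered = self.recovered.write().unwrap();
        for (chain, ids) in collected {
            recovered.entry(chain).or_default().extend(ids);
        }
    }
}
```

The explicit inner scope makes the guard's lifetime obvious, so no lock on `requests` is ever held while the second lock (or, in the async version, an `.await`) is taken.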
jakmeier left a comment
Respond events during catchup will remove events from the recovered backlog entries
I see, this makes sense. We definitely need that functionality.
But the recovered_requests with mark/unmark logic feels a bit like an unnatural layer tagged on top of the existing flow.
I wonder, wouldn't it be simpler to have an order like:
- Read checkpoint from persistent storage and use it to recover a Backlog. (Without queuing, yet)
- Catch up and remove requests for Respond events from the Backlog as usual
- Once catch up has finished, requeue all requests still in the Backlog
- Start normal work:
- Start processing new blocks
- Start participating in signing protocols
At least that sounds more straightforward and maintainable to me. But there is a good chance I'm missing something. As always, I'm open to changing my mind.
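The sequential ordering proposed above can be sketched end to end. Everything here is an illustrative stand-in (the `Node`, its fields, and the step methods are assumptions, not the node's real API); only the ordering itself comes from the comment:

```rust
use std::collections::HashMap;

type SignId = u64;

// Hypothetical minimal node state for illustrating the startup order.
#[derive(Default)]
struct Node {
    backlog: HashMap<SignId, String>, // recovered, not yet queued
    queue: Vec<(SignId, String)>,     // requests handed to signing
}

impl Node {
    /// Step 1: recover the Backlog from a checkpoint, without queuing.
    fn recover(&mut self, checkpoint: Vec<(SignId, String)>) {
        self.backlog.extend(checkpoint);
    }

    /// Step 2: catch up; Respond events simply remove Backlog entries.
    fn catch_up(&mut self, responded: &[SignId]) {
        for id in responded {
            self.backlog.remove(id);
        }
    }

    /// Step 3: once catchup has finished, requeue whatever is still
    /// in the Backlog. Step 4 (normal operation) starts only after this.
    fn requeue_remaining(&mut self) {
        let mut remaining: Vec<_> = self.backlog.drain().collect();
        remaining.sort_by_key(|(id, _)| *id);
        self.queue.extend(remaining);
    }
}
```

Because nothing is queued until step 3, a Respond found during catchup cancels its request for free, with no mark/unmark bookkeeping needed.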
@jakmeier maybe I'm not understanding you correctly, but 1, 2, and 4 are what we do here. All this PR is doing is marking whichever requests were recovered in 1. The catchup runs in parallel to the main chain event loop.
For 3, we could make it work, but that would require waiting for catchup to complete, enqueuing all events to our channel (we would need to buffer), and only then starting the regular chain stream event loop.
Hm yes. My suggestion is to not have these two things in parallel. Finish catch up first, only then start normal operation.
I think it would be better not to enqueue anything during catchup. Chances are we find responses later that cancel out requests.
Yeah, that's exactly what I had in mind.
Looking at it again, we can't do 3 until the linearization of Ethereum (with #743). Not easily, at least, since it would introduce more duplicate code than I would like, due to that chain event loop doing a lot of heavy lifting with backlog manipulations. So it would be best to keep it all in one place. Why duplicate the event loop? Because Ethereum does not send events linearly with this PR, live events can interleave into the chain loop where we also process catchup-related chain events. Keeping them from interleaving would require a separate chain event loop just for catchup. It would be a lot simpler once #743 lands, but I feel like that PR might take some time, and I'd like to get this one in first to see if it fixes some things. @jakmeier what do you think? It's pretty easy to remove all this mark-related stuff afterwards since it's just code cruft.
I see, thanks for explaining!
True, it would be easy to remove again. I don't love the idea that we need to merge things into develop to test whether they fix a given problem, but I suppose that is a more general issue. As far as I'm concerned, though, we can move ahead with this before #743 and consider cleaning up the architecture afterwards. I don't want to delay things or force you to duplicate the event loop.
Force-pushed from e86ce81 to 8632167
Alright, this should be good to go now; it just needs an approval.
The previous fix #733 didn't entirely resolve the catchup problem. This should now truly fix it: Respond events during catchup will remove events from the recovered backlog entries. The previous fix took a slice/view of the recovered entries but didn't remove entries at all when catchup completed.
We should be able to merge #697 after this one lands, with some verification. But we probably still need to bump the storage version one more time, since the backlog entries on devnet are stale.