fix: catchup requeueing events that are completed #744
ChaoticTempest merged 2 commits into develop from
Conversation
Pull request overview
This PR addresses a remaining catchup/recovery bug where recovered backlog entries could be requeued even after they were completed or replaced during catchup. It changes recovery to track recovered SignIds inside Backlog and only requeue those still eligible at the time catchup completes.
Changes:
- Replace snapshot-based recovered-entry requeueing with Backlog-tracked recovered SignId sets that are unmarked on insert/remove.
- Update stream recovery plumbing to return only RecoveryRequeueMode and requeue via Backlog::take_requeueable_requests.
- Add a regression test ensuring a recovered Ethereum entry replaced during catchup is not requeued after catchup completes.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| chain-signatures/node/src/stream/ops.rs | Simplifies the recovery return value and requeues recovered requests via Backlog::take_requeueable_requests. |
| chain-signatures/node/src/stream/mod.rs | Uses the new recovery API and requeues after catchup without relying on a recovered snapshot. |
| chain-signatures/node/src/backlog/mod.rs | Introduces recovered_requests tracking and exposes take_requeueable_requests to safely requeue only still-valid recovered entries. |
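The tracking described above can be sketched minimally as follows. Only the names `Backlog`, `SignId`, and `take_requeueable_requests` come from the PR; the `Request` struct, the field layout, and the `recover` helper are illustrative assumptions, not the node's real types:

```rust
use std::collections::{HashMap, HashSet};

type SignId = u64;

// Hypothetical request payload; the real node carries richer data.
#[derive(Clone, Debug)]
struct Request {
    id: SignId,
}

#[derive(Default)]
struct Backlog {
    requests: HashMap<SignId, Request>,
    // SignIds restored from the checkpoint that are still candidates
    // for requeueing once catchup finishes.
    recovered: HashSet<SignId>,
}

impl Backlog {
    /// Restore entries from the persisted checkpoint, marking each id
    /// as recovered. Nothing is queued yet.
    fn recover(&mut self, reqs: impl IntoIterator<Item = Request>) {
        for req in reqs {
            self.recovered.insert(req.id);
            self.requests.insert(req.id, req);
        }
    }

    /// A live insert supersedes the recovered copy, so the id is
    /// unmarked and will not be requeued later.
    fn insert(&mut self, req: Request) {
        self.recovered.remove(&req.id);
        self.requests.insert(req.id, req);
    }

    /// Remove a request (e.g. a Respond event seen during catchup)
    /// and unmark it so it is never requeued.
    fn remove(&mut self, id: &SignId) -> Option<Request> {
        self.recovered.remove(id);
        self.requests.remove(id)
    }

    /// Called once catchup completes: drain only the recovered entries
    /// that were neither completed nor replaced in the meantime.
    fn take_requeueable_requests(&mut self) -> Vec<Request> {
        let ids: Vec<SignId> = self.recovered.drain().collect();
        ids.iter()
            .filter_map(|id| self.requests.remove(id))
            .collect()
    }
}
```

The key property is that `insert` and `remove` both unmark, so by the time `take_requeueable_requests` runs, only still-eligible recovered entries remain.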
```rust
let requests = self.requests.read().await;
let mut recovered = HashMap::new();
for &chain in chains {
    if let Some(pending) = requests.get(&chain) {
        let requeue_mode = recovered_modes.get(&chain).copied().unwrap_or_default();
        recovered.insert(
            chain,
            RecoveredChainRequests {
                pending: pending
                    .requests
                    .iter()
                    .map(|(id, entry)| (*id, entry.clone()))
                    .collect(),
                requeue_mode,
            },
        );
        let sign_ids: HashSet<_> = pending.requests.keys().copied().collect();
        if !sign_ids.is_empty() {
            self.mark_recovered_requests(chain, sign_ids).await;
        }
    }
}
```
Backlog::recover holds self.requests.read().await across an .await when calling mark_recovered_requests(...). Awaiting while holding a Tokio RwLock guard can cause unnecessary contention (blocking writers like insert/remove/advance) and can contribute to deadlock scenarios if other code later introduces different lock ordering. Consider collecting the recovered sign_ids for each chain while holding the read lock, then dropping the guard before awaiting (or acquiring recovered_requests once and updating it without further awaits).
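The suggested fix is a two-phase pattern: clone what you need under the read guard, drop the guard, then do the awaiting update. A sketch with simplified placeholder types (in the node these are `tokio::sync::RwLock`s and phase 2 awaits; holding the read guard across that await is what blocks writers like insert/remove/advance):

```rust
use std::collections::{HashMap, HashSet};
use std::sync::RwLock;

type Chain = u8;
type SignId = u64;

// Illustrative stand-in for the node's Backlog, not its real layout.
struct Backlog {
    requests: RwLock<HashMap<Chain, HashSet<SignId>>>,
    recovered: RwLock<HashMap<Chain, HashSet<SignId>>>,
}

impl Backlog {
    fn recover(&self, chains: &[Chain]) {
        // Phase 1: clone the needed sign ids while holding only the
        // read lock on `requests`.
        let collected: Vec<(Chain, HashSet<SignId>)> = {
            let requests = self.requests.read().unwrap();
            chains
                .iter()
                .filter_map(|c| requests.get(c).map(|ids| (*c, ids.clone())))
                .filter(|(_, ids)| !ids.is_empty())
                .collect()
        }; // guard dropped here -- in the async version, before any .await

        // Phase 2: update the recovered set with no `requests` lock held.
        let mut recovered = self.recovered.write().unwrap();
        for (chain, ids) in collected {
            recovered.entry(chain).or_default().extend(ids);
        }
    }
}
```

The explicit inner scope makes the guard's lifetime obvious, so no lock on `requests` is ever held while the second lock (or, in the async version, an `.await`) is taken.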
jakmeier left a comment
Respond events during catchup will remove events from the recovered backlog entries
I see, this makes sense. We definitely need that functionality.
But the recovered_requests with mark/unmark logic feels a bit like an unnatural layer tagged on top of the existing flow.
I wonder, wouldn't it be simpler to have an order like:
- Read checkpoint from persistent storage and use it to recover a Backlog. (Without queuing, yet)
- Catch up and remove requests for Respond events from the Backlog as usual
- Once catch up has finished, requeue all requests still in the Backlog
- Start normal work:
- Start processing new blocks
- Start participating in signing protocols
At least that sounds more straightforward and maintainable to me. But there is a good chance I'm missing something. As always, I'm open to changing my mind.
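The sequential ordering proposed above can be sketched end to end. Everything here is an illustrative stand-in (the `Node`, its fields, and the step methods are assumptions, not the node's real API); only the ordering itself comes from the comment:

```rust
use std::collections::HashMap;

type SignId = u64;

// Hypothetical minimal node state for illustrating the startup order.
#[derive(Default)]
struct Node {
    backlog: HashMap<SignId, String>, // recovered, not yet queued
    queue: Vec<(SignId, String)>,     // requests handed to signing
}

impl Node {
    /// Step 1: recover the Backlog from a checkpoint, without queuing.
    fn recover(&mut self, checkpoint: Vec<(SignId, String)>) {
        self.backlog.extend(checkpoint);
    }

    /// Step 2: catch up; Respond events simply remove Backlog entries.
    fn catch_up(&mut self, responded: &[SignId]) {
        for id in responded {
            self.backlog.remove(id);
        }
    }

    /// Step 3: once catchup has finished, requeue whatever is still
    /// in the Backlog. Step 4 (normal operation) starts only after this.
    fn requeue_remaining(&mut self) {
        let mut remaining: Vec<_> = self.backlog.drain().collect();
        remaining.sort_by_key(|(id, _)| *id);
        self.queue.extend(remaining);
    }
}
```

Because nothing is queued until step 3, a Respond found during catchup cancels its request for free, with no mark/unmark bookkeeping needed.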
@jakmeier maybe I'm not understanding you correctly, but 1, 2, and 4 are what we do here. All this PR is doing is marking whichever requests were recovered in 1. The catchup runs in parallel to the main chain event loop.
For 3, we could make it work, but that would require waiting for catchup to complete, enqueuing all events to our channel (we would need to buffer), and only then starting the regular chain stream event loop.
Hm yes. My suggestion is to not have these two things in parallel. Finish catch up first, only then start normal operation.
I think it would be better not to enqueue anything during catchup. Chances are we find responses later that cancel out requests.
Yeah, that's exactly what I had in mind.
Looking at it again, we can't do 3 until the linearization of Ethereum (with #743). Not easily, at least, since it would introduce more duplicate code than I would like, due to that chain event loop doing a lot of heavy lifting with backlog manipulations. So it would be best to keep it all in one place. Why duplicate the event loop? Because Ethereum does not send events linearly with this PR, live events can interleave into the chain loop where we also process catchup-related chain events. Keeping them from interleaving would require a separate chain event loop just for catchup. It would be a lot simpler once #743 lands, but I feel like that PR might take some time, and I'd like to get this one in first to see if it fixes some things. @jakmeier what do you think? It's pretty easy to remove all this mark-related stuff afterwards since it's just code cruft.
I see, thanks for explaining!
True, it would be easy to remove again. I don't love the idea that we need to merge things into develop to test whether they fix a given problem, but I suppose that is a more general issue. As far as I'm concerned, though, we can move ahead with this before #743 and consider cleaning up the architecture afterwards. I don't want to delay things or force you to duplicate the event loop.
Force-pushed from e86ce81 to 8632167
Alright, this should be good to go now; it just needs an approval.
The previous fix #733 didn't entirely resolve the catchup problem. This should now truly fix it: Respond events during catchup will remove events from the recovered backlog entries. The previous fix took a slice/view of the recovered entries but didn't remove entries at all when catchup completed.
We should be able to merge #697 after this one lands, with some verification. But we probably still need to bump the storage version one more time, since the backlog entries on devnet are stale.