fix: backpressure & stalled protocol generation #701
ChaoticTempest wants to merge 6 commits into develop from
Conversation
Pull request overview
This PR addresses protocol stalls caused by backpressure in internal bounded queues by making inbox fanout non-blocking and increasing buffering for posit-related subscriptions.
Changes:
- Introduces a lossy, non-blocking `Subscriber::try_send_lossy()` and tracks channel capacity inside `Subscriber`.
- Increases the posit-init subscriber channel capacity (notably to `1 << 24`) and updates `MessageInbox::{send, publish}` to avoid `.await` on per-subscriber sends.
- Updates signature posit routing to use lossy non-blocking sends and adds a regression test ensuring ready messages are not blocked by a posit backlog.
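The lossy send described above can be sketched with the standard library's bounded channel. This is a hypothetical analogue only: the PR's actual method lives on `Subscriber`, uses a tokio mpsc channel, and logs via `tracing`; the `try_send_lossy` free function below merely mirrors the drop-instead-of-block semantics.

```rust
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};

/// Hypothetical sketch: drop the message instead of waiting when the bounded
/// channel is full, so the sender is never blocked by a slow subscriber.
fn try_send_lossy<T>(tx: &SyncSender<T>, msg: T, name: &str) -> bool {
    match tx.try_send(msg) {
        Ok(()) => true,
        Err(TrySendError::Full(_)) => {
            // The real implementation emits a structured `tracing::warn!` here.
            eprintln!("dropping message: subscriber channel `{name}` is full");
            false
        }
        Err(TrySendError::Disconnected(_)) => false,
    }
}

fn main() {
    let (tx, _rx) = sync_channel::<u32>(1);
    assert!(try_send_lossy(&tx, 1, "posit")); // fits within capacity 1
    assert!(!try_send_lossy(&tx, 2, "posit")); // full: dropped, sender not blocked
}
```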
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `chain-signatures/node/src/protocol/signature.rs` | Routes signature-task posit messages via a non-blocking lossy send to avoid stalling the spawner. |
| `chain-signatures/node/src/protocol/message/sub.rs` | Extends `Subscriber` to store capacity and adds `try_send_lossy()` plus capacity-aware constructors. |
| `chain-signatures/node/src/protocol/message/mod.rs` | Uses lossy non-blocking delivery in `MessageInbox`, increases posit-init capacities, and adds a backpressure regression test. |
```diff
@@ -139,24 +143,21 @@ impl MessageInbox {
                     .triple
                     .entry(message.id)
                     .or_default()
-                    .send(message)
-                    .await;
+                    .try_send_lossy(message, "triple");
             }
             Message::Presignature(message) => {
                 let _ = self
                     .presignature
                     .entry(message.id)
                     .or_default()
-                    .send(message)
-                    .await;
+                    .try_send_lossy(message, "presignature");
             }
             Message::Signature(message) => {
                 let _ = self
                     .signature
                     .entry((message.id, message.presignature_id))
                     .or_default()
-                    .send(message)
-                    .await;
+                    .try_send_lossy(message, "signature");
             }
```
`MessageInbox::send()` now uses `try_send_lossy()` for all message variants (Generating/Resharing/Ready/Triple/Presignature/Signature as well as Posits). This changes the semantics from backpressure to potential message loss whenever any subscriber channel is full. If only posit fanout should be lossy/non-blocking, consider keeping the awaited `send()` (or a different strategy) for non-posit messages so protocol-critical messages aren't dropped under load.
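The reviewer's suggestion of a mixed strategy can be sketched with a std-library analogue. The `Delivery` enum and `deliver` function below are illustrative names, not the PR's API; they only show the two semantics side by side.

```rust
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};

/// Illustrative delivery modes (hypothetical names, not the PR's API).
enum Delivery {
    /// Protocol-critical: keep backpressure, block until there is room.
    Critical,
    /// Posit fanout: drop immediately when the channel is full.
    Lossy,
}

fn deliver<T>(tx: &SyncSender<T>, msg: T, mode: Delivery) -> bool {
    match mode {
        Delivery::Critical => tx.send(msg).is_ok(),
        Delivery::Lossy => match tx.try_send(msg) {
            Ok(()) => true,
            Err(TrySendError::Full(_) | TrySendError::Disconnected(_)) => false,
        },
    }
}

fn main() {
    let (tx, rx) = sync_channel::<u32>(1);
    assert!(deliver(&tx, 1, Delivery::Critical)); // blocks rather than drops
    assert_eq!(rx.recv().unwrap(), 1);
    assert!(deliver(&tx, 2, Delivery::Lossy));
    assert!(!deliver(&tx, 3, Delivery::Lossy)); // full: dropped
}
```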
```rust
/// This should be enough to hold a few messages in the inbox.
pub const MAX_MESSAGE_SUB_CHANNEL_SIZE: usize = 4 * 1024;
```
The doc comment above these constants says the channel size is "enough to hold a few messages in the inbox", but `MAX_MESSAGE_POSIT_SUB_CHANNEL_SIZE` is now `1 << 24`. This comment is now misleading; consider updating or adding per-constant documentation explaining why posit channels are so large (and what the units are: messages, not bytes).
```diff
-/// This should be enough to hold a few messages in the inbox.
-pub const MAX_MESSAGE_SUB_CHANNEL_SIZE: usize = 4 * 1024;
+/// Maximum size (in messages) of a regular subscription inbox; this only needs to hold
+/// a few protocol messages without immediately backpressuring senders.
+pub const MAX_MESSAGE_SUB_CHANNEL_SIZE: usize = 4 * 1024;
+/// Maximum size (in messages) of a posit subscription inbox.
+///
+/// Posit channels can accumulate a large number of small control-plane updates, so they
+/// use a much larger capacity (2^24 messages) to reduce the risk of backpressure or
+/// deadlock under high concurrency. Note: this value is a count of messages, not bytes.
```
```diff
         let (sub, rx) = match sub {
-            Self::Subscribed(_) | Self::Unknown => Self::subscribed(),
             Self::Unsubscribed(tx, rx) => (Self::Subscribed(tx), rx),
+            Self::Subscribed(_, _) | Self::Unknown => Self::subscribed(),
```
`Subscriber::subscribe()` does not preserve the previously configured capacity when the subscriber is already in the `Subscribed(..)` state: it falls back to `Self::subscribed()` (default capacity). This can silently shrink channels that were intentionally created with a larger capacity (e.g. the posit init channels) if a second subscription overrides the first. Consider matching `Self::Subscribed(_, capacity)` separately and calling `subscribed_with_capacity(capacity)` so the capacity remains stable across resubscriptions.
```diff
-            Self::Subscribed(_, _) | Self::Unknown => Self::subscribed(),
+            Self::Subscribed(_, capacity) => Self::subscribed_with_capacity(capacity),
+            Self::Unknown => Self::subscribed(),
```
```rust
Err(mpsc::error::TrySendError::Full(_)) => {
    tracing::warn!(
        subscriber = name,
        capacity = self.capacity(),
        "dropping message because subscriber channel is full"
    );
    Ok(())
```
`try_send_lossy()` logs a `warn!` for every dropped message when the channel is full. Under sustained overload this can easily become a log flood (and the logging itself can become a bottleneck). Consider rate-limiting/sampling these warnings and/or emitting a counter/metric for drops, while keeping occasional structured context (subscriber name, capacity).
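One simple way to sample the warnings while still counting every drop exactly is a shared atomic counter. The sketch below is a hypothetical illustration (`DROPPED`, `LOG_EVERY`, and `should_log_drop` are made-up names; the interval is arbitrary), not the PR's code.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Exact count of drops, suitable for exporting as a metric.
static DROPPED: AtomicU64 = AtomicU64::new(0);
// Illustrative sampling interval: emit one warning per 1024 drops.
const LOG_EVERY: u64 = 1024;

/// Returns true when this particular drop should emit a warning.
fn should_log_drop() -> bool {
    let n = DROPPED.fetch_add(1, Ordering::Relaxed);
    n % LOG_EVERY == 0
}

fn main() {
    assert!(should_log_drop()); // the first drop is always logged
    for _ in 0..1023 {
        assert!(!should_log_drop()); // drops 2..=1024 are counted but not logged
    }
    assert!(should_log_drop()); // drop 1025 (n == 1024) is logged again
}
```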
jakmeier
left a comment
Hm, I think this could be seen as a temporary solution, mostly to check if this is the actual problem. Probably we have some cyclic dependencies between channels that completely lock up the system, which should be largely resolved with your change.
A better solution would be to identify cyclic dependencies and decouple them. Then, the worst thing that can happen with a full buffer is that things become really slow. But it shouldn't lock up the system.
In general, adding really large buffers to avoid backpressure problems just moves the problem to a different place. If more messages arrive at a channel than can be handled by the consumer, the number of messages in the channel will keep increasing until something stops it. In this case, OOM or hitting the limit and messages being dropped.
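The cyclic lock-up described above can be illustrated with two bounded channels. This is a toy sketch, not the node's actual task topology: once both inboxes are full, blocking sends on both sides of the cycle would wait on each other forever, whereas the non-blocking variant surfaces the condition.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

fn main() {
    // Task A's inbox and task B's inbox, both bounded (toy capacities).
    let (a_tx, _a_rx) = sync_channel::<u8>(1);
    let (b_tx, _b_rx) = sync_channel::<u8>(1);

    // Fill both channels: neither consumer has drained anything yet.
    a_tx.try_send(0).unwrap();
    b_tx.try_send(0).unwrap();

    // If A now did a blocking `send` into B's inbox while B was blocked on a
    // `send` into A's, neither could make progress: cyclic backpressure deadlock.
    // The non-blocking variant reports `Full` instead of stalling.
    assert!(matches!(a_tx.try_send(1), Err(TrySendError::Full(1))));
    assert!(matches!(b_tx.try_send(1), Err(TrySendError::Full(1))));
}
```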
```rust
/// This should be enough to hold a few messages in the inbox.
pub const MAX_MESSAGE_SUB_CHANNEL_SIZE: usize = 4 * 1024;
pub const MAX_MESSAGE_POSIT_SUB_CHANNEL_SIZE: usize = 1 << 24;
```
So that would be ~16 M entries for every channel. I guess that is still better than unbounded. But we will run into memory issues if these grow large, and yes, we still don't have a strong guarantee against blocking.
To keep an eye on how much memory we need, can you add metrics for each channel capacity?
An estimate of the number of buffered channel messages should be readable like this:

```rust
impl Subscriber {
    pub fn estimated_queue_len(&self) -> usize {
        match self {
            Self::Subscribed(tx) | Self::Unsubscribed(tx, _) => tx.max_capacity() - tx.capacity(),
            Self::Unknown => 0,
        }
    }
}
```

Then we can have a gauge vector tracking how many messages are queued up per channel.
That way, if we run into memory problems, we will see which channel is to blame. And even if it doesn't run into a full OOM crash, we can observe channel queue lengths and find out which tasks exactly are being stalled.
yep, I'll add in metrics
One of the things this PR misses is why we start cascading into these full channels in the first place. I was talking with @volovyks the other day about this, but it seems like since sync now only utilizes … We could add back the …
It seems like it is time to add a limit on the number of concurrent requests we can process. Others should be added to the backlog and processed later. This PR should fix /sync (increased timeouts). We shouldn't generate anything when the node is out of sync; that is the point.
Added metrics now for channels. A channel name (i.e. triple_posit, triple) and a channel id (i.e. "singleton" or a triple id) are used to identify the specific channel. The metrics for non-singleton channels will be removed once they complete (that is not to say they stop appearing in Grafana, but that the entries in our node will no longer report them).
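The (channel name, channel id) labeling and the removal-on-completion behavior can be sketched with a plain map. This is a stdlib stand-in, not the PR's code: a real deployment would export these values as a Prometheus gauge vector, and the struct and method names here are illustrative.

```rust
use std::collections::HashMap;

/// Minimal sketch of a per-channel queue-length gauge registry,
/// keyed by (channel name, channel id), e.g. ("triple_posit", "42").
#[derive(Default)]
struct QueueGauges {
    /// (channel name, channel id) -> last observed queue length.
    by_channel: HashMap<(String, String), usize>,
}

impl QueueGauges {
    fn record(&mut self, name: &str, id: &str, queued: usize) {
        self.by_channel.insert((name.to_owned(), id.to_owned()), queued);
    }

    /// Drop the entry once a non-singleton channel (e.g. a finished triple) completes.
    fn remove(&mut self, name: &str, id: &str) {
        self.by_channel.remove(&(name.to_owned(), id.to_owned()));
    }
}

fn main() {
    let mut gauges = QueueGauges::default();
    gauges.record("triple_posit", "42", 7);
    gauges.record("triple", "singleton", 3);
    gauges.remove("triple_posit", "42"); // protocol for id 42 completed
    assert_eq!(gauges.by_channel.len(), 1);
}
```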
volovyks
left a comment
Can we move the metrics to a different PR? It would be great to see what the situation is now and how lossy fix or any other change affects it.
Also, I want to confirm we are not breaking the cardinality rule in metrics.
jakmeier
left a comment
LGTM
I would also prefer merging just the metrics in a first PR, as @volovyks asked. But if untangling it is too much work, personally I am okay with merging it together.
One other note: You report queue lengths when sending but not when removing an element. Metrics might look a bit strange, staying at a high value even when the queue becomes empty, only updating once the next message is sent. I don't have a solution. But because of this quirk I think in dashboards we should label it as max queue size to avoid confusion.
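One way to avoid the send-only quirk described above is to update the counter on both sides of the channel. The wrapper types below are a hypothetical stdlib sketch (not the PR's code); a real version would feed the counter into the per-channel gauge so the metric falls when the queue drains, not only when the next message is sent.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};
use std::sync::Arc;

struct CountedSender<T> {
    tx: SyncSender<T>,
    len: Arc<AtomicUsize>,
}

struct CountedReceiver<T> {
    rx: Receiver<T>,
    len: Arc<AtomicUsize>,
}

impl<T> CountedSender<T> {
    fn try_send(&self, msg: T) -> Result<(), TrySendError<T>> {
        self.tx.try_send(msg).map(|()| {
            self.len.fetch_add(1, Ordering::Relaxed);
        })
    }
}

impl<T> CountedReceiver<T> {
    fn recv(&self) -> Option<T> {
        let msg = self.rx.recv().ok()?;
        self.len.fetch_sub(1, Ordering::Relaxed);
        Some(msg)
    }
}

fn counted_channel<T>(cap: usize) -> (CountedSender<T>, CountedReceiver<T>) {
    let (tx, rx) = sync_channel(cap);
    let len = Arc::new(AtomicUsize::new(0));
    (CountedSender { tx, len: Arc::clone(&len) }, CountedReceiver { rx, len })
}

fn main() {
    let (tx, rx) = counted_channel::<u8>(4);
    tx.try_send(1).unwrap();
    tx.try_send(2).unwrap();
    assert_eq!(tx.len.load(Ordering::Relaxed), 2); // gauge after sends
    assert_eq!(rx.recv(), Some(1));
    assert_eq!(rx.len.load(Ordering::Relaxed), 1); // gauge falls on receive too
}
```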
This should fix the issue with protocols getting stalled due to backpressure from queues.
This bumps the default size for posits to 2^24. This is a lot, so if we ever get to it, we would have other problems at that point.
`MessageInbox::{send, publish}` is no longer blocking, so this alleviates the issue where sending one message could block another from being sent. Internally this uses `try_send_lossy`, which logs when the channel is full and the message is being dropped.