Problem
If a signature generator times out, it takes a long time for the next posit round to succeed.
Background
When a signature generator times out, we try to generate the same signature with a new set of participants. This timeout case cannot be avoided, as we cannot guarantee that participants stay online.
A single SignTask is used across all retries. The phase can switch back from SignPhase::Generating to SignPhase::Organizing due to retries. See the loop below (mpc/chain-signatures/node/src/protocol/signature.rs, lines 1051 to 1083 at 5f2bbb9): in the last line, the phase is reassigned to whatever `advance` returns, so it can move backwards.
```rust
loop {
    // Check if we should abort due to resharing or epoch change
    if let Some(contract_state) = self.contract.state() {
        match contract_state {
            crate::protocol::ProtocolState::Resharing(_) => {
                tracing::info!(
                    ?sign_id,
                    epoch = task_epoch,
                    "signature task interrupted: contract is resharing"
                );
                return Err(SignError::Aborted);
            }
            crate::protocol::ProtocolState::Running(running)
                if running.epoch != task_epoch =>
            {
                tracing::info!(
                    ?sign_id,
                    old_epoch = task_epoch,
                    new_epoch = running.epoch,
                    "signature task interrupted: epoch changed"
                );
                return Err(SignError::Aborted);
            }
            _ => {}
        }
    }

    phase = match phase.advance(&self, &mut state, &mut task_rx).await {
        SignPhase::Complete(result) => return result,
        other => other,
    }
}
```
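The phase regression described above can be modeled with a minimal, self-contained sketch. The enum mirrors the real `SignPhase` names, but the `advance` signature and logic here are hypothetical simplifications: a generation timeout sends the task back to `Organizing` so a new participant set can be posited.

```rust
// Hypothetical, simplified model of the retry loop; not the node's code.
#[derive(Debug, PartialEq)]
enum SignPhase {
    Organizing,
    Generating,
    Complete(u64),
}

impl SignPhase {
    // Sketch of `advance`: on timeout, Generating falls back to Organizing.
    fn advance(self, timed_out: bool) -> SignPhase {
        match self {
            SignPhase::Organizing => SignPhase::Generating,
            SignPhase::Generating if timed_out => SignPhase::Organizing,
            SignPhase::Generating => SignPhase::Complete(42),
            done => done,
        }
    }
}

// Drive the loop with a scripted sequence of timeout outcomes and report
// the final phase plus how many advance steps it took.
fn run(timeouts: &[bool]) -> (SignPhase, usize) {
    let mut phase = SignPhase::Organizing;
    let mut steps = 0;
    for &timed_out in timeouts {
        phase = phase.advance(timed_out);
        steps += 1;
        if matches!(phase, SignPhase::Complete(_)) {
            break;
        }
    }
    (phase, steps)
}
```

The key property this illustrates: one timeout in the middle of the sequence costs two extra round trips through the state machine, because the task must re-organize before generating again.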
We preserve the SignState across retries but clean it up / refresh it as necessary (mpc/chain-signatures/node/src/protocol/signature.rs, lines 70 to 87 at 5f2bbb9):
```rust
struct SignState {
    round: usize,
    indexed: IndexedSignRequest,
    mesh_state: watch::Receiver<MeshState>,
    /// Budget for the current organizing+posit attempt.
    budget: TimeoutBudget,
    /// The highest round sent by a peer.
    highest_seen_round: usize,
    /// Posit messages for the `highest_seen_round` round.
    ///
    /// These are processed later, if the task reaches `highest_seen_round`
    /// as a deliberator. Proposers do not reprocess old messages; a valid
    /// peer would not send a posit message before the proposer proposes.
    ///
    /// INVARIANT: All messages stored here are for `highest_seen_round`. Must
    /// be cleared when `highest_seen_round` changes.
    buffered_messages: VecDeque<SignTaskMessage>,
}
```
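The invariant on `buffered_messages` can be sketched as a small, self-contained helper. The `PositBuffer` struct and `observe` method below are hypothetical stand-ins (as is this simplified `SignTaskMessage`), showing only the clearing discipline: everything buffered belongs to `highest_seen_round`, so a newer round wipes the buffer and stale rounds are dropped.

```rust
use std::collections::VecDeque;

// Hypothetical simplification of the real message type.
#[derive(Debug, PartialEq, Clone)]
struct SignTaskMessage {
    round: usize,
}

#[derive(Default)]
struct PositBuffer {
    highest_seen_round: usize,
    buffered_messages: VecDeque<SignTaskMessage>,
}

impl PositBuffer {
    // Buffer a peer message while upholding the invariant: all stored
    // messages are for `highest_seen_round`. A newer round clears the
    // buffer; messages for older rounds are dropped outright.
    fn observe(&mut self, msg: SignTaskMessage) {
        if msg.round > self.highest_seen_round {
            self.highest_seen_round = msg.round;
            self.buffered_messages.clear();
        }
        if msg.round == self.highest_seen_round {
            self.buffered_messages.push_back(msg);
        }
    }
}
```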
New incoming cait-sith messages are identified by the signature request id + presignature id, which avoids conflicts between retries.
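As a sketch of that keying scheme: routing inbound messages by (signature request id, presignature id) means a retry, which uses a fresh presignature, lands in a different bucket than messages from the aborted attempt. The `MsgKey` type and `route` helper below are illustrative assumptions, not the node's actual types.

```rust
use std::collections::HashMap;

// Hypothetical composite key: sign request id plus presignature id.
#[derive(Hash, PartialEq, Eq, Debug, Clone, Copy)]
struct MsgKey {
    sign_id: [u8; 32],
    presignature_id: u64,
}

// Append a raw payload to the bucket for its key; retries with a new
// presignature id never collide with an aborted attempt's messages.
fn route(inbox: &mut HashMap<MsgKey, Vec<Vec<u8>>>, key: MsgKey, payload: Vec<u8>) {
    inbox.entry(key).or_default().push(payload);
}
```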
However, there seems to be an issue that makes subsequent posit rounds fail after a generator is aborted. For example, when running test_sign_contention_5_nodes:
```
# (a new posit proposer timeout every 20s)
2026-03-05T10:05:24.559855Z WARN mpc_node::protocol::signature: proposer posit deadline reached, expiring round sign_id=SignId("0101010101010101010101010101010101010101010101010101010101010101") accepts=1 threshold=4
2026-03-05T10:05:44.663955Z WARN mpc_node::protocol::signature: proposer posit deadline reached, expiring round sign_id=SignId("0101010101010101010101010101010101010101010101010101010101010101") accepts=1 threshold=4
2026-03-05T10:06:04.666168Z WARN mpc_node::protocol::signature: proposer posit deadline reached, expiring round sign_id=SignId("0101010101010101010101010101010101010101010101010101010101010101") accepts=1 threshold=4
2026-03-05T10:06:24.667963Z WARN mpc_node::protocol::signature: proposer posit deadline reached, expiring round sign_id=SignId("0101010101010101010101010101010101010101010101010101010101010101") accepts=2 threshold=4
2026-03-05T10:06:36.718891Z WARN mpc_node::protocol::signature: proposer posit deadline reached, expiring round sign_id=SignId("0101010101010101010101010101010101010101010101010101010101010101") accepts=2 threshold=4
2026-03-05T10:07:04.670150Z WARN mpc_node::protocol::signature: proposer posit deadline reached, expiring round sign_id=SignId("0101010101010101010101010101010101010101010101010101010101010101") accepts=3 threshold=4
# (finally we have another established posit)
2026-03-05T10:07:12.762125Z INFO mpc_node::protocol::signature: proposer broadcasting Start sign_id=SignId("0101010101010101010101010101010101010101010101010101010101010101") round=7 me=Participant(3) participants=[Participant(0), Participant(2), Participant(1), Participant(3)]
```
Notice the low number of accepts and how it only slowly increases with each new attempt. This is unexpected.
Task
Identify why posits are misaligned after a generator is aborted, and resolve the issue.
We also need a test that replicates this specific case consistently. (test_sign_contention_5_nodes runs into it by accident due to other issues.)