Subsequent signature posit rounds are slow #689

@jakmeier


Problem

If a signature generator times out, it takes a long time for the next posit round to succeed.

Background

When a signature generator times out, we try to generate the same signature with a new set of participants. This timeout case cannot be avoided, as we cannot guarantee that participants stay online.

A single SignTask is used through all retries. The phase can switch from SignPhase::Generating back to SignPhase::Organizing due to retries. See the loop below: in the final statement, the phase is set to whatever phase.advance returns, which can be any phase.

    loop {
        // Check if we should abort due to resharing or epoch change
        if let Some(contract_state) = self.contract.state() {
            match contract_state {
                crate::protocol::ProtocolState::Resharing(_) => {
                    tracing::info!(
                        ?sign_id,
                        epoch = task_epoch,
                        "signature task interrupted: contract is resharing"
                    );
                    return Err(SignError::Aborted);
                }
                crate::protocol::ProtocolState::Running(running)
                    if running.epoch != task_epoch =>
                {
                    tracing::info!(
                        ?sign_id,
                        old_epoch = task_epoch,
                        new_epoch = running.epoch,
                        "signature task interrupted: epoch changed"
                    );
                    return Err(SignError::Aborted);
                }
                _ => {}
            }
        }
        phase = match phase.advance(&self, &mut state, &mut task_rx).await {
            SignPhase::Complete(result) => return result,
            other => other,
        }
    }
}
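
To make the retry transition concrete, here is a minimal, synchronous sketch of the same loop shape. It is not the node's code: `Phase`, `Event`, and `run` are hypothetical stand-ins, and the rule that a generator timeout bumps the round by one is an assumption made for illustration only.

```rust
/// Simplified stand-in for `SignPhase`. The payloads are hypothetical.
#[derive(Debug, Clone, PartialEq)]
enum Phase {
    Organizing { round: usize },
    Generating { round: usize },
    Complete(Result<&'static str, &'static str>),
}

/// What happened while the current phase was running (hypothetical).
#[derive(Debug, Clone, Copy)]
enum Event {
    PositEstablished,
    GeneratorTimedOut,
    SignatureReady,
}

impl Phase {
    /// Consumes the current phase and returns the next one.
    fn advance(self, event: Event) -> Phase {
        match (self, event) {
            (Phase::Organizing { round }, Event::PositEstablished) => {
                Phase::Generating { round }
            }
            // The retry path: a timed-out generator sends the task back to
            // organizing. ASSUMPTION: the round is incremented by one here.
            (Phase::Generating { round }, Event::GeneratorTimedOut) => {
                Phase::Organizing { round: round + 1 }
            }
            (Phase::Generating { .. }, Event::SignatureReady) => {
                Phase::Complete(Ok("signature"))
            }
            // Any other combination leaves the phase unchanged.
            (other, _) => other,
        }
    }
}

/// Drives one task through a fixed list of events, mirroring the loop above:
/// return on `Complete`, otherwise keep whatever phase `advance` produced.
fn run(events: &[Event]) -> Phase {
    let mut phase = Phase::Organizing { round: 0 };
    for &event in events {
        phase = match phase.advance(event) {
            done @ Phase::Complete(_) => return done,
            other => other,
        };
    }
    phase
}
```

Feeding `run` the events `PositEstablished, GeneratorTimedOut` leaves the task in `Organizing` again, which is the situation this issue is about.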

We preserve the SignState across retries but clean it up / refresh it as necessary.

struct SignState {
    round: usize,
    indexed: IndexedSignRequest,
    mesh_state: watch::Receiver<MeshState>,
    /// Budget for the current organizing+posit attempt.
    budget: TimeoutBudget,
    /// The highest round sent by a peer
    highest_seen_round: usize,
    /// Posit message for `highest_seen_round` round.
    ///
    /// These are later processed, if the task reaches the `highest_seen_round`
    /// as a deliberator. Proposers do not reprocess old messages. A valid peer
    /// would not have sent a posit message before the proposer proposes.
    ///
    /// INVARIANT: All messages stored here are for `highest_seen_round`. Must
    /// be cleared when `highest_seen_round` changes.
    buffered_messages: VecDeque<SignTaskMessage>,
}
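
The invariant on `buffered_messages` can be sketched in isolation as follows. This is only an illustration under stated assumptions: `Msg`, `RoundBuffer`, `observe`, and `take_for_round` are hypothetical names, and dropping messages for rounds below `highest_seen_round` is an assumed policy, not something the issue confirms.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for `SignTaskMessage`, reduced to round and sender.
#[derive(Debug, Clone, PartialEq)]
struct Msg {
    round: usize,
    from: u32,
}

/// Only the two fields of `SignState` that the invariant talks about.
#[derive(Debug, Default)]
struct RoundBuffer {
    highest_seen_round: usize,
    buffered_messages: VecDeque<Msg>,
}

impl RoundBuffer {
    /// Records an incoming posit message while keeping the invariant:
    /// everything in the buffer belongs to `highest_seen_round`.
    fn observe(&mut self, msg: Msg) {
        if msg.round > self.highest_seen_round {
            // A newer round was seen: older buffered messages are stale.
            self.highest_seen_round = msg.round;
            self.buffered_messages.clear();
        }
        if msg.round == self.highest_seen_round {
            self.buffered_messages.push_back(msg);
        }
        // ASSUMPTION: messages for lower rounds are simply dropped.
    }

    /// Hands out the buffered messages once the task reaches `round`.
    /// Returns nothing if the buffer is for a different round.
    fn take_for_round(&mut self, round: usize) -> Vec<Msg> {
        if round == self.highest_seen_round {
            self.buffered_messages.drain(..).collect()
        } else {
            Vec::new()
        }
    }
}
```

A small model like this may be a convenient place to start when writing a test that replays a specific ordering of posit messages around an aborted generator.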

New incoming cait-sith messages are identified by the signature request id + presignature id, which avoids conflicts between retries.
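
As a rough illustration of why this keying separates retries, the sketch below routes payloads by a (signature request id, presignature id) pair. All names and types here (`Router`, `SignId` as a 32-byte array, `PresignatureId` as `u64`) are hypothetical, and it assumes that each retry uses a different presignature.

```rust
use std::collections::HashMap;

// Hypothetical id types, chosen only to make the example self-contained.
type SignId = [u8; 32];
type PresignatureId = u64;

/// Routes protocol payloads to the generator registered under the composite key.
#[derive(Default)]
struct Router {
    inboxes: HashMap<(SignId, PresignatureId), Vec<Vec<u8>>>,
}

impl Router {
    /// Registers a generator for one attempt at signing `sign_id`.
    fn register(&mut self, sign_id: SignId, presignature_id: PresignatureId) {
        self.inboxes.entry((sign_id, presignature_id)).or_default();
    }

    /// Removes the generator for an aborted attempt.
    fn abort(&mut self, sign_id: SignId, presignature_id: PresignatureId) {
        self.inboxes.remove(&(sign_id, presignature_id));
    }

    /// Delivers a payload. Returns false if no generator is registered under
    /// that exact key, e.g. a late message for an aborted attempt.
    fn deliver(
        &mut self,
        sign_id: SignId,
        presignature_id: PresignatureId,
        payload: Vec<u8>,
    ) -> bool {
        match self.inboxes.get_mut(&(sign_id, presignature_id)) {
            Some(inbox) => {
                inbox.push(payload);
                true
            }
            None => false,
        }
    }
}
```

With this shape, a late message from the aborted attempt cannot reach the retry's generator, because the retry is registered under a different presignature id even though the signature request id is unchanged.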

However, there seems to be an issue that makes the next posit rounds fail after a generator is aborted. For example, running test_sign_contention_5_nodes produces the following log:

# (a new posit proposer timeout every 20s)
2026-03-05T10:05:24.559855Z WARN mpc_node::protocol::signature: proposer posit deadline reached, expiring round sign_id=SignId("0101010101010101010101010101010101010101010101010101010101010101") accepts=1 threshold=4
2026-03-05T10:05:44.663955Z WARN mpc_node::protocol::signature: proposer posit deadline reached, expiring round sign_id=SignId("0101010101010101010101010101010101010101010101010101010101010101") accepts=1 threshold=4
2026-03-05T10:06:04.666168Z WARN mpc_node::protocol::signature: proposer posit deadline reached, expiring round sign_id=SignId("0101010101010101010101010101010101010101010101010101010101010101") accepts=1 threshold=4
2026-03-05T10:06:24.667963Z WARN mpc_node::protocol::signature: proposer posit deadline reached, expiring round sign_id=SignId("0101010101010101010101010101010101010101010101010101010101010101") accepts=2 threshold=4
2026-03-05T10:06:36.718891Z WARN mpc_node::protocol::signature: proposer posit deadline reached, expiring round sign_id=SignId("0101010101010101010101010101010101010101010101010101010101010101") accepts=2 threshold=4
2026-03-05T10:07:04.670150Z WARN mpc_node::protocol::signature: proposer posit deadline reached, expiring round sign_id=SignId("0101010101010101010101010101010101010101010101010101010101010101") accepts=3 threshold=4

# (finally we have another established posit)
2026-03-05T10:07:12.762125Z INFO mpc_node::protocol::signature: proposer broadcasting Start sign_id=SignId("0101010101010101010101010101010101010101010101010101010101010101") round=7 me=Participant(3) participants=[Participant(0), Participant(2), Participant(1), Participant(3)]

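Notice the low number of accepts and how it only slowly increases with each new attempt. In this run, about 108 seconds pass between the first logged timeout and the next Start broadcast. This is unexpected.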

Task

Identify why posits are misaligned after a generator is aborted and resolve the issue.

We also need a test that replicates this specific case consistently. (test_sign_contention_5_nodes runs into it by accident due to other issues.)
