
Bidirectional Sync (Part 2) #671

Merged
volovyks merged 23 commits into develop from serhii/sync-to-active-fix on Mar 12, 2026
Conversation

@volovyks (Contributor) commented Feb 26, 2026

  • Added bidirectional state sync. Now, when node A calls node B, B returns a list of Ids that were not found in node B's storage. Node A then removes node B from the holders of these not_found artifacts.
  • We now distinguish participants from holders. Participants are those who took part in the generation, and holders are those who still have the shares. The list of participants is not used anywhere, but I decided to keep it; it can be useful for debugging, etc.
  • Holders are not part of the serialized artifact, for efficiency.
  • fetch_owned now returns a Result, so we can get an empty list of owned artifacts and send it to other nodes.
  • While processing the sync response, node A prunes artifacts whose number of holders is < T. If any artifacts were pruned, we may want to run the sync again to remove them from the remaining holders' lists, but I avoided that complexity for now.
  • The sync process is no longer considered complete if any of its steps fail.
  • I've removed artifact reinsertion and added a check on the number of active participants when stockpiling presignatures, to prevent wasting them (such a check already existed for triples).

We need to decide whether we want to include generating/reserved/used in the state sync. Details: #671 (comment) Can be addressed separately.
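The not_found handling and the pruning rule described above can be sketched roughly as follows. This is illustrative only: the `Artifact` struct, the `apply_not_found` name, and the flat `Vec` storage are assumptions for the sketch, not the actual API.

```rust
use std::collections::HashSet;

type Participant = u32;

struct Artifact {
    id: u64,
    holders: HashSet<Participant>,
}

/// Remove `peer` from the holders of every artifact it reported as
/// not found, then prune artifacts whose holder count is below the
/// threshold `t`. Returns the ids of the pruned artifacts.
fn apply_not_found(
    artifacts: &mut Vec<Artifact>,
    peer: Participant,
    not_found: &HashSet<u64>,
    t: usize,
) -> Vec<u64> {
    let mut pruned = Vec::new();
    artifacts.retain_mut(|a| {
        if not_found.contains(&a.id) {
            a.holders.remove(&peer);
        }
        if a.holders.len() < t {
            pruned.push(a.id);
            false // drop from local storage
        } else {
            true
        }
    });
    pruned
}
```

As the PR description notes, pruned ids are not re-propagated to the remaining holders in this version; that would require another sync round.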

.insert(dummy_pair(id), node)
.await;
#[test_log::test(tokio::test)]
async fn test_state_sync_e2e() {
Contributor Author:

For now, I'm using a simple integration test for state sync. I may work on the component-layer implementation later.

volovyks marked this pull request as ready for review March 9, 2026 17:43
@volovyks (Contributor, Author) commented Mar 9, 2026

After I removed the union of owned and reserved in State Sync, the tests pass successfully (except for cases::mpc::test_sign_contention_5_nodes).
That is expected, since reserved includes Ids that are not owned by this node.

Overall, the purpose of used, reserved, and ArtifactSlot is not fully clear to me. We should at least fully document them.

Each T or P goes through a "generating" -> "stored" -> "used" lifecycle, and reserved or any other addition brings complexity that may not be required.
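As a sketch of that lifecycle (illustrative only, not the actual types in the codebase), each artifact would move forward through exactly these three states:

```rust
/// Hypothetical model of the artifact lifecycle described above.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ArtifactState {
    Generating,
    Stored,
    Used,
}

impl ArtifactState {
    /// Advance one step along the lifecycle; `Used` is terminal.
    fn advance(self) -> ArtifactState {
        match self {
            ArtifactState::Generating => ArtifactState::Stored,
            ArtifactState::Stored => ArtifactState::Used,
            ArtifactState::Used => ArtifactState::Used,
        }
    }
}
```

In this framing, reserved would be an extra bookkeeping set layered on top of the `Generating` state rather than a state of its own.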

const SYNC_RESPONSE_TIMEOUT: Duration = Duration::from_secs(5);

/// Timeout for the entire broadcast operation (waiting for all peers to respond)
const BROADCAST_TIMEOUT: Duration = Duration::from_secs(10);
Contributor:

10s feels a bit short; sync operations can require many DB reads for a large state.
But we can try it and increase it if we run into the limit.

Comment on lines +60 to +63
/// Original protocol participants
pub participants: Vec<Participant>,
/// Nodes still holding their share of the artifact
pub holders: Option<Vec<Participant>>,
Contributor:

Why do we need to track these separately? Could we not just remove non-holders from the participants list?

Contributor Author:

We absolutely can. But I wanted to distinguish holders from participants, mostly for debugging and storage analysis. Also, holders are not part of the serialized object, to avoid deserializing/serializing each artifact.

Contributor:

Okay, it just seemed like a lot of extra complexity with the separate tracking in Redis. But if you see enough value in having it, I have no problem with it.

@jakmeier (Contributor) commented:

> After I removed the union of owned and reserved in State Sync, the tests are passing successfully (except for cases::mpc::test_sign_contention_5_nodes). That is expected, since reserved includes Ids that are not owned by this node.
>
> Overall, the purpose of used, reserved, and ArtifactSlot is not fully clear to me. We need to at least fully document it.
>
> Each T or P is going through "generating" -> "stored" -> "used" lifecycle, and reserved or any other additions adds complexity that may not be required.

IIRC, reserved is to track ids that are in the generating state. I suggest renaming or removing it if we can track "generating" in other ways.

@volovyks (Contributor, Author) commented:

Here is the implementation of reserve():

pub async fn reserve(&self, id: A::Id) -> Option<ArtifactSlot<A>> {
        let used = self.used.read().await;
        if used.contains(&id) {
            return None;
        }
        if !self.reserved.write().await.insert(id) {
            return None;
        }
        drop(used);

        let start = Instant::now();
        let Some(mut conn) = self.connect().await else {
            self.reserved.write().await.remove(&id);
            return None;
        };

        // Check directly whether the artifact is already stored in Redis.
        let artifact_exists: Result<bool, _> = conn.hexists(&self.artifact_key, id).await;
        let elapsed = start.elapsed();
        crate::metrics::storage::REDIS_LATENCY
            .with_label_values(&[A::METRIC_LABEL, "reserve"])
            .observe(elapsed.as_millis() as f64);

        match artifact_exists {
            Ok(true) => {
                // artifact already stored, reserve cannot be done, remove reservation
                self.reserved.write().await.remove(&id);
                None
            }
            // artifact does not exist, reservation successful
            Ok(false) => Some(ArtifactSlot {
                id,
                storage: self.clone(),
                stored: false,
            }),
            Err(err) => {
                self.reserved.write().await.remove(&id);
                tracing::warn!(id, ?err, ?elapsed, "failed to reserve artifact");
                None
            }
        }
    }

I'm afraid it is much more complicated than just "generating". I'm looking into it now, but I want to address it separately.

volovyks requested a review from jakmeier March 12, 2026 15:00
})?;

-        owned.union(&*self.reserved.read().await).copied().collect()
+        Ok(owned.into_iter().collect())
Contributor:

Hm, so if we move ahead with this change, we open the door for the race condition @ChaoticTempest described here: #649 (comment)

But I guess since all tests pass, it is not too common. I say we can merge this as-is and address the race condition in a follow-up PR.

Contributor Author:

Yes, reserved does not represent "owned by me now". For triples that are still generating, that is not even possible. I'm looking into that now; I'm not sure it is a real concern.

@jakmeier (Contributor) commented:

> I'm afraid it is much more complicated than just "generating". I'm looking into it now, but I want to address it separately.

Yes, it manages exclusive access to a Redis entry. We should probably keep that as it is. Even if we can simplify it, I wouldn't do that together with these changes.

However, with respect to state sync, what we have marked as "reserved" can be treated the same as "Generating" or "Available". (See this table: https://github.com/sig-net/mpc/blob/develop/doc/mpc_node_specification.md#non-owner-action-on-state-sync)

So, the existing union kind of makes sense. But the problem is (as you pointed out here) that it also includes non-owned entries.

For Ts, until generation is done, we simply don't know who will be the owner. Maybe instead of a union, we should sync reserved ids in a separate list. The peer will then know not to delete local Ts but otherwise can ignore the list of reserved ids.
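That suggestion could look roughly like this. The `SyncRequest` shape and the `should_keep` helper are hypothetical, just to show the intent of sending reserved ids in their own list:

```rust
use std::collections::HashSet;

/// Hypothetical sync payload: owned ids and still-generating (reserved)
/// ids travel in separate lists instead of one union.
struct SyncRequest {
    owned: HashSet<u64>,
    reserved: HashSet<u64>,
}

/// A peer keeps a local artifact if the sender owns it or is still
/// generating it; beyond that, the reserved list is ignored.
fn should_keep(req: &SyncRequest, local_id: u64) -> bool {
    req.owned.contains(&local_id) || req.reserved.contains(&local_id)
}
```

This keeps "don't delete in-flight Ts" separate from ownership claims, so the non-owned-entries problem with the union goes away.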

volovyks merged commit 94a22a7 into develop Mar 12, 2026
3 of 4 checks passed
volovyks deleted the serhii/sync-to-active-fix branch March 12, 2026 16:08