Snapshot safety checks #1938

Closed
cursor[bot] wants to merge 1 commit into feat/sync-011-snapshot-sync from cursor/snapshot-safety-checks-375b

Conversation

@cursor
Contributor

@cursor cursor bot commented Feb 11, 2026

calimero-node: Snapshot Sync Safety and Performance Fixes

Description

This PR addresses three critical bugs in the snapshot synchronization logic, improving both correctness and performance.

The changes introduce a force parameter to request_snapshot_sync to explicitly differentiate between initial node bootstrap (requiring a fresh node) and divergence recovery (allowing overwrite on initialized nodes). The safety check for node freshness has been enhanced to consider both existing state keys and context metadata (non-zero root_hash), preventing accidental state overwrites. Additionally, the state existence check is optimized to avoid full table scans, improving performance.

This fixes the following issues:

  • Divergence recovery path now always rejected (bug_id: 54b87d22-3129-4284-9646-ff71df91231d): The force parameter allows divergence recovery to bypass the freshness check, restoring the automatic reconciliation path.
  • Safety check does full state-table scan (bug_id: a96f13c2-6907-4703-a015-5898499bfe88): A new has_context_state_keys function efficiently checks for state existence without collecting all keys, reducing startup latency and memory pressure.
  • Freshness check ignores context initialization metadata (bug_id: ref2_8d0d6323-589b-4c43-9796-3368eb02fdca): The freshness check now includes context metadata (root_hash != [0u8; 32]), ensuring that contexts with existing history are correctly identified as initialized, upholding invariant I5.

Test plan

The changes were verified by running cargo check, cargo clippy, and cargo fmt.
The fixed behaviors can be reproduced with the following scenarios:

  1. A fresh node attempting snapshot sync (should succeed with force=false).
  2. An initialized node attempting snapshot sync for bootstrap (should fail with force=false).
  3. A node with root hash divergence attempting recovery via snapshot sync (should succeed with force=true).
  4. A node with no state keys but a non-zero root_hash attempting bootstrap (should fail with force=false).

No user-interface changes are included.
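
The four scenarios above can be sketched as a pure decision function with unit tests. This is a minimal sketch, not the code in this PR: `snapshot_allowed`, its signature, and `UNINITIALIZED_ROOT_HASH` are hypothetical names chosen for illustration, and the crash-recovery marker path is left out.

```rust
/// Hypothetical constant: the all-zero root hash of an uninitialized context.
const UNINITIALIZED_ROOT_HASH: [u8; 32] = [0u8; 32];

/// Hypothetical helper (not the PR's API): returns true when a snapshot sync
/// may proceed. `force` bypasses the freshness check; otherwise the context
/// must have no state keys and an all-zero root hash.
fn snapshot_allowed(force: bool, has_state_keys: bool, root_hash: &[u8; 32]) -> bool {
    let has_initialized_metadata = root_hash != &UNINITIALIZED_ROOT_HASH;
    force || !(has_state_keys || has_initialized_metadata)
}

#[cfg(test)]
mod tests {
    use super::*;

    // Arbitrary non-zero hash standing in for an initialized context.
    const SOME_HASH: [u8; 32] = [7u8; 32];

    #[test]
    fn fresh_node_bootstrap_succeeds() {
        assert!(snapshot_allowed(false, false, &UNINITIALIZED_ROOT_HASH));
    }

    #[test]
    fn initialized_node_bootstrap_fails() {
        assert!(!snapshot_allowed(false, true, &SOME_HASH));
    }

    #[test]
    fn divergence_recovery_with_force_succeeds() {
        assert!(snapshot_allowed(true, true, &SOME_HASH));
    }

    #[test]
    fn root_hash_only_initialization_fails() {
        assert!(!snapshot_allowed(false, false, &SOME_HASH));
    }
}
```

Keeping the decision free of storage access is one way to make each scenario testable without a running node; whether that split fits the real `request_snapshot_sync` is an open question.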

Documentation update

Internal documentation for request_snapshot_sync should be updated to reflect the new force parameter and its implications for the safety checks. Any design documents referencing invariant I5 regarding snapshot safety should also be reviewed for clarification.


- Add 'force' parameter to request_snapshot_sync to allow divergence
  recovery paths that have existing state (bug 1)
- Replace collect_context_state_keys().is_empty() with efficient
  has_context_state_keys() that exits early on first match (bug 2)
- Check both state keys AND context metadata (root_hash) to properly
  detect initialized contexts, even after deletes (bug 3)

The safety check now properly allows:
- Fresh nodes (force=false): enforces no existing state
- Crash recovery: bypasses check when sync-in-progress marker exists
- Divergence recovery (force=true): explicitly allows existing state

@github-actions

Your PR title does not adhere to the Conventional Commits convention:

<type>(<scope>): <subject>

Common errors to avoid:

  1. The title must be in lower case.
  2. Allowed type values are: build, ci, docs, feat, fix, perf, refactor, test.

@meroreviewer meroreviewer bot left a comment

🤖 AI Code Reviewer

Reviewed by 1 agent | Quality score: 33% | Review time: 154.0s

💡 2 suggestions, 📝 2 nitpicks. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-90ee5624

/// * `force` - If true, skip the safety check (for divergence recovery).
/// If false, enforce that the node is fresh (for bootstrap).
pub async fn request_snapshot_sync(
&self,

💡 Critical safety logic lacks unit tests

The force parameter controls whether safety checks are bypassed; this is critical for invariant I5 but the PR mentions no tests were added beyond cargo check/clippy.

Suggested fix:

Add unit tests covering: fresh node with force=false (pass), initialized node with force=false (fail), initialized node with force=true (pass), and the root_hash-only initialization case.

@@ -710,6 +731,26 @@ fn decode_snapshot_records(payload: &[u8]) -> Result<Vec<([u8; 32], Vec<u8>)>> {
Ok(records)

💡 has_context_state_keys still scans all keys for fresh contexts

For contexts with no state keys (fresh nodes), this iterates through all keys in the table; the optimization only helps when matching keys exist and appear early in iteration order.

Suggested fix:

If the storage layer supports prefix/seek iteration based on context_id, use that to avoid full scans; otherwise document the limitation.
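
A seek-based check might look like the sketch below. It assumes that keys sort by a 32-byte `context_id` prefix, and it uses a standard-library `BTreeMap` as a stand-in for the storage layer. The real storage API is not shown in this PR, so `StateKey`, `make_key`, and `has_keys_with_prefix` are hypothetical names.

```rust
use std::collections::BTreeMap;

/// Hypothetical composite key: 32-byte context id followed by a 32-byte state key.
type StateKey = [u8; 64];

/// Builds a composite key from a context id and a state key.
fn make_key(context_id: &[u8; 32], state_key: &[u8; 32]) -> StateKey {
    let mut key = [0u8; 64];
    key[..32].copy_from_slice(context_id);
    key[32..].copy_from_slice(state_key);
    key
}

/// Seeks to the smallest possible key for `context_id` and inspects only the
/// first entry at or after it. A fresh context costs one seek, not a full scan.
fn has_keys_with_prefix(table: &BTreeMap<StateKey, Vec<u8>>, context_id: &[u8; 32]) -> bool {
    let start = make_key(context_id, &[0u8; 32]);
    table
        .range(start..)
        .next()
        .map_or(false, |(key, _)| key[..32] == context_id[..])
}
```

With an ordered store, the first entry at or after the prefix either belongs to the context or proves that no such key exists, so the cost no longer depends on how many other contexts share the table.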

///
/// * `context_id` - The context to sync
/// * `peer_id` - The peer to sync from
/// * `force` - If true, skip the safety check (for divergence recovery).

📝 Nit: Doc comment could clarify force=true implications

The doc says 'skip the safety check' but could clarify that this allows overwriting existing state, which is destructive.

Suggested fix:

Expand to: 'If true, skip the safety check and allow overwriting existing state (for divergence recovery only).'

calimero_node_primitives::sync::check_snapshot_safety(has_state)
let has_state_keys = has_context_state_keys(&handle, context_id)?;

let has_initialized_metadata = self

📝 Nit: Consider extracting magic zero-hash constant

The zero root hash [0u8; 32] represents uninitialized state; a named constant would improve readability and consistency.

Suggested fix:

Define `const UNINITIALIZED_ROOT_HASH: [u8; 32] = [0u8; 32];` or similar, or reference an existing constant if one exists.

@xilosada
Member

Changes incorporated into #1933

@xilosada xilosada closed this Feb 11, 2026