Snapshot safety checks#1938
Conversation
- Add 'force' parameter to request_snapshot_sync to allow divergence recovery paths that have existing state (bug 1) - Replace collect_context_state_keys().is_empty() with efficient has_context_state_keys() that exits early on first match (bug 2) - Check both state keys AND context metadata (root_hash) to properly detect initialized contexts, even after deletes (bug 3) The safety check now properly allows: - Fresh nodes (force=false): enforces no existing state - Crash recovery: bypasses check when sync-in-progress marker exists - Divergence recovery (force=true): explicitly allows existing state
|
Cursor Agent can help with this pull request. Just |
|
Your PR title does not adhere to the Conventional Commits convention: Common errors to avoid:
|
There was a problem hiding this comment.
🤖 AI Code Reviewer
Reviewed by 1 agents | Quality score: 33% | Review time: 154.0s
💡 2 suggestions, 📝 2 nitpicks. See inline comments.
🤖 Generated by AI Code Reviewer | Review ID: review-90ee5624
| /// * `force` - If true, skip the safety check (for divergence recovery). | ||
| /// If false, enforce that the node is fresh (for bootstrap). | ||
| pub async fn request_snapshot_sync( | ||
| &self, |
There was a problem hiding this comment.
💡 Critical safety logic lacks unit tests
The force parameter controls whether safety checks are bypassed; this is critical for invariant I5 but the PR mentions no tests were added beyond cargo check/clippy.
Suggested fix:
Add unit tests covering: fresh node with force=false (pass), initialized node with force=false (fail), initialized node with force=true (pass), and the root_hash-only initialization case.
| @@ -710,6 +731,26 @@ fn decode_snapshot_records(payload: &[u8]) -> Result<Vec<([u8; 32], Vec<u8>)>> { | |||
| Ok(records) | |||
There was a problem hiding this comment.
💡 has_context_state_keys still scans all keys for fresh contexts
For contexts with no state keys (fresh nodes), this iterates through all keys in the table; the optimization only helps when matching keys exist and appear early in iteration order.
Suggested fix:
If the storage layer supports prefix/seek iteration based on context_id, use that to avoid full scans; otherwise document the limitation.
| /// | ||
| /// * `context_id` - The context to sync | ||
| /// * `peer_id` - The peer to sync from | ||
| /// * `force` - If true, skip the safety check (for divergence recovery). |
There was a problem hiding this comment.
📝 Nit: Doc comment could clarify force=true implications
The doc says 'skip the safety check' but could clarify that this allows overwriting existing state, which is destructive.
Suggested fix:
Expand to: 'If true, skip the safety check and allow overwriting existing state (for divergence recovery only).'
| calimero_node_primitives::sync::check_snapshot_safety(has_state) | ||
| let has_state_keys = has_context_state_keys(&handle, context_id)?; | ||
|
|
||
| let has_initialized_metadata = self |
There was a problem hiding this comment.
📝 Nit: Consider extracting magic zero-hash constant
The zero root hash [0u8; 32] represents uninitialized state; a named constant would improve readability and consistency.
Suggested fix:
Define `const UNINITIALIZED_ROOT_HASH: [u8; 32] = [0u8; 32];` or similar, or reference an existing constant if one exists.
|
Changes incorporated into #1933 |
calimero-node: Snapshot Sync Safety and Performance Fixes
Description
This PR addresses three critical bugs in the snapshot synchronization logic, improving both correctness and performance.
The changes introduce a
forceparameter torequest_snapshot_syncto explicitly differentiate between initial node bootstrap (requiring a fresh node) and divergence recovery (allowing overwrite on initialized nodes). The safety check for node freshness has been enhanced to consider both existing state keys and context metadata (non-zeroroot_hash), preventing accidental state overwrites. Additionally, the state existence check is optimized to avoid full table scans, improving performance.This fixes the following issues:
forceparameter allows divergence recovery to bypass the freshness check, restoring the automatic reconciliation path.has_context_state_keysfunction efficiently checks for state existence without collecting all keys, reducing startup latency and memory pressure.root_hash != [0u8; 32]), ensuring that contexts with existing history are correctly identified as initialized, upholding invariant I5.Test plan
The changes were verified by running
cargo check,cargo clippy, andcargo fmt.To reproduce the fixed behaviors, scenarios involving:
force=false).force=false).force=true).root_hashattempting bootstrap (should fail withforce=false).No user-interface changes are included.
Documentation update
Internal documentation for
request_snapshot_syncshould be updated to reflect the newforceparameter and its implications for the safety checks. Any design documents referencing invariant I5 regarding snapshot safety should also be reviewed for clarification.