feat(sync): 011 - Snapshot Sync (Fresh Nodes Only)#1933
Conversation
Add snapshot sync types for fresh node bootstrap (CIP §6). Types Added: - SnapshotRequest: Initiate snapshot transfer (compressed/uncompressed) - SnapshotEntity: Single entity with metadata and collection info - SnapshotEntityPage: Paginated entities for transfer - SnapshotComplete: Completion marker with root hash and DAG heads - SnapshotVerifyResult: Verification result enum Enhanced SnapshotError with new variants: - SnapshotOnInitializedNode (Invariant I5) - RootHashMismatch (Invariant I7) - TransferInterrupted - DecompressionFailed Security: - DoS protection constants (MAX_ENTITIES_PER_PAGE, MAX_SNAPSHOT_PAGES, etc.) - Validation via is_valid() methods on all types - check_snapshot_safety() function for I5 enforcement Tests: - 54 comprehensive tests covering all types - Boundary condition tests (at-limit, over-limit) - Security/exploit prevention tests - Special values tests (zeros, max values) - Serialization roundtrip tests - Invariant enforcement tests (I5, I7) Closes #1778
There was a problem hiding this comment.
🤖 AI Code Reviewer
Reviewed by 1 agents | Quality score: 33% | Review time: 110.4s
🟡 1 warnings, 💡 1 suggestions, 📝 2 nitpicks. See inline comments.
🤖 Generated by AI Code Reviewer | Review ID: review-fe768c09
This comment has been minimized.
This comment has been minimized.
…issues - Reduce DoS limits to reasonable values: - MAX_ENTITIES_PER_PAGE: 10,000 -> 1,000 - MAX_SNAPSHOT_PAGES: 1,000,000 -> 10,000 (~2.5GB at 256KB/page) - MAX_DAG_HEADS: 1,000 -> 100 - Add MAX_COMPRESSED_PAYLOAD_SIZE (8MB) for payload validation - Add SnapshotPage::is_valid() checks: - Validate sent_count <= page_count (impossible state prevention) - Validate compressed payload size against MAX_COMPRESSED_PAYLOAD_SIZE - Add SnapshotEntityPage::is_valid() ordering invariants: - Check page_number < total_pages - Check is_last coherence with page_number - Fix SnapshotVerifyResult::to_error() to preserve error types: - Add SnapshotError::EntityCountMismatch variant - Add SnapshotError::MissingPages variant - Map verification results to specific errors instead of InvalidBoundary - Wire check_snapshot_safety() into runtime: - Call in request_snapshot_sync() before applying snapshot - Allow crash recovery (sync-in-progress marker) to bypass check - Add comprehensive tests for new validation logic Applied via @cursor push command
There was a problem hiding this comment.
🤖 AI Code Reviewer
Reviewed by 1 agents | Quality score: 33% | Review time: 102.4s
🟡 1 warnings, 💡 1 suggestions, 📝 1 nitpicks. See inline comments.
🤖 Generated by AI Code Reviewer | Review ID: review-80d889d1
| /// Whether this is the last page. | ||
| pub is_last: bool, | ||
| } | ||
|
|
There was a problem hiding this comment.
🟡 SnapshotEntityPage validation allows invalid state when total_pages == 0
When total_pages == 0, the bounds check page_number >= total_pages and the is_last coherence check are both skipped, allowing pages like {page_number: 999, total_pages: 0, is_last: true} to pass validation.
Suggested fix:
Add an explicit check: if `total_pages == 0`, either reject as invalid or only allow `page_number == 0`. Consider: `if self.total_pages == 0 && self.page_number != 0 { return false; }`
| /// Whether the initiator is definitely a fresh node (for safety check). | ||
| /// If false, responder SHOULD verify this claim. | ||
| pub is_fresh_node: bool, | ||
| } |
There was a problem hiding this comment.
💡 SnapshotRequest.is_fresh_node is trust-on-claim with no verification mechanism
The comment says responder 'SHOULD verify this claim' but no verification API is provided; receivers must either trust the claim or implement their own check.
Suggested fix:
Consider adding a `verify_fresh_node` callback or documenting the expected verification approach in the struct-level docs.
| ) -> Self { | ||
| Self { | ||
| root_hash, | ||
| total_entities, |
There was a problem hiding this comment.
📝 Nit: compression_ratio name may mislead—returns compressed/uncompressed
A ratio <1 means good compression; consider documenting that explicitly or renaming to compressed_to_uncompressed_ratio.
Suggested fix:
Add doc comment clarifying that values <1 indicate effective compression.
This comment has been minimized.
This comment has been minimized.
- Add 'force' parameter to request_snapshot_sync to allow divergence recovery paths that have existing state (bug 1) - Replace collect_context_state_keys().is_empty() with efficient has_context_state_keys() that exits early on first match (bug 2) - Check both state keys AND context metadata (root_hash) to properly detect initialized contexts, even after deletes (bug 3) The safety check now properly allows: - Fresh nodes (force=false): enforces no existing state - Crash recovery: bypasses check when sync-in-progress marker exists - Divergence recovery (force=true): explicitly allows existing state
There was a problem hiding this comment.
🤖 AI Code Reviewer
Reviewed by 1 agents | Quality score: 33% | Review time: 253.9s
🟡 1 warnings, 💡 1 suggestions, 📝 2 nitpicks. See inline comments.
🤖 Generated by AI Code Reviewer | Review ID: review-987b53ea
| return false; | ||
| } | ||
|
|
||
| // Check page number is within bounds (page_number is 0-indexed) |
There was a problem hiding this comment.
🟡 Validation allows total_pages=0 with non-empty entities
When total_pages is 0, the page_number and is_last coherence checks are both skipped, allowing semantically invalid states like having entities but claiming zero total pages.
Suggested fix:
Add a check: `if self.total_pages == 0 && !self.entities.is_empty() { return false; }` or require `total_pages >= 1` when entities exist.
| /// Maximum page size in bytes (0 = use responder's default). | ||
| pub max_page_size: u32, | ||
|
|
||
| /// Whether the initiator is definitely a fresh node (for safety check). |
There was a problem hiding this comment.
💡 is_fresh_node field is declared but never validated
The comment states 'responder SHOULD verify this claim' but no receiver-side validation code uses this field; the actual safety check uses check_snapshot_safety(has_local_state) instead.
Suggested fix:
Either remove the field if responders will always check their own state, add a TODO comment explaining when validation will be implemented, or document that this is metadata-only.
| #[must_use] | ||
| pub fn new( | ||
| root_hash: [u8; 32], | ||
| total_entities: usize, |
There was a problem hiding this comment.
📝 Nit: compression_ratio returns misleading value when uncompressed_size is 0
When uncompressed_size is 0, the ratio returns compressed_size/1 which is semantically incorrect (e.g., returns 100.0 for 100 compressed bytes with 0 uncompressed).
Suggested fix:
Consider returning None when uncompressed_size is 0, or document this edge case behavior in the doc comment.
|
|
||
| /// Check if page is within valid bounds. | ||
| #[must_use] | ||
| pub fn is_valid(&self) -> bool { |
There was a problem hiding this comment.
📝 Nit: SnapshotPage.is_valid() doesn't validate cursor coherence
Unlike SnapshotEntityPage, SnapshotPage.is_valid() doesn't check if cursor being None correlates with sent_count == page_count (last page condition).
Suggested fix:
Consider adding: `&& (self.cursor.is_some() || self.sent_count == self.page_count)` for consistency with is_last() semantics.
|
Bugbot Autofix prepared fixes for 1 of the 1 bugs found in the latest run.
Or push these changes by commenting: Preview (0dea6f7797)diff --git a/crates/node/src/sync/manager.rs b/crates/node/src/sync/manager.rs
--- a/crates/node/src/sync/manager.rs
+++ b/crates/node/src/sync/manager.rs
@@ -1404,9 +1404,12 @@
self.node_state
.start_sync_session(context_id, sync_start_hlc);
- // force=true: This is divergence recovery where existing state is expected
+ // force=false: Enforce Invariant I5 - only allow snapshot on fresh nodes.
+ // If the node has state, this will fail, which is correct - divergence
+ // or pruned history on initialized nodes cannot be safely resolved via
+ // snapshot overwrite. CRDT merge must be used instead.
let result = self
- .request_snapshot_sync(context_id, peer_id, true)
+ .request_snapshot_sync(context_id, peer_id, false)
.await?;
info!(%context_id, records = result.applied_records, "Snapshot sync completed");
diff --git a/crates/node/src/sync/snapshot.rs b/crates/node/src/sync/snapshot.rs
--- a/crates/node/src/sync/snapshot.rs
+++ b/crates/node/src/sync/snapshot.rs
@@ -249,8 +249,9 @@
// Check Invariant I5: Snapshot sync should only be used for fresh nodes
// OR for crash recovery (detected by sync-in-progress marker).
- // OR when explicitly forced for divergence recovery.
// This prevents accidental state overwrites on initialized nodes.
+ // NOTE: force=true is reserved for exceptional cases like test fixtures;
+ // divergence recovery must NOT bypass this check (see I5).
let is_crash_recovery = self.check_sync_in_progress(context_id)?.is_some();
if !force && !is_crash_recovery {
// Check both state keys and context metadata to determine initialization. |
There was a problem hiding this comment.
🤖 AI Code Reviewer
Reviewed by 1 agents | Quality score: 33% | Review time: 120.0s
🟡 1 warnings, 💡 2 suggestions, 📝 1 nitpicks. See inline comments.
🤖 Generated by AI Code Reviewer | Review ID: review-3d76bab1
| return false; | ||
| } | ||
|
|
||
| // Check page number is within bounds (page_number is 0-indexed) |
There was a problem hiding this comment.
🟡 Validation skips checks when total_pages is zero
When total_pages == 0, the page_number >= total_pages and is_last coherence checks are both skipped, allowing semantically invalid states like page_number=100, total_pages=0, is_last=true to pass validation.
Suggested fix:
Either disallow `total_pages == 0` (return false if total_pages == 0), or document that it represents 'unknown total' and add a comment explaining this intentional behavior.
| assert!(page.is_valid()); | ||
| } | ||
|
|
||
| #[test] |
There was a problem hiding this comment.
💡 Missing test coverage for total_pages=0 edge case
The test_snapshot_entity_page_validation test covers many edge cases but omits testing SnapshotEntityPage when total_pages == 0, which could allow unexpected validation results.
Suggested fix:
Add a test case like `SnapshotEntityPage::new(0, 0, vec![], true)` and `SnapshotEntityPage::new(100, 0, vec![], false)` to document expected behavior.
| /// Maximum page size in bytes (0 = use responder's default). | ||
| pub max_page_size: u32, | ||
|
|
||
| /// Whether the initiator is definitely a fresh node (for safety check). |
There was a problem hiding this comment.
💡 is_fresh_node field verification is ambiguous
The comment says responder 'SHOULD verify this claim' but there's no enforcement mechanism or verification logic provided; this could lead to inconsistent implementations.
Suggested fix:
Either remove the `is_fresh_node` field (since the receiver can't trust it anyway), or document what verification the responder should perform.
| // Snapshot Verification (Invariant I7) | ||
| // ============================================================================= | ||
|
|
||
| /// Result of verifying a snapshot. |
There was a problem hiding this comment.
📝 Nit: SnapshotVerifyResult lacks Borsh derives unlike sibling types
SnapshotVerifyResult only derives Clone, Debug, PartialEq while all other public types in this module also derive BorshSerialize, BorshDeserialize; if intentional (not meant for wire transmission), a doc comment would clarify this.
Suggested fix:
Add a brief doc comment explaining this is a local verification enum not intended for serialization, or add the Borsh derives for consistency.
There was a problem hiding this comment.
🤖 AI Code Reviewer
Reviewed by 1 agents | Quality score: 33% | Review time: 243.8s
💡 1 suggestions, 📝 2 nitpicks. See inline comments.
🤖 Generated by AI Code Reviewer | Review ID: review-fe0b208b
| pub struct SnapshotEntityPage { | ||
| /// Page number (0-indexed). | ||
| pub page_number: usize, | ||
|
|
There was a problem hiding this comment.
💡 Validation gap when total_pages is zero
When total_pages = 0, both the page_number bounds check and is_last coherence check are skipped, allowing arbitrary page_number values (e.g., page_number = 999999, total_pages = 0 would pass).
Suggested fix:
Consider either rejecting `total_pages = 0` as invalid, or adding `if self.total_pages == 0 && self.page_number != 0 { return false; }` to ensure page_number is 0 when total_pages is 0.
| pub fn is_valid(&self) -> bool { | ||
| // Check entity count limit | ||
| if self.entities.len() > MAX_ENTITIES_PER_PAGE { | ||
| return false; |
There was a problem hiding this comment.
📝 Nit: compression_ratio could avoid casting for clarity
The max(1) prevents division by zero, but the intent would be clearer if the function returned None when uncompressed_size == 0 rather than returning a potentially misleading 100.0 ratio.
Suggested fix:
Consider returning `None` when `uncompressed_size == 0` for semantic correctness: `if self.uncompressed_size == 0 { return None; }`
| // snapshot overwrite. CRDT merge must be used instead. | ||
| let result = self | ||
| .request_snapshot_sync(context_id, peer_id, false) | ||
| .await?; |
There was a problem hiding this comment.
Fallback snapshot leaves stale buffer session
High Severity
fallback_to_snapshot_sync now calls request_snapshot_sync(..., false), which triggers check_snapshot_safety for initialized contexts and returns early. The function has already started a sync session, so end_sync_session is skipped on error, leaving NodeState.sync_sessions active and causing later deltas to be buffered instead of applied.
Additional Locations (1)
| // snapshot overwrite. CRDT merge must be used instead. | ||
| let result = self | ||
| .request_snapshot_sync(context_id, peer_id, false) | ||
| .await?; |
There was a problem hiding this comment.
Divergence fallback is now permanently blocked
High Severity
fallback_to_snapshot_sync now passes force = false, so request_snapshot_sync enforces check_snapshot_safety and rejects initialized contexts. This fallback is invoked for initialized-node divergence and pruned-history cases, so those recovery paths now fail consistently instead of reconciling state.
Additional Locations (1)
|
|
||
| // Check is_last coherence: if is_last, must be the final page | ||
| if self.is_last && self.total_pages > 0 && self.page_number + 1 != self.total_pages { | ||
| return false; |
There was a problem hiding this comment.
Pagination validation accepts inconsistent page states
Medium Severity
SnapshotEntityPage::is_valid() only enforces one direction of is_last coherence and skips bounds checks when total_pages == 0. A page can pass validation with impossible pagination metadata, like non-final pages flagged as final consistency-wise or pages with entities while total_pages is zero.
| && self.page_count <= MAX_SNAPSHOT_PAGES as u64 | ||
| && self.sent_count <= self.page_count | ||
| && self.payload.len() <= MAX_COMPRESSED_PAYLOAD_SIZE | ||
| } |
There was a problem hiding this comment.
Snapshot page validation misses termination invariants
Medium Severity
SnapshotPage::is_valid() does not validate cursor coherence with sent_count and page_count. It accepts pages marked terminal by cursor == None while sent_count < page_count, or non-terminal pages with sent_count == page_count. That allows inconsistent pagination metadata to pass trusted validation.



Summary
Types Added
SnapshotRequestSnapshotEntitySnapshotEntityPageSnapshotCompleteSnapshotVerifyResultEnhanced SnapshotError
New variants for protocol-specific failure modes:
SnapshotOnInitializedNode(Invariant I5 enforcement)RootHashMismatch(Invariant I7 enforcement)TransferInterruptedDecompressionFailedEntityCountMismatchMissingPagesSecurity / DoS Protection
Constants for resource limits:
MAX_ENTITIES_PER_PAGE: 1,000MAX_SNAPSHOT_PAGES: 10,000MAX_SNAPSHOT_PAGE_SIZE: 4 MiBMAX_COMPRESSED_PAYLOAD_SIZE: 8 MiBMAX_ENTITY_DATA_SIZE: 1 MiBMAX_DAG_HEADS: 100All types implement
is_valid()methods with comprehensive validation:sent_count <= page_countcheckpage_number < total_pagesordering invariantis_lastcoherence withpage_numberRuntime Safety (Invariant I5)
check_snapshot_safety()enforced inrequest_snapshot_sync()forceparameter for divergence recovery scenarioshas_context_state_keys()for early-exit state checkroot_hash != [0u8; 32])Invariants
SnapshotCompleteincludes root hash for verificationTest Plan
cargo checkpassescargo clippypassescargo fmtpassesCloses #1778
Supersedes #1874, #1936, #1938
Note
Medium Risk
Touches sync bootstrap/fallback behavior and adds a safety gate that can change when snapshot sync is allowed, which could affect node recovery/bootstrap if the initialization detection is wrong.
Overview
Adds a richer snapshot-sync wire/types surface in
node/primitives(newSnapshotRequest/entity paging/completion/verification types, newSnapshotErrorvariants, and resource-limit constants) and exposes a consistentis_valid()/clamping pattern for data received from untrusted peers.Enforces Invariant I5 at runtime by adding a
forceflag torequest_snapshot_sync()and performing a “has local state” check (datastore keys + context metadata), allowing snapshot overwrite only for fresh-node bootstrap or crash-recovery marker cases. Updates sync manager call sites accordingly and adds an early-exithas_context_state_keys()helper to avoid scanning full keysets just to detect initialization.Written by Cursor Bugbot for commit 320ca5d. This will update automatically on new commits. Configure here.