Skip to content

feat(sync): 011 - Snapshot Sync (Fresh Nodes Only)#1933

Merged
xilosada merged 5 commits intomasterfrom
feat/sync-011-snapshot-sync
Feb 11, 2026
Merged

feat(sync): 011 - Snapshot Sync (Fresh Nodes Only)#1933
xilosada merged 5 commits intomasterfrom
feat/sync-011-snapshot-sync

Conversation

@xilosada
Copy link
Member

@xilosada xilosada commented Feb 11, 2026

Summary

Types Added

Type Purpose
SnapshotRequest Initiate snapshot transfer (compressed/uncompressed)
SnapshotEntity Single entity with metadata and collection info
SnapshotEntityPage Paginated entities for transfer
SnapshotComplete Completion marker with root hash and DAG heads
SnapshotVerifyResult Verification result enum

Enhanced SnapshotError

New variants for protocol-specific failure modes:

  • SnapshotOnInitializedNode (Invariant I5 enforcement)
  • RootHashMismatch (Invariant I7 enforcement)
  • TransferInterrupted
  • DecompressionFailed
  • EntityCountMismatch
  • MissingPages

Security / DoS Protection

Constants for resource limits:

  • MAX_ENTITIES_PER_PAGE: 1,000
  • MAX_SNAPSHOT_PAGES: 10,000
  • MAX_SNAPSHOT_PAGE_SIZE: 4 MiB
  • MAX_COMPRESSED_PAYLOAD_SIZE: 8 MiB
  • MAX_ENTITY_DATA_SIZE: 1 MiB
  • MAX_DAG_HEADS: 100

All types implement is_valid() methods with comprehensive validation:

  • sent_count <= page_count check
  • page_number < total_pages ordering invariant
  • is_last coherence with page_number
  • Compressed payload size validation

Runtime Safety (Invariant I5)

  • check_snapshot_safety() enforced in request_snapshot_sync()
  • force parameter for divergence recovery scenarios
  • Efficient has_context_state_keys() for early-exit state check
  • Context metadata freshness check (root_hash != [0u8; 32])

Invariants

  • I5 (Snapshot Safety): Prevents snapshot application on initialized nodes
  • I7 (Root Hash Verification): SnapshotComplete includes root hash for verification

Test Plan

  • All 140 sync tests pass
  • cargo check passes
  • cargo clippy passes
  • cargo fmt passes

Closes #1778
Supersedes #1874, #1936, #1938


Note

Medium Risk
Touches sync bootstrap/fallback behavior and adds a safety gate that can change when snapshot sync is allowed, which could affect node recovery/bootstrap if the initialization detection is wrong.

Overview
Adds a richer snapshot-sync wire/types surface in node/primitives (new SnapshotRequest/entity paging/completion/verification types, new SnapshotError variants, and resource-limit constants) and exposes a consistent is_valid()/clamping pattern for data received from untrusted peers.

Enforces Invariant I5 at runtime by adding a force flag to request_snapshot_sync() and performing a “has local state” check (datastore keys + context metadata), allowing snapshot overwrite only for fresh-node bootstrap or crash-recovery marker cases. Updates sync manager call sites accordingly and adds an early-exit has_context_state_keys() helper to avoid scanning full keysets just to detect initialization.

Written by Cursor Bugbot for commit 320ca5d. This will update automatically on new commits. Configure here.

Add snapshot sync types for fresh node bootstrap (CIP §6).

Types Added:
- SnapshotRequest: Initiate snapshot transfer (compressed/uncompressed)
- SnapshotEntity: Single entity with metadata and collection info
- SnapshotEntityPage: Paginated entities for transfer
- SnapshotComplete: Completion marker with root hash and DAG heads
- SnapshotVerifyResult: Verification result enum

Enhanced SnapshotError with new variants:
- SnapshotOnInitializedNode (Invariant I5)
- RootHashMismatch (Invariant I7)
- TransferInterrupted
- DecompressionFailed

Security:
- DoS protection constants (MAX_ENTITIES_PER_PAGE, MAX_SNAPSHOT_PAGES, etc.)
- Validation via is_valid() methods on all types
- check_snapshot_safety() function for I5 enforcement

Tests:
- 54 comprehensive tests covering all types
- Boundary condition tests (at-limit, over-limit)
- Security/exploit prevention tests
- Special values tests (zeros, max values)
- Serialization roundtrip tests
- Invariant enforcement tests (I5, I7)

Closes #1778
Copy link

@meroreviewer meroreviewer bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 AI Code Reviewer

Reviewed by 1 agents | Quality score: 33% | Review time: 110.4s

🟡 1 warnings, 💡 1 suggestions, 📝 2 nitpicks. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-fe768c09

@cursor

This comment has been minimized.

@xilosada
Copy link
Member Author

@cursor push bc8dec4

…issues

- Reduce DoS limits to reasonable values:
  - MAX_ENTITIES_PER_PAGE: 10,000 -> 1,000
  - MAX_SNAPSHOT_PAGES: 1,000,000 -> 10,000 (~2.5GB at 256KB/page)
  - MAX_DAG_HEADS: 1,000 -> 100
  - Add MAX_COMPRESSED_PAYLOAD_SIZE (8MB) for payload validation

- Add SnapshotPage::is_valid() checks:
  - Validate sent_count <= page_count (impossible state prevention)
  - Validate compressed payload size against MAX_COMPRESSED_PAYLOAD_SIZE

- Add SnapshotEntityPage::is_valid() ordering invariants:
  - Check page_number < total_pages
  - Check is_last coherence with page_number

- Fix SnapshotVerifyResult::to_error() to preserve error types:
  - Add SnapshotError::EntityCountMismatch variant
  - Add SnapshotError::MissingPages variant
  - Map verification results to specific errors instead of InvalidBoundary

- Wire check_snapshot_safety() into runtime:
  - Call in request_snapshot_sync() before applying snapshot
  - Allow crash recovery (sync-in-progress marker) to bypass check

- Add comprehensive tests for new validation logic

Applied via @cursor push command
Copy link

@meroreviewer meroreviewer bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 AI Code Reviewer

Reviewed by 1 agents | Quality score: 33% | Review time: 102.4s

🟡 1 warnings, 💡 1 suggestions, 📝 1 nitpicks. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-80d889d1

/// Whether this is the last page.
pub is_last: bool,
}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 SnapshotEntityPage validation allows invalid state when total_pages == 0

When total_pages == 0, the bounds check page_number >= total_pages and the is_last coherence check are both skipped, allowing pages like {page_number: 999, total_pages: 0, is_last: true} to pass validation.

Suggested fix:

Add an explicit check: if `total_pages == 0`, either reject as invalid or only allow `page_number == 0`. Consider: `if self.total_pages == 0 && self.page_number != 0 { return false; }`

/// Whether the initiator is definitely a fresh node (for safety check).
/// If false, responder SHOULD verify this claim.
pub is_fresh_node: bool,
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 SnapshotRequest.is_fresh_node is trust-on-claim with no verification mechanism

The comment says responder 'SHOULD verify this claim' but no verification API is provided; receivers must either trust the claim or implement their own check.

Suggested fix:

Consider adding a `verify_fresh_node` callback or documenting the expected verification approach in the struct-level docs.

) -> Self {
Self {
root_hash,
total_entities,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📝 Nit: compression_ratio name may mislead—returns compressed/uncompressed

A ratio <1 means good compression; consider documenting that explicitly or renaming to compressed_to_uncompressed_ratio.

Suggested fix:

Add doc comment clarifying that values <1 indicate effective compression.

@cursor

This comment has been minimized.

- Add 'force' parameter to request_snapshot_sync to allow divergence
  recovery paths that have existing state (bug 1)
- Replace collect_context_state_keys().is_empty() with efficient
  has_context_state_keys() that exits early on first match (bug 2)
- Check both state keys AND context metadata (root_hash) to properly
  detect initialized contexts, even after deletes (bug 3)

The safety check now properly allows:
- Fresh nodes (force=false): enforces no existing state
- Crash recovery: bypasses check when sync-in-progress marker exists
- Divergence recovery (force=true): explicitly allows existing state
This was referenced Feb 11, 2026
Copy link

@meroreviewer meroreviewer bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 AI Code Reviewer

Reviewed by 1 agents | Quality score: 33% | Review time: 253.9s

🟡 1 warnings, 💡 1 suggestions, 📝 2 nitpicks. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-987b53ea

return false;
}

// Check page number is within bounds (page_number is 0-indexed)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Validation allows total_pages=0 with non-empty entities

When total_pages is 0, the page_number and is_last coherence checks are both skipped, allowing semantically invalid states like having entities but claiming zero total pages.

Suggested fix:

Add a check: `if self.total_pages == 0 && !self.entities.is_empty() { return false; }` or require `total_pages >= 1` when entities exist.

/// Maximum page size in bytes (0 = use responder's default).
pub max_page_size: u32,

/// Whether the initiator is definitely a fresh node (for safety check).
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 is_fresh_node field is declared but never validated

The comment states 'responder SHOULD verify this claim' but no receiver-side validation code uses this field; the actual safety check uses check_snapshot_safety(has_local_state) instead.

Suggested fix:

Either remove the field if responders will always check their own state, add a TODO comment explaining when validation will be implemented, or document that this is metadata-only.

#[must_use]
pub fn new(
root_hash: [u8; 32],
total_entities: usize,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📝 Nit: compression_ratio returns misleading value when uncompressed_size is 0

When uncompressed_size is 0, the ratio returns compressed_size/1 which is semantically incorrect (e.g., returns 100.0 for 100 compressed bytes with 0 uncompressed).

Suggested fix:

Consider returning None when uncompressed_size is 0, or document this edge case behavior in the doc comment.


/// Check if page is within valid bounds.
#[must_use]
pub fn is_valid(&self) -> bool {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📝 Nit: SnapshotPage.is_valid() doesn't validate cursor coherence

Unlike SnapshotEntityPage, SnapshotPage.is_valid() doesn't check if cursor being None correlates with sent_count == page_count (last page condition).

Suggested fix:

Consider adding: `&& (self.cursor.is_some() || self.sent_count == self.page_count)` for consistency with is_last() semantics.

@cursor
Copy link
Contributor

cursor bot commented Feb 11, 2026

Bugbot Autofix prepared fixes for 1 of the 1 bugs found in the latest run.

  • ✅ Fixed: Fresh-node safety check can be bypassed
    • Changed force=true to force=false in fallback_to_snapshot_sync to enforce Invariant I5 and prevent snapshot overwrites on initialized nodes during divergence handling.

View PR

Or push these changes by commenting:

@cursor push 0dea6f7797
Preview (0dea6f7797)
diff --git a/crates/node/src/sync/manager.rs b/crates/node/src/sync/manager.rs
--- a/crates/node/src/sync/manager.rs
+++ b/crates/node/src/sync/manager.rs
@@ -1404,9 +1404,12 @@
         self.node_state
             .start_sync_session(context_id, sync_start_hlc);
 
-        // force=true: This is divergence recovery where existing state is expected
+        // force=false: Enforce Invariant I5 - only allow snapshot on fresh nodes.
+        // If the node has state, this will fail, which is correct - divergence
+        // or pruned history on initialized nodes cannot be safely resolved via
+        // snapshot overwrite. CRDT merge must be used instead.
         let result = self
-            .request_snapshot_sync(context_id, peer_id, true)
+            .request_snapshot_sync(context_id, peer_id, false)
             .await?;
         info!(%context_id, records = result.applied_records, "Snapshot sync completed");
 

diff --git a/crates/node/src/sync/snapshot.rs b/crates/node/src/sync/snapshot.rs
--- a/crates/node/src/sync/snapshot.rs
+++ b/crates/node/src/sync/snapshot.rs
@@ -249,8 +249,9 @@
 
         // Check Invariant I5: Snapshot sync should only be used for fresh nodes
         // OR for crash recovery (detected by sync-in-progress marker).
-        // OR when explicitly forced for divergence recovery.
         // This prevents accidental state overwrites on initialized nodes.
+        // NOTE: force=true is reserved for exceptional cases like test fixtures;
+        // divergence recovery must NOT bypass this check (see I5).
         let is_crash_recovery = self.check_sync_in_progress(context_id)?.is_some();
         if !force && !is_crash_recovery {
             // Check both state keys and context metadata to determine initialization.

@xilosada xilosada merged commit b0215f5 into master Feb 11, 2026
5 checks passed
@xilosada xilosada deleted the feat/sync-011-snapshot-sync branch February 11, 2026 12:59
Copy link

@meroreviewer meroreviewer bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 AI Code Reviewer

Reviewed by 1 agents | Quality score: 33% | Review time: 120.0s

🟡 1 warnings, 💡 2 suggestions, 📝 1 nitpicks. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-3d76bab1

return false;
}

// Check page number is within bounds (page_number is 0-indexed)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Validation skips checks when total_pages is zero

When total_pages == 0, the page_number >= total_pages and is_last coherence checks are both skipped, allowing semantically invalid states like page_number=100, total_pages=0, is_last=true to pass validation.

Suggested fix:

Either disallow `total_pages == 0` (return false if total_pages == 0), or document that it represents 'unknown total' and add a comment explaining this intentional behavior.

assert!(page.is_valid());
}

#[test]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Missing test coverage for total_pages=0 edge case

The test_snapshot_entity_page_validation test covers many edge cases but omits testing SnapshotEntityPage when total_pages == 0, which could allow unexpected validation results.

Suggested fix:

Add a test case like `SnapshotEntityPage::new(0, 0, vec![], true)` and `SnapshotEntityPage::new(100, 0, vec![], false)` to document expected behavior.

/// Maximum page size in bytes (0 = use responder's default).
pub max_page_size: u32,

/// Whether the initiator is definitely a fresh node (for safety check).
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 is_fresh_node field verification is ambiguous

The comment says responder 'SHOULD verify this claim' but there's no enforcement mechanism or verification logic provided; this could lead to inconsistent implementations.

Suggested fix:

Either remove the `is_fresh_node` field (since the receiver can't trust it anyway), or document what verification the responder should perform.

// Snapshot Verification (Invariant I7)
// =============================================================================

/// Result of verifying a snapshot.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📝 Nit: SnapshotVerifyResult lacks Borsh derives unlike sibling types

SnapshotVerifyResult only derives Clone, Debug, PartialEq while all other public types in this module also derive BorshSerialize, BorshDeserialize; if intentional (not meant for wire transmission), a doc comment would clarify this.

Suggested fix:

Add a brief doc comment explaining this is a local verification enum not intended for serialization, or add the Borsh derives for consistency.

Copy link

@meroreviewer meroreviewer bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 AI Code Reviewer

Reviewed by 1 agents | Quality score: 33% | Review time: 243.8s

💡 1 suggestions, 📝 2 nitpicks. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-fe0b208b

pub struct SnapshotEntityPage {
/// Page number (0-indexed).
pub page_number: usize,

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Validation gap when total_pages is zero

When total_pages = 0, both the page_number bounds check and is_last coherence check are skipped, allowing arbitrary page_number values (e.g., page_number = 999999, total_pages = 0 would pass).

Suggested fix:

Consider either rejecting `total_pages = 0` as invalid, or adding `if self.total_pages == 0 && self.page_number != 0 { return false; }` to ensure page_number is 0 when total_pages is 0.

pub fn is_valid(&self) -> bool {
// Check entity count limit
if self.entities.len() > MAX_ENTITIES_PER_PAGE {
return false;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📝 Nit: compression_ratio could avoid casting for clarity

The max(1) prevents division by zero, but the intent would be clearer if the function returned None when uncompressed_size == 0 rather than returning a potentially misleading 100.0 ratio.

Suggested fix:

Consider returning `None` when `uncompressed_size == 0` for semantic correctness: `if self.uncompressed_size == 0 { return None; }`

Copy link
Contributor

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 4 potential issues.

Bugbot Autofix is ON, but it could not run because the branch was deleted or merged before Autofix could start.

// snapshot overwrite. CRDT merge must be used instead.
let result = self
.request_snapshot_sync(context_id, peer_id, false)
.await?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fallback snapshot leaves stale buffer session

High Severity

fallback_to_snapshot_sync now calls request_snapshot_sync(..., false), which triggers check_snapshot_safety for initialized contexts and returns early. The function has already started a sync session, so end_sync_session is skipped on error, leaving NodeState.sync_sessions active and causing later deltas to be buffered instead of applied.

Additional Locations (1)

Fix in Cursor Fix in Web

// snapshot overwrite. CRDT merge must be used instead.
let result = self
.request_snapshot_sync(context_id, peer_id, false)
.await?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Divergence fallback is now permanently blocked

High Severity

fallback_to_snapshot_sync now passes force = false, so request_snapshot_sync enforces check_snapshot_safety and rejects initialized contexts. This fallback is invoked for initialized-node divergence and pruned-history cases, so those recovery paths now fail consistently instead of reconciling state.

Additional Locations (1)

Fix in Cursor Fix in Web


// Check is_last coherence: if is_last, must be the final page
if self.is_last && self.total_pages > 0 && self.page_number + 1 != self.total_pages {
return false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pagination validation accepts inconsistent page states

Medium Severity

SnapshotEntityPage::is_valid() only enforces one direction of is_last coherence and skips bounds checks when total_pages == 0. A page can pass validation with impossible pagination metadata, like non-final pages flagged as final consistency-wise or pages with entities while total_pages is zero.

Fix in Cursor Fix in Web

&& self.page_count <= MAX_SNAPSHOT_PAGES as u64
&& self.sent_count <= self.page_count
&& self.payload.len() <= MAX_COMPRESSED_PAYLOAD_SIZE
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Snapshot page validation misses termination invariants

Medium Severity

SnapshotPage::is_valid() does not validate cursor coherence with sent_count and page_count. It accepts pages marked terminal by cursor == None while sent_count < page_count, or non-terminal pages with sent_count == page_count. That allows inconsistent pagination metadata to pass trusted validation.

Fix in Cursor Fix in Web

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Sync Protocol] 011: Snapshot Sync (Fresh Nodes Only)

2 participants