Skip to content

feat(sync): add unified metrics interface for sync protocol observability#2007

Merged
xilosada merged 2 commits intomasterfrom
feat/sync-metrics-observability
Feb 16, 2026
Merged

feat(sync): add unified metrics interface for sync protocol observability#2007
xilosada merged 2 commits intomasterfrom
feat/sync-metrics-observability

Conversation

@xilosada
Copy link
Member

@xilosada xilosada commented Feb 16, 2026

Summary

Implement a unified metrics interface (SyncMetricsCollector trait) for sync protocol observability that works for both simulation (deterministic benchmarking) and production (Prometheus).

Key changes:

  • Add sync::metrics module with SyncMetricsCollector trait and PhaseTimer
  • Add sync::prometheus_metrics module with production Prometheus implementation
  • Add sync_sim::metrics_adapter bridging the trait to simulation metrics
  • Add sync_sim::benchmarks using the trait for protocol benchmarking
  • Document integration patterns in crates/server/src/metrics.rs

Metrics collected:

  • Protocol cost: messages, bytes, round trips, entities transferred, CRDT merges
  • Phase timing: handshake, data_transfer, merge, sync_total
  • CIP invariant monitoring: I5 (snapshot_blocked), I6 (buffer_drops), I7 (verification_failures)
  • Sync session lifecycle: start, complete, failure

Implementations:

  • NoOpMetrics - Zero-overhead disabled metrics
  • PrometheusSyncMetrics - Production Prometheus with labeled counters/histograms
  • SimMetricsCollector - Simulation adapter for benchmarking

Test plan

  • Unit tests for SyncMetricsCollector trait (3 tests)
  • Unit tests for PrometheusSyncMetrics (3 tests)
  • Unit tests for SimMetricsCollector adapter (9 tests)
  • Integration tests validating trait works with simulation data (2 tests)
  • Protocol benchmarks exercising the trait (6 tests)
  • cargo fmt - pass
  • cargo clippy - no new warnings
  • All 23 metrics-related tests pass

Note

Medium Risk
Mostly additive observability code, but it expands exported sync APIs and introduces Prometheus label/bucket choices that could impact metrics cardinality and performance if misused.

Overview
Introduces a unified sync observability surface via SyncMetricsCollector and PhaseTimer, including a NoOpMetrics implementation and public re-exports from sync::mod.

Adds a production PrometheusSyncMetrics collector registering labeled counters/histograms for protocol cost, phase timing, safety invariants (I5/I6/I7), and session outcomes, with label sanitization to avoid unbounded cardinality.

Extends the sync simulation test suite with a SimMetricsCollector adapter and new deterministic benchmarks/integration tests that exercise the trait end-to-end, and updates server metrics docs to describe wiring sync metrics into the /metrics registry.

Written by Cursor Bugbot for commit 80678b5. This will update automatically on new commits. Configure here.

Copy link

@meroreviewer meroreviewer bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 AI Code Reviewer

Reviewed by 3 agents | Quality score: 95% | Review time: 203.3s

🟡 4 warnings, 💡 3 suggestions. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-d319aeb9

…lity

Implement a unified metrics interface that works for both simulation
(deterministic benchmarking) and production (Prometheus observability).

New modules:
- `sync::metrics` - SyncMetricsCollector trait with PhaseTimer
- `sync::prometheus_metrics` - Production Prometheus implementation
- `sync_sim::metrics_adapter` - Simulation adapter bridging to SimMetrics
- `sync_sim::benchmarks` - Protocol benchmarks using the trait

Key features:
- Protocol cost metrics: messages, bytes, round trips, entities, merges
- Phase timing: handshake, data_transfer, merge, sync_total
- CIP invariant monitoring: I5 (snapshot_blocked), I6 (buffer_drops),
  I7 (verification_failures)
- NoOpMetrics for zero-overhead disabled metrics
- Full integration testing with simulation framework

All 23 tests pass validating the trait implementation works correctly
with real simulation data.
@xilosada xilosada force-pushed the feat/sync-metrics-observability branch from 614bd9d to e9eed76 Compare February 16, 2026 12:45
Copy link

@meroreviewer meroreviewer bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 AI Code Reviewer

Reviewed by 3 agents | Quality score: 100% | Review time: 225.0s

🟡 2 warnings, 💡 2 suggestions. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-b87581e4

@cursor

This comment has been minimized.

@xilosada
Copy link
Member Author

@cursor push 1241851

@cursor
Copy link
Contributor

cursor bot commented Feb 16, 2026

Could not push Autofix changes. The PR branch may have changed since the Autofix ran, or the Autofix commit may no longer exist.

- Add protocol parameter to record_sync_complete and record_sync_failure
  to fix hardcoded 'all' label losing cardinality
- Add sanitize_protocol() and sanitize_crdt_type() to prevent unbounded
  label cardinality from untrusted input (whitelist known values)
- Fix O(n) loop in record_entities_transferred by using direct increment
- Implement record_protocol_selected with a counter instead of no-op
- Fix sync_successes_total never being incremented (Bugbot fix)
- Fix benchmark byte totals undercount from integer division (Bugbot fix)
- Fix Box::leak memory leak in scaling benchmarks (Bugbot fix)
Copy link
Contributor

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is ON. A Cloud Agent has been kicked off to fix the reported issue.

Copy link

@meroreviewer meroreviewer bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 AI Code Reviewer

Reviewed by 3 agents | Quality score: 92% | Review time: 174.2s

💡 4 suggestions. See inline comments.


🤖 Generated by AI Code Reviewer | Review ID: review-55a3f1ea

@cursor
Copy link
Contributor

cursor bot commented Feb 16, 2026

Bugbot Autofix prepared fixes for 1 of the 1 bugs found in the latest run.

  • ✅ Fixed: Convergence timing never recorded in benchmarks
    • Made record_sync_complete in SimMetricsCollector actually store the duration by converting std::time::Duration to SimTime and updating the convergence metrics.

View PR

Or push these changes by commenting:

@cursor push 0884931133
Preview (0884931133)
diff --git a/Cargo.lock b/Cargo.lock
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -1563,24 +1563,6 @@
 ]
 
 [[package]]
-name = "bindgen"
-version = "0.72.1"
-source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "993776b509cfb49c750f11b8f07a46fa23e0a1386ffc01fb1e7d343efc387895"
-dependencies = [
- "bitflags 2.10.0",
- "cexpr",
- "clang-sys",
- "itertools 0.13.0",
- "proc-macro2",
- "quote",
- "regex",
- "rustc-hash 2.1.1",
- "shlex",
- "syn 2.0.108",
-]
-
-[[package]]
 name = "binread"
 version = "2.2.0"
 source = "registry+https://github.com/rust-lang/crates.io-index"
@@ -6387,8 +6369,6 @@
  "glob",
  "libc",
  "libz-sys",
- "lz4-sys",
- "zstd-sys",
 ]
 
 [[package]]
@@ -6473,16 +6453,6 @@
 checksum = "112b39cec0b298b6c1999fee3e31427f74f676e4cb9879ed1a121b43661a4154"
 
 [[package]]
-name = "lz4-sys"
-version = "1.11.1+lz4-1.10.0"
-source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "6bd8c0d6c6ed0cd30b3652886bb8711dc4bb01d637a68105a3d5158039b418e6"
-dependencies = [
- "cc",
- "libc",
-]
-
-[[package]]
 name = "lz4_flex"
 version = "0.11.5"
 source = "registry+https://github.com/rust-lang/crates.io-index"
@@ -12661,7 +12631,6 @@
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "91e19ebc2adc8f83e43039e79776e3fda8ca919132d68a1fed6a5faca2683748"
 dependencies = [
- "bindgen 0.72.1",
  "cc",
  "pkg-config",
 ]

diff --git a/crates/node/tests/sync_sim/metrics_adapter.rs b/crates/node/tests/sync_sim/metrics_adapter.rs
--- a/crates/node/tests/sync_sim/metrics_adapter.rs
+++ b/crates/node/tests/sync_sim/metrics_adapter.rs
@@ -144,10 +144,16 @@
         &self,
         _context_id: &str,
         _protocol: &str,
-        _duration: Duration,
+        duration: Duration,
         _entities: usize,
     ) {
-        // Simulation uses convergence metrics instead
+        use super::runtime::SimTime;
+
+        let mut guard = self.metrics.lock().expect("metrics lock poisoned");
+        // Convert std::time::Duration to SimTime (both use microseconds internally)
+        let sim_time = SimTime::from_micros(duration.as_micros() as u64);
+        guard.convergence.converged = true;
+        guard.convergence.time_to_converge = Some(sim_time);
     }
 
     fn record_sync_failure(&self, _context_id: &str, _protocol: &str, _reason: &str) {

@xilosada xilosada merged commit a2983bb into master Feb 16, 2026
22 checks passed
@xilosada xilosada deleted the feat/sync-metrics-observability branch February 16, 2026 13:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants