
Conversation

@chetanyb (Contributor) commented on Dec 14, 2025

Summary

Adds real-time fork-choice tree visualization via the /api/forkchoice/graph endpoint, along with thread-safety fixes to the forkchoice module. Implements a locked/unlocked pattern for five critical functions and eliminates unsafe direct field access.

Key Features

Visualization API

  • New HTTP endpoint serving Grafana node-graph compatible JSON (response shape sketched after this list)
  • Configurable history via ?slots=N (default: 50, max: 200)
  • Color-coded arc borders representing consensus states:
    • 🟣 Purple: Finalized blocks (canonical chain)
    • 🔵 Blue: Justified checkpoint
    • 🟠 Orange: Current head
    • 🟢 Green: Normal blocks (see TODO)
    • ⚫ Gray: Orphaned blocks (historical forks)
  • Arc completion represents validator weight
  • New endpoint /lean/states/finalized (SSZ checkpoint state).
  • Unified API server for /metrics, /health, /events, /api/forkchoice/graph, /lean/states/finalized.
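
A rough sketch of the response shape, assuming the Grafana node-graph panel's field conventions (arc__* fractions for the ring, a color field for the border); the exact field names and serialization are not shown in this PR, so treat the Zig structs below as illustrative:

    // Illustrative response shape for /api/forkchoice/graph, assuming Grafana
    // node-graph field conventions; not necessarily the exact serialization.
    const GraphNode = struct {
        id: []const u8, // block root (hex)
        title: []const u8, // shortened root for display
        subTitle: []const u8, // e.g. "slot 1234"
        mainStat: []const u8, // finalized / justified / head / normal / orphaned
        arc__weight: f64, // fraction of validator weight backing this block (0..1)
        color: []const u8, // border color encoding the consensus state
    };

    const GraphEdge = struct {
        id: []const u8,
        source: []const u8, // parent block root
        target: []const u8, // child block root
    };

    const GraphResponse = struct {
        nodes: []const GraphNode,
        edges: []const GraphEdge,
    };

A request like GET /api/forkchoice/graph?slots=100 would then return nodes and edges covering up to 100 slots of history.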

Thread Safety

Locking Pattern Implementation

  • Fixed 5 functions missing thread-safety protection:

    • getCanonicalView (READ operations - lockShared)
    • getCanonicalityAnalysis (READ operations - lockShared)
    • getCanonicalAncestorAtDepth (READ operations - lockShared)
    • rebase (WRITE operations - exclusive lock)
    • confirmBlock (WRITE operations - exclusive lock)
  • Implemented a locked/unlocked pattern (see the sketch after this list):

    • Internal *Unlocked versions for efficient internal calls
    • Public wrappers with appropriate locks
    • No nested locking patterns detected
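
A minimal sketch of the pattern, assuming a std.Thread.RwLock guards the store; the field and helper names below are illustrative rather than the exact ones in the forkchoice module:

    const std = @import("std");

    const Root = [32]u8;

    const ForkChoice = struct {
        lock: std.Thread.RwLock = .{},
        head: Root, // protected by `lock`

        /// Public read path: shared lock, so concurrent readers never block
        /// each other, then delegate to the *Unlocked internal.
        pub fn getHead(self: *ForkChoice) Root {
            self.lock.lockShared();
            defer self.lock.unlockShared();
            return self.getHeadUnlocked();
        }

        /// Internal *Unlocked version for callers that already hold the lock
        /// (the lock is not re-entrant, so re-acquiring would deadlock).
        fn getHeadUnlocked(self: *const ForkChoice) Root {
            return self.head;
        }

        /// Public write path: exclusive lock around the *Unlocked internal.
        pub fn confirmBlock(self: *ForkChoice, root: Root) void {
            self.lock.lock();
            defer self.lock.unlock();
            self.confirmBlockUnlocked(root);
        }

        fn confirmBlockUnlocked(self: *ForkChoice, root: Root) void {
            self.head = root; // placeholder for the real confirmation logic
        }
    };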

Eliminated Direct Field Access

  • Replaced unsafe direct access in chain.zig with thread-safe getters:
    • self.forkChoice.head → getHead()
    • self.forkChoice.fcStore → getLatestJustified(), getLatestFinalized()
    • self.forkChoice.protoArray.* → getNodeCount(), getBlockSlot()

Concurrency Benefits

  • Snapshot uses shared lock (allows concurrent reads)
  • Multiple readers don't block each other
  • Visualization doesn't block block processing (see the sketch below)
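
A sketch of how the graph handler can keep lock hold times short: copy a bounded snapshot under the shared lock, release it, and only then build the JSON. snapshotUnlocked and serializeNodeGraph are hypothetical helper names used here for illustration:

    fn buildGraphJson(self: *ApiServer, allocator: std.mem.Allocator, slots: u64) ![]u8 {
        var snapshot = blk: {
            self.fork_choice.lock.lockShared();
            defer self.fork_choice.lock.unlockShared();
            // Copies roots, slots, weights and parent links, bounded by `slots`.
            break :blk try self.fork_choice.snapshotUnlocked(allocator, slots);
        };
        defer snapshot.deinit();

        // No forkchoice lock is held past this point, so JSON serialization
        // never blocks block processing (onBlock/onAttestation).
        return try serializeNodeGraph(allocator, snapshot);
    }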

API Server Reliability

  • Per-connection arena allocator prevents memory leaks on request failures (sketched after this list)
  • Made the server stoppable with proper lifecycle management
  • Fixed potential use-after-free when accessing forkchoice pointer
  • Non-blocking accept loop with graceful shutdown
  • SSE lifecycle fixes:
    • heartbeat uses mutexed connection writes
    • broadcaster no longer frees connections on send failure
    • safe shutdown drain for SSE threads
    • removeGlobalConnection for cleanup
  • Rate limiter eviction is safe (no mutation during iteration).
  • The chain can be attached after startup via the server handle.
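
A sketch of the per-connection arena, assuming each accepted connection gets its own std.heap.ArenaAllocator that is torn down in a single deinit even when the handler errors out (serveRequest is a hypothetical name):

    fn handleConnection(self: *ApiServer, conn: std.net.Server.Connection) void {
        defer conn.stream.close();

        // Every allocation made while serving this request comes from the
        // arena, so an early error return cannot leak: deinit frees it all.
        var arena = std.heap.ArenaAllocator.init(self.allocator);
        defer arena.deinit();

        self.serveRequest(arena.allocator(), conn) catch |err| {
            std.log.warn("request failed: {s}", .{@errorName(err)});
        };
    }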

Security & Performance

  • Max slot cap (200) prevents excessive memory/lock time
  • Shared locks minimize contention
  • Lock-free JSON processing on snapshot copy
  • Per‑IP rate limiting + global in‑flight cap for graph endpoint (429 on exceed).
  • SSE connection cap (32) to avoid unbounded threads (503 on exceed).
  • Rate-limit map bounded (256 IPs) with TTL cleanup + cooldown (see the sketch after this list).
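
A sketch of the bounded per-IP limiter, assuming a mutex-guarded hash map keyed by IPv4 address, a fixed entry cap, and TTL eviction that never mutates the map mid-iteration; the constants mirror the caps above, but the names and window values are illustrative:

    const std = @import("std");

    const RateLimiter = struct {
        const MAX_TRACKED_IPS: u32 = 256;
        const WINDOW_SECONDS: i64 = 10; // illustrative window
        const MAX_REQUESTS_PER_WINDOW: u32 = 5; // illustrative budget

        const Entry = struct { count: u32, window_start: i64 };

        mutex: std.Thread.Mutex = .{},
        map: std.AutoHashMap(u32, Entry),

        fn init(allocator: std.mem.Allocator) RateLimiter {
            return .{ .map = std.AutoHashMap(u32, Entry).init(allocator) };
        }

        /// Returns false when the caller should answer 429.
        fn allow(self: *RateLimiter, ip: u32) bool {
            self.mutex.lock();
            defer self.mutex.unlock();

            // Keep the map bounded: refuse new IPs rather than grow past the cap.
            if (!self.map.contains(ip) and self.map.count() >= MAX_TRACKED_IPS)
                return false;

            const now = std.time.timestamp();
            const gop = self.map.getOrPut(ip) catch return false;
            if (!gop.found_existing or now - gop.value_ptr.window_start > WINDOW_SECONDS) {
                gop.value_ptr.* = .{ .count = 1, .window_start = now };
                return true;
            }
            gop.value_ptr.count += 1;
            return gop.value_ptr.count <= MAX_REQUESTS_PER_WINDOW;
        }

        /// TTL cleanup: collect stale keys first, remove after iterating,
        /// so the map is never mutated during iteration.
        fn evictExpired(self: *RateLimiter) void {
            self.mutex.lock();
            defer self.mutex.unlock();

            var stale: [MAX_TRACKED_IPS]u32 = undefined;
            var n: usize = 0;
            const now = std.time.timestamp();
            var it = self.map.iterator();
            while (it.next()) |kv| {
                if (now - kv.value_ptr.window_start > WINDOW_SECONDS and n < stale.len) {
                    stale[n] = kv.key_ptr.*;
                    n += 1;
                }
            }
            for (stale[0..n]) |key| _ = self.map.remove(key);
        }
    };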

Screenshot sample

[Screenshot: Grafana panel showing the fork-choice tree visualization]

NOTES

@chetanyb (Contributor, Author) commented:

@g11tech Current design allows the observability API to briefly hold shared locks. Should we add rate limiting to prevent potential write starvation from excessive requests, or is the max slot cap sufficient?

@g11tech changed the title from "Fork Choice Visualization" to "forkchoice grafana visualization" on Dec 16, 2025

    while (blk: {
        self.sse_mutex.lock();
        defer self.sse_mutex.unlock();
        break :blk self.sse_active != 0;

Member:

Why don't we have sse_active as an atomic value as well?

Contributor (Author):

sse_active enforces a cap (MAX_SSE_CONNECTIONS), so we keep it behind a mutex to make the check+increment atomic and simple. Making it atomic would require a CAS loop (or rollback on overflow) to preserve the cap safely.
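
To illustrate the trade-off, a sketch of the mutex-guarded check-and-increment described above, with the CAS alternative noted in a comment; the surrounding struct and error name are illustrative:

    const MAX_SSE_CONNECTIONS: u32 = 32;

    /// Returns error.TooManyConnections when the cap is hit (answered with 503).
    fn tryAcquireSseSlot(self: *ApiServer) !void {
        self.sse_mutex.lock();
        defer self.sse_mutex.unlock();

        // The check and the increment happen under one lock, so two
        // connections can never race past the check and exceed the cap.
        if (self.sse_active >= MAX_SSE_CONNECTIONS) return error.TooManyConnections;
        self.sse_active += 1;

        // With an atomic counter this would need a CAS loop (load, compare
        // against the cap, cmpxchg, retry) or a fetchAdd with rollback on
        // overflow, which is more code for no practical gain on this path.
    }

    fn releaseSseSlot(self: *ApiServer) void {
        self.sse_mutex.lock();
        defer self.sse_mutex.unlock();
        self.sse_active -= 1;
    }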

    /// Handle finalized checkpoint state endpoint
    /// Serves the finalized checkpoint lean state (BeamState) as SSZ octet-stream at /lean/states/finalized
    fn handleFinalizedCheckpointState(self: *const Self, request: *std.http.Server.Request) !void {
    fn handleFinalizedCheckpointState(self: *ApiServer, request: *std.http.Server.Request) !void {

Member:

Why not Self?

Contributor (Author):

Fixed, thanks for pointing out!

        .node_registry = registry_1,
    });

    if (api_server_handle) |handle| {

Member:

Make (and later work on) an issue to move server creation to after chain creation, so we don't have to set the chain later and can pass it in during service creation.


    // 5 Rebase forkchouce
    if (pruneForkchoice)
    -    try self.forkChoice.rebase(latestFinalized.root, &canonical_view);
    +    try self.forkChoice.rebase(latestFinalized.root, null);

Member:

Why this change?

Contributor (Author):

canonical_view can be stale by the time rebase() runs, so I pass null to force recomputation under rebase’s write lock. Holding the forkchoice lock across processFinalizationAdvancement would avoid recomputation but would block forkchoice updates/reads (onBlock/onAttestation/onInterval/snapshot) while DB/pruning work runs.

Member:

Why would it be stale? Only during syncing would I imagine it changing that rapidly. Also, we need to make processFinalizationAdvancement atomic, so getCanonicalViewAndAnalysis should probably also lock (i.e. add a getCanonicalViewAndAnalysisAndLock fn) and release the lock at the end of processFinalizationAdvancement (via defer).

Contributor (Author):

Yeah, staleness can still happen outside sync: from my understanding, the Rust bridge can run onBlock/onAttestation while this is in flight. If we held the forkchoice write lock across DB/pruning to keep it atomic, we'd block all forkchoice reads and writes (and the lock isn't re-entrant). Passing null just makes rebase() recompute under its own write lock, so it uses the current tree at prune time. If we want stronger guarantees, we can add an atomic analyze+rebase API later as part of a bigger concurrency update.
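
For context, a sketch of what passing null amounts to, assuming rebase recomputes the canonical view itself once it holds its own write lock; getCanonicalViewUnlocked and pruneToUnlocked are hypothetical names standing in for the real internals:

    pub fn rebase(self: *ForkChoice, finalized_root: Root, maybe_view: ?*const CanonicalView) !void {
        self.lock.lock();
        defer self.lock.unlock();

        // A precomputed view may already be stale (onBlock/onAttestation can
        // land between computing it and arriving here). With null we recompute
        // against the current tree, under the same write lock as the prune.
        const view = if (maybe_view) |v| v.* else try self.getCanonicalViewUnlocked();

        try self.pruneToUnlocked(finalized_root, view);
    }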
