
Conversation

@chetanyb (Contributor) commented on Dec 14, 2025

Summary

Adds real-time fork-choice tree visualization via the /api/forkchoice/graph endpoint, along with thread-safety fixes to the forkchoice module. Implements a locked/unlocked pattern for five critical functions and eliminates unsafe direct field access.

Key Features

Visualization API

  • New HTTP endpoint serving Grafana node-graph compatible JSON (response shape sketched after this list)
  • Configurable history via ?slots=N (default: 50, max: 200)
  • Color-coded arc borders representing consensus states:
    • 🟣 Purple: Finalized blocks (canonical chain)
    • 🔵 Blue: Justified checkpoint
    • 🟠 Orange: Current head
    • 🟢 Green: Normal blocks (see TODO)
    • ⚫ Gray: Orphaned blocks (historical forks)
  • Arc completion represents validator weight
  • New endpoint /lean/states/finalized (SSZ checkpoint state).
  • Unified API server for /metrics, /health, /events, /api/forkchoice/graph, /lean/states/finalized.
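
A rough sketch of the response shape, assuming the Grafana node-graph panel's field conventions (arc__* fractions for the ring, a color field for the border); the exact field names and serialization are not shown in this PR, so treat the Zig structs below as illustrative:

    // Illustrative response shape for /api/forkchoice/graph, assuming Grafana
    // node-graph field conventions; not necessarily the exact serialization.
    const GraphNode = struct {
        id: []const u8, // block root (hex)
        title: []const u8, // shortened root for display
        subTitle: []const u8, // e.g. "slot 1234"
        mainStat: []const u8, // finalized / justified / head / normal / orphaned
        arc__weight: f64, // fraction of validator weight backing this block (0..1)
        color: []const u8, // border color encoding the consensus state
    };

    const GraphEdge = struct {
        id: []const u8,
        source: []const u8, // parent block root
        target: []const u8, // child block root
    };

    const GraphResponse = struct {
        nodes: []const GraphNode,
        edges: []const GraphEdge,
    };

A request like GET /api/forkchoice/graph?slots=100 would then return nodes and edges covering up to 100 slots of history.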

Thread Safety

Locking Pattern Implementation

  • Fixed 5 functions missing thread-safety protection:

    • getCanonicalView (READ operations - lockShared)
    • getCanonicalityAnalysis (READ operations - lockShared)
    • getCanonicalAncestorAtDepth (READ operations - lockShared)
    • rebase (WRITE operations - exclusive lock)
    • confirmBlock (WRITE operations - exclusive lock)
  • Implemented a locked/unlocked pattern (see the sketch after this list):

    • Internal *Unlocked versions for efficient internal calls
    • Public wrappers with appropriate locks
    • No nested locking patterns detected
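
A minimal sketch of the pattern, assuming a std.Thread.RwLock guards the store; the field and helper names below are illustrative rather than the exact ones in the forkchoice module:

    const std = @import("std");

    const Root = [32]u8;

    const ForkChoice = struct {
        lock: std.Thread.RwLock = .{},
        head: Root, // protected by `lock`

        /// Public read path: shared lock, so concurrent readers never block
        /// each other, then delegate to the *Unlocked internal.
        pub fn getHead(self: *ForkChoice) Root {
            self.lock.lockShared();
            defer self.lock.unlockShared();
            return self.getHeadUnlocked();
        }

        /// Internal *Unlocked version for callers that already hold the lock
        /// (the lock is not re-entrant, so re-acquiring would deadlock).
        fn getHeadUnlocked(self: *const ForkChoice) Root {
            return self.head;
        }

        /// Public write path: exclusive lock around the *Unlocked internal.
        pub fn confirmBlock(self: *ForkChoice, root: Root) void {
            self.lock.lock();
            defer self.lock.unlock();
            self.confirmBlockUnlocked(root);
        }

        fn confirmBlockUnlocked(self: *ForkChoice, root: Root) void {
            self.head = root; // placeholder for the real confirmation logic
        }
    };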

Eliminated Direct Field Access

  • Replaced unsafe direct access in chain.zig with thread-safe getters:
    • self.forkChoice.head → getHead()
    • self.forkChoice.fcStore → getLatestJustified(), getLatestFinalized()
    • self.forkChoice.protoArray.* → getNodeCount(), getBlockSlot()

Concurrency Benefits

  • Snapshot uses shared lock (allows concurrent reads)
  • Multiple readers don't block each other
  • Visualization doesn't block block processing (see the sketch below)
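
A sketch of how the graph handler can keep lock hold times short: copy a bounded snapshot under the shared lock, release it, and only then build the JSON. snapshotUnlocked and serializeNodeGraph are hypothetical helper names used here for illustration:

    fn buildGraphJson(self: *ApiServer, allocator: std.mem.Allocator, slots: u64) ![]u8 {
        var snapshot = blk: {
            self.fork_choice.lock.lockShared();
            defer self.fork_choice.lock.unlockShared();
            // Copies roots, slots, weights and parent links, bounded by `slots`.
            break :blk try self.fork_choice.snapshotUnlocked(allocator, slots);
        };
        defer snapshot.deinit();

        // No forkchoice lock is held past this point, so JSON serialization
        // never blocks block processing (onBlock/onAttestation).
        return try serializeNodeGraph(allocator, snapshot);
    }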

API Server Reliability

  • Per-connection arena allocator prevents memory leaks on request failures (sketched after this list)
  • Made the server stoppable with proper lifecycle management
  • Fixed potential use-after-free when accessing forkchoice pointer
  • Non-blocking accept loop with graceful shutdown
  • SSE lifecycle fixes:
    • heartbeat uses mutexed connection writes
    • broadcaster no longer frees connections on send failure
    • safe shutdown drain for SSE threads
    • removeGlobalConnection for cleanup
  • Rate limiter eviction is safe (no mutation during iteration).
  • The chain can be attached after startup via the server handle.
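
A sketch of the per-connection arena, assuming each accepted connection gets its own std.heap.ArenaAllocator that is torn down in a single deinit even when the handler errors out (serveRequest is a hypothetical name):

    fn handleConnection(self: *ApiServer, conn: std.net.Server.Connection) void {
        defer conn.stream.close();

        // Every allocation made while serving this request comes from the
        // arena, so an early error return cannot leak: deinit frees it all.
        var arena = std.heap.ArenaAllocator.init(self.allocator);
        defer arena.deinit();

        self.serveRequest(arena.allocator(), conn) catch |err| {
            std.log.warn("request failed: {s}", .{@errorName(err)});
        };
    }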

Security & Performance

  • Max slot cap (200) prevents excessive memory/lock time
  • Shared locks minimize contention
  • Lock-free JSON processing on snapshot copy
  • Per‑IP rate limiting + global in‑flight cap for graph endpoint (429 on exceed).
  • SSE connection cap (32) to avoid unbounded threads (503 on exceed).
  • Rate-limit map bounded (256 IPs) with TTL cleanup + cooldown (see the sketch after this list).
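
A sketch of the bounded per-IP limiter, assuming a mutex-guarded hash map keyed by IPv4 address, a fixed entry cap, and TTL eviction that never mutates the map mid-iteration; the constants mirror the caps above, but the names and window values are illustrative:

    const std = @import("std");

    const RateLimiter = struct {
        const MAX_TRACKED_IPS: u32 = 256;
        const WINDOW_SECONDS: i64 = 10; // illustrative window
        const MAX_REQUESTS_PER_WINDOW: u32 = 5; // illustrative budget

        const Entry = struct { count: u32, window_start: i64 };

        mutex: std.Thread.Mutex = .{},
        map: std.AutoHashMap(u32, Entry),

        fn init(allocator: std.mem.Allocator) RateLimiter {
            return .{ .map = std.AutoHashMap(u32, Entry).init(allocator) };
        }

        /// Returns false when the caller should answer 429.
        fn allow(self: *RateLimiter, ip: u32) bool {
            self.mutex.lock();
            defer self.mutex.unlock();

            // Keep the map bounded: refuse new IPs rather than grow past the cap.
            if (!self.map.contains(ip) and self.map.count() >= MAX_TRACKED_IPS)
                return false;

            const now = std.time.timestamp();
            const gop = self.map.getOrPut(ip) catch return false;
            if (!gop.found_existing or now - gop.value_ptr.window_start > WINDOW_SECONDS) {
                gop.value_ptr.* = .{ .count = 1, .window_start = now };
                return true;
            }
            gop.value_ptr.count += 1;
            return gop.value_ptr.count <= MAX_REQUESTS_PER_WINDOW;
        }

        /// TTL cleanup: collect stale keys first, remove after iterating,
        /// so the map is never mutated during iteration.
        fn evictExpired(self: *RateLimiter) void {
            self.mutex.lock();
            defer self.mutex.unlock();

            var stale: [MAX_TRACKED_IPS]u32 = undefined;
            var n: usize = 0;
            const now = std.time.timestamp();
            var it = self.map.iterator();
            while (it.next()) |kv| {
                if (now - kv.value_ptr.window_start > WINDOW_SECONDS and n < stale.len) {
                    stale[n] = kv.key_ptr.*;
                    n += 1;
                }
            }
            for (stale[0..n]) |key| _ = self.map.remove(key);
        }
    };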

Screenshot sample

[Screenshot: Grafana panel showing the fork-choice tree visualization]

NOTES

@chetanyb (Contributor, Author) commented:

@g11tech Current design allows the observability API to briefly hold shared locks. Should we add rate limiting to prevent potential write starvation from excessive requests, or is the max slot cap sufficient?

@g11tech changed the title from "Fork Choice Visualization" to "forkchoice grafana visualization" on Dec 16, 2025

    while (blk: {
        self.sse_mutex.lock();
        defer self.sse_mutex.unlock();
        break :blk self.sse_active != 0;

Member:

Why don't we have sse_active as an atomic value as well?

Contributor (Author):

sse_active enforces a cap (MAX_SSE_CONNECTIONS), so we keep it behind a mutex to make the check+increment atomic and simple. Making it atomic would require a CAS loop (or rollback on overflow) to preserve the cap safely.
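
To illustrate the trade-off, a sketch of the mutex-guarded check-and-increment described above, with the CAS alternative noted in a comment; the surrounding struct and error name are illustrative:

    const MAX_SSE_CONNECTIONS: u32 = 32;

    /// Returns error.TooManyConnections when the cap is hit (answered with 503).
    fn tryAcquireSseSlot(self: *ApiServer) !void {
        self.sse_mutex.lock();
        defer self.sse_mutex.unlock();

        // The check and the increment happen under one lock, so two
        // connections can never race past the check and exceed the cap.
        if (self.sse_active >= MAX_SSE_CONNECTIONS) return error.TooManyConnections;
        self.sse_active += 1;

        // With an atomic counter this would need a CAS loop (load, compare
        // against the cap, cmpxchg, retry) or a fetchAdd with rollback on
        // overflow, which is more code for no practical gain on this path.
    }

    fn releaseSseSlot(self: *ApiServer) void {
        self.sse_mutex.lock();
        defer self.sse_mutex.unlock();
        self.sse_active -= 1;
    }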

    /// Handle finalized checkpoint state endpoint
    /// Serves the finalized checkpoint lean state (BeamState) as SSZ octet-stream at /lean/states/finalized
    fn handleFinalizedCheckpointState(self: *const Self, request: *std.http.Server.Request) !void {
    fn handleFinalizedCheckpointState(self: *ApiServer, request: *std.http.Server.Request) !void {

Member:

Why not Self?

Contributor (Author):

Fixed, thanks for pointing out!

        .node_registry = registry_1,
    });

    if (api_server_handle) |handle| {

Member:

Make (and later work on) an issue to move server creation to after chain creation, so we don't have to set the chain later and can pass it in during service creation.


    // 5 Rebase forkchouce
    if (pruneForkchoice)
    -    try self.forkChoice.rebase(latestFinalized.root, &canonical_view);
    +    try self.forkChoice.rebase(latestFinalized.root, null);

Member:

Why this change?

Contributor (Author):

canonical_view can be stale by the time rebase() runs, so I pass null to force recomputation under rebase’s write lock. Holding the forkchoice lock across processFinalizationAdvancement would avoid recomputation but would block forkchoice updates/reads (onBlock/onAttestation/onInterval/snapshot) while DB/pruning work runs.

Member:

Why would it be stale? Only during syncing would I imagine it changing that rapidly. Also, we need to make processFinalizationAdvancement atomic, so getCanonicalViewAndAnalysis should probably also lock (i.e. add a getCanonicalViewAndAnalysisAndLock fn) and release the lock at the end of processFinalizationAdvancement (via defer).

Contributor (Author):

Yeah, staleness can still happen outside sync: from my understanding, the Rust bridge can run onBlock/onAttestation while this is in flight. If we held the forkchoice write lock across DB/pruning to keep it atomic, we'd block all forkchoice reads and writes (and the lock isn't re-entrant). Passing null just makes rebase() recompute under its own write lock, so it uses the current tree at prune time. If we want stronger guarantees, we can add an atomic analyze+rebase API later as part of a bigger concurrency update.
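
For context, a sketch of what passing null amounts to, assuming rebase recomputes the canonical view itself once it holds its own write lock; getCanonicalViewUnlocked and pruneToUnlocked are hypothetical names standing in for the real internals:

    pub fn rebase(self: *ForkChoice, finalized_root: Root, maybe_view: ?*const CanonicalView) !void {
        self.lock.lock();
        defer self.lock.unlock();

        // A precomputed view may already be stale (onBlock/onAttestation can
        // land between computing it and arriving here). With null we recompute
        // against the current tree, under the same write lock as the prune.
        const view = if (maybe_view) |v| v.* else try self.getCanonicalViewUnlocked();

        try self.pruneToUnlocked(finalized_root, view);
    }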
