feat: route topology handoffs via Nexus IPC (#165) #255

Merged
windoliver merged 7 commits into main from feat/165-ipc-handoff-routing
Apr 15, 2026

Summary

  • Route topology-driven handoffs through Nexus IPC instead of fire-and-forget EventBus
  • Map IPC delivery lifecycle (pending → delivered → processed → replied / dead_lettered / expired) into Grove-visible handoff state
  • New NexusIpcClient shared abstraction eliminates duplicate IPC send paths
  • New MCP tools: grove_list_dead_letters, grove_ack_handoff
  • 179 new/updated tests, TUI e2e verified via tmux capture-pane with real coder-reviewer loop

What changed

Core refactor:

  • EventBus.publish() → async, returns PublishResult with IPC message ID
  • TopologyRouter.route() → async, returns RouteResult[], parallel sends via Promise.all
  • NexusEventBus delegates to NexusIpcClient; LocalEventBus returns {ok: true}
  • NexusWsBridge.send() uses NexusIpcClient when injected (DRY)
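
The async publish and parallel routing described above could look roughly like the sketch below. The interface fields other than `ok` and the message ID, the `targetRole` parameter, and the free-standing `route` function are assumptions made for illustration; they are not copied from the Grove source.

```typescript
// Hypothetical result shapes; only `ok` and the IPC message ID are named in the PR.
interface PublishResult {
  ok: boolean;
  messageId?: string; // IPC message ID when the send went through Nexus
  error?: string;
}

interface RouteResult extends PublishResult {
  targetRole: string;
}

interface EventBus {
  publish(targetRole: string, payload: unknown): Promise<PublishResult>;
}

// Stand-in for LocalEventBus: no IPC is involved, so it only reports success.
class LocalEventBus implements EventBus {
  async publish(_targetRole: string, _payload: unknown): Promise<PublishResult> {
    return { ok: true };
  }
}

// Send to every target concurrently and collect one result per target.
async function route(
  bus: EventBus,
  targetRoles: string[],
  payload: unknown,
): Promise<RouteResult[]> {
  return Promise.all(
    targetRoles.map(async (targetRole) => {
      try {
        const result = await bus.publish(targetRole, payload);
        return { targetRole, ...result };
      } catch (err) {
        // Turn a thrown error into a failed result so one target
        // cannot reject the whole batch.
        return { targetRole, ok: false, error: String(err) };
      }
    }),
  );
}
```

Catching inside each mapped call is a design choice of this sketch: `Promise.all` rejects as soon as any promise rejects, so per-target error capture keeps the results for the other targets available.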

Handoff state machine:

  • Extended HandoffStatus: added processed, dead_lettered
  • canTransition(from, to) enforces valid state transitions
  • ipcMessageId field on Handoff for IPC traceability
  • markProcessed(), markDeadLettered(), setIpcMessageId() on all 3 store impls

IPC lifecycle wiring:

  • SSE message_delivered → handoffStore.markDelivered() in NexusWsBridge
  • IPC delivery rejection → markDeadLettered() in contribute.ts routing
  • Infrastructure errors (404, connection refused) → cached, NOT dead-lettered
  • grove_ack_handoff tool for agent delivered → processed acknowledgment

Cleanup:

  • Replaced 6 inline appendFileSync debug blocks with debugLog()
  • Renamed casUpdate → readModifyWrite (honest naming, since CAS is broken)
  • Added in-memory handoff cache with 5s TTL + write invalidation
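
For readers unfamiliar with the pattern, a TTL cache with write invalidation can be sketched as follows. The class name `TtlCache`, its methods, and the injectable clock are hypothetical; the only details taken from this PR are the 5s TTL and the fact that writes invalidate cached entries.

```typescript
// Minimal TTL cache sketch; not the NexusHandoffStore implementation.
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private ttlMs: number;
  private now: () => number;

  // The clock is injectable so expiry can be tested without waiting.
  constructor(ttlMs: number = 5000, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      // Entry outlived its TTL: drop it and report a miss.
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  // Call after any write to the backing store so readers refetch.
  invalidate(key: string): void {
    this.entries.delete(key);
  }
}
```

A store would typically call `invalidate(id)` right after a successful write, so the next read goes back to the backing store instead of serving a value that is up to 5 seconds old.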

Test plan

  • 179 tests passing (unit + conformance + integration)
  • HandoffStore conformance suite runs against InMemory + Nexus implementations
  • State machine: exhaustive canTransition() tests for all 6×6 transitions
  • NexusIpcClient: endpoint caching, infrastructure error detection
  • Integration: contribute → handoff → IPC route → status update → dead-letter path
  • TUI e2e via tmux capture-pane: full coder-reviewer loop, both handoffs show 📬 delivered, zero dead_lettered
  • TypeScript compiles clean, existing 4104 tests unaffected

Closes #165

…o Grove (#165)

Unify the EventBus and Nexus IPC delivery paths so topology-driven
handoffs flow through Nexus IPC with traceable delivery state.

Architecture:
- EventBus.publish() is now async, returning PublishResult with IPC
  message ID (NexusEventBus uses NexusIpcClient; LocalEventBus returns
  {ok: true} synchronously)
- TopologyRouter.route() is async, returns RouteResult[] with per-target
  message IDs, sends all targets in parallel via Promise.all
- NexusIpcClient extracts the shared POST /api/v2/ipc/send logic from
  NexusEventBus and NexusWsBridge (DRY)

Handoff state machine:
- Add processed and dead_lettered to HandoffStatus enum
- Add canTransition(from, to) state machine with exhaustive tests
- Add ipcMessageId field to Handoff interface
- Add markProcessed(), markDeadLettered(), setIpcMessageId() to stores
- Happy path: pending_pickup → delivered → processed → replied
- Failure: pending_pickup → dead_lettered, delivered → expired
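
A table-driven `canTransition` along these lines is one way to express the paths listed above. The table below encodes only the edges named in this commit message (the happy path plus the two failure edges); the real table may well allow more, for example additional expiry edges, so treat it as an assumption.

```typescript
type HandoffStatus =
  | "pending_pickup"
  | "delivered"
  | "processed"
  | "replied"
  | "dead_lettered"
  | "expired";

// Assumed transition table: only the edges spelled out in the commit message.
const ALLOWED: Record<HandoffStatus, readonly HandoffStatus[]> = {
  pending_pickup: ["delivered", "dead_lettered"],
  delivered: ["processed", "expired"],
  processed: ["replied"],
  replied: [], // terminal
  dead_lettered: [], // terminal
  expired: [], // terminal
};

function canTransition(from: HandoffStatus, to: HandoffStatus): boolean {
  return ALLOWED[from].includes(to);
}
```

Because the table is keyed by `Record<HandoffStatus, ...>`, adding a new status without listing its outgoing edges becomes a compile error, which pairs well with exhaustive from/to tests.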

Storage:
- NexusHandoffStore gains in-memory cache with 5s TTL, invalidated on
  writes and available for SSE event invalidation
- Rename casUpdate → readModifyWrite (honest naming — CAS is broken)
- IPC message IDs linked back to handoffs after routing (best-effort)

MCP tools:
- grove_list_handoffs description updated for IPC state awareness
- Status enum extended with processed and dead_lettered
- New grove_list_dead_letters tool for DLQ visibility

Cleanup:
- Replace 6 inline appendFileSync debug blocks in NexusWsBridge with
  shared debugLog() from tui/debug-log.ts

Tests: 166 new/updated tests across 9 test files, all passing.

…g, agent ack (#165)

Wire the remaining functional gaps in the IPC handoff lifecycle:

SSE → handoff status updates (Gap 1):
- NexusWsBridge.handleEvent now calls handoffStore.markDelivered()
  when a message_delivered SSE event arrives, matching by ipcMessageId
- Bridge accepts handoffStore and ipcClient via options

Dead-lettering on IPC failure (Gap 2):
- contribute.ts routing block checks RouteResult.ok — when false,
  marks the handoff as dead_lettered with stderr warning
- RouteResult now includes ok + error fields from PublishResult

Agent acknowledgment (Gap 3):
- New grove_ack_handoff MCP tool transitions delivered → processed
- Uses canTransition() to enforce state machine rules
- Returns previous and new status for visibility
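
The acknowledgment step can be pictured with the sketch below. It is not the MCP tool handler: the `ackHandoff` function, the `AckResult` shape, and the use of a plain `Map` as the store are hypothetical, and the inline status check stands in for the real `canTransition()` call.

```typescript
type HandoffStatus =
  | "pending_pickup"
  | "delivered"
  | "processed"
  | "replied"
  | "dead_lettered"
  | "expired";

// Hypothetical result shape: reports both statuses for visibility.
interface AckResult {
  ok: boolean;
  previousStatus: HandoffStatus;
  newStatus: HandoffStatus;
}

function ackHandoff(store: Map<string, HandoffStatus>, handoffId: string): AckResult {
  const previous = store.get(handoffId);
  if (previous === undefined) {
    throw new Error(`unknown handoff: ${handoffId}`);
  }
  // Stand-in for canTransition(previous, "processed"): in this sketch
  // only a delivered handoff can be acknowledged.
  if (previous !== "delivered") {
    return { ok: false, previousStatus: previous, newStatus: previous };
  }
  store.set(handoffId, "processed");
  return { ok: true, previousStatus: previous, newStatus: "processed" };
}
```

Returning the unchanged status on a rejected acknowledgment lets the calling agent see why nothing happened, for instance when the handoff was already processed.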

NexusWsBridge uses NexusIpcClient (Gap 4):
- send() delegates to ipcClient when injected, falls back to
  inline fetch for backward compat

Integration test (Gap 5):
- 10 tests covering: contribute → handoff → IPC → status updates,
  dead-letter path, state machine transitions, multi-target parallel
  routing, LocalEventBus vs NexusEventBus behavior

When the Nexus IPC endpoint (/api/v2/ipc/send) returns 404 or is
unreachable, that's an infrastructure issue (endpoint doesn't exist on
this Nexus version), not a delivery rejection. Handoffs should stay in
their current status and fall back to the session orchestrator's polling
path — not be dead-lettered.

Changes:
- IpcSendResult gains infrastructureError flag, set on 404/405/502/503
  and connection errors
- NexusIpcClient caches endpoint unavailability after first failure to
  avoid repeated failed fetches on every contribution
- contribute.ts routing block skips dead-letter when infrastructureError
  is true — only dead-letters on actual delivery rejections
- PublishResult and RouteResult propagate the flag through the chain
- NexusWsBridge accepts handoffStore + ipcClient for SSE delivery
  tracking and DRY send path
- 3 new integration tests: infra error skip, delivery rejection,
  endpoint caching
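
The distinction between infrastructure errors and delivery rejections, as of this commit, can be sketched like this. The function names and the `RoutingAction` type are hypothetical; the status codes and the `infrastructureError` flag come from the Changes list above.

```typescript
// Subset of RouteResult relevant to the decision; other fields omitted.
interface RouteResult {
  ok: boolean;
  error?: string;
  infrastructureError?: boolean;
}

// Hypothetical label for what the routing block does with a result.
type RoutingAction = "none" | "dead_letter" | "fallback_to_polling";

// Status codes this commit treats as "endpoint unavailable".
const INFRASTRUCTURE_STATUSES = [404, 405, 502, 503];

function isInfrastructureStatus(status: number): boolean {
  return INFRASTRUCTURE_STATUSES.includes(status);
}

function decideRoutingAction(result: RouteResult): RoutingAction {
  if (result.ok) return "none";
  // Endpoint missing or unreachable: leave the handoff status alone and
  // let the session orchestrator's polling path pick it up.
  if (result.infrastructureError) return "fallback_to_polling";
  // Anything else is an actual delivery rejection.
  return "dead_letter";
}
```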

This fixes the false dead_lettered status seen in TUI e2e when Nexus
VFS is available but the IPC endpoint is not.

…t writes, SSE race, silent errors

Round 1 fixes from Codex adversarial review:

1. Transient IPC outage no longer permanently cached (HIGH)
   - Only 404/405 permanently disable endpoint (doesn't exist)
   - 502/503/network errors use 30s backoff, then retry
   - Prevents IPC blackhole after brief Nexus restart

2. Concurrent handoff status writes validated (HIGH)
   - NexusHandoffStore.transitionHandoff() checks canTransition()
     inside readModifyWrite, rejects stale transitions
   - Prevents concurrent SSE delivery + ack from clobbering state

3. SSE delivery race with ipcMessageId write (HIGH)
   - updateHandoffDeliveryStatus falls back to matching by
     (toRole, pending/delivered, most recent) when ipcMessageId
     hasn't been written yet by the fire-and-forget routing block

4. Silent IPC bookkeeping errors now logged (MEDIUM)
   - contribute.ts catch block logs handoff ID, target role, and
     error via console.warn instead of silently discarding
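
Fix 1 above (permanent disable on 404/405, 30s backoff for transient failures) might be modeled as in the sketch below. The `EndpointHealth` class, its method names, and the injectable clock are hypothetical and only illustrate the policy; they are not taken from NexusIpcClient.

```typescript
const PERMANENT_STATUSES = [404, 405]; // endpoint does not exist on this Nexus
const BACKOFF_MS = 30000; // transient failures: retry after 30s

class EndpointHealth {
  private permanentlyDisabled = false;
  private retryAt = 0;
  private now: () => number;

  // The clock is injectable so the backoff can be tested without waiting.
  constructor(now: () => number = Date.now) {
    this.now = now;
  }

  canSend(): boolean {
    return !this.permanentlyDisabled && this.now() >= this.retryAt;
  }

  // `status` is the HTTP status, or undefined for a network error.
  recordFailure(status?: number): void {
    if (status !== undefined && PERMANENT_STATUSES.includes(status)) {
      this.permanentlyDisabled = true;
    } else {
      this.retryAt = this.now() + BACKOFF_MS;
    }
  }
}
```

Separating the two cases is what prevents the IPC blackhole described above: a brief Nexus restart only pauses sends for the backoff window instead of disabling them for the life of the process.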

…SE matching, expiry scope

1. All non-2xx IPC responses are infrastructureError (HIGH)
   - Only explicit delivery rejections (future: 2xx with reject body) dead-letter
   - 429/500/401/403 are retryable, not permanent failures

2. No-op transitions skip write entirely (HIGH)
   - transitionHandoff reads, validates canTransition, writes only on valid change
   - Prevents stale snapshot from clobbering concurrent updates

3. SSE fallback constrained by sender + unlinked filter (HIGH)
   - Matches by fromRole===sender and !ipcMessageId to avoid cross-matching
   - Prevents wrong handoff correlation under concurrent delivery

4. Transient backoff cleared on success (MEDIUM)
   - A successful send proves endpoint is healthy, clears the 30s backoff
   - Prevents mixed-outcome batches from causing process-wide outage

5. expireStale covers delivered + processed states (MEDIUM)
   - All three store implementations updated: pending_pickup, delivered,
     processed are now expirable per the state machine
   - Conformance test updated to match
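
The constrained SSE fallback from item 3 can be illustrated as below: match by ipcMessageId first, and only otherwise fall back to the most recent unlinked handoff between the same sender and recipient. The `Handoff` fields `createdAt` and the event field names (`messageId`, `sender`, `recipient`) are assumptions for this sketch, not the actual Grove or Nexus shapes.

```typescript
interface Handoff {
  id: string;
  fromRole: string;
  toRole: string;
  status: string;
  ipcMessageId?: string;
  createdAt: number; // hypothetical timestamp used to pick the most recent
}

// Hypothetical shape of a message_delivered event.
interface DeliveredEvent {
  messageId: string;
  sender: string;
  recipient: string;
}

function matchHandoff(handoffs: Handoff[], event: DeliveredEvent): Handoff | undefined {
  // Preferred path: the routing block already linked the IPC message ID.
  const exact = handoffs.find((h) => h.ipcMessageId === event.messageId);
  if (exact) return exact;

  // Fallback: same sender and recipient, not yet linked to any message,
  // and still in a state where a delivery event makes sense.
  const candidates = handoffs.filter(
    (h) =>
      h.fromRole === event.sender &&
      h.toRole === event.recipient &&
      !h.ipcMessageId &&
      (h.status === "pending_pickup" || h.status === "delivered"),
  );
  // Most recent candidate first; undefined when nothing qualifies.
  return candidates.sort((a, b) => b.createdAt - a.createdAt)[0];
}
```

The `!h.ipcMessageId` filter is what keeps the fallback from stealing a handoff that is already tied to a different message.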

windoliver merged commit 07b6e29 into main on Apr 15, 2026
1 check passed
