Skip to content

fix(tui,mcp,acpx): production-ready coder→reviewer loop via Nexus IPC#239

Merged
windoliver merged 5 commits intomainfrom
worktree-ancient-nibbling-anchor
Apr 11, 2026
Merged

fix(tui,mcp,acpx): production-ready coder→reviewer loop via Nexus IPC#239
windoliver merged 5 commits intomainfrom
worktree-ancient-nibbling-anchor

Conversation

@windoliver
Copy link
Copy Markdown
Owner

Summary

End-to-end fix for the full review-loop path: Grove TUI → acpx(claude) → grove MCP stdio → Nexus contribution store → TopologyRouter handoff → reviewer IPC push → grove_submit_review → reviewer→coder handoff.

Validated locally: both coder→reviewer and reviewer→coder handoffs delivered in the TUI Handoffs panel, the hello.txt loop completes via native mcp__grove__grove_cas_put / grove_submit_work / grove_submit_review tool calls — no curl fallback, no SQLite split-brain.

Root causes fixed

  1. acpx DEFAULT_AGENT hardcoded to "codex" — every role spawned as codex regardless of the launch-preview CLI mapping. AcpxRuntime.spawn() now derives the agent from AgentConfig.command.
  2. Grove wrote .mcp.json but acpx reads .acpxrc.json — totally incompatible schemas. spawn-manager now writes both files per spawn.
  3. acpx 0.3.1 had a broken claude-agent-acp adaptersession/new failed with "Query closed before response received". Requires >= 0.5.3 now; startup check surfaces an upgrade hint.
  4. grove MCP server threw on Nexus-mode startupcreateLocalRuntime({parseContract: true}) hit "Session has no stored config" because the session record lives in Nexus VFS, not local SQLite. serve.ts now skips the local parse in nexus mode and loads the contract from NexusSessionStore.getSession.
  5. Nexus session record had no frozen contracttoSessionResponse() stripped the config field. Response and mapApiSession now preserve it so the mirrored Nexus record carries the full GroveContract for real enforcement.
  6. serve-http.ts was SQLite-only — the HTTP MCP transport ignored GROVE_NEXUS_URL and bypassed handoff routing. Now mirrors serve.ts: Nexus stores + NexusHandoffStore + TopologyRouter when GROVE_NEXUS_URL is set. Reads GROVE_SESSION_ID from .grove/current-session.json (TUI writes this file on setSessionScope).
  7. spawn-manager didn't pass GROVE_NEXUS_URL to agent MCP servers — when the TUI was launched without the env var (common — nexusUrl lives in grove.json), the spawned grove MCP subprocess fell back to local SQLite and the TUI never saw contributions in the Nexus feed. Added a grove.json fallback lookup.
  8. nexus-lifecycle GROVE_NEXUS_URL priority — env var was last in the candidateUrls list so a healthy-looking but broken Nexus would always win. Moved to first position.
  9. tmux-runtime wrong socketlistSessions and capture-pane hit the default tmux server while spawn uses -L grove. They were talking to different daemons; grove's own sessions were invisible to its listSessions.
  10. codex mcp registration blocked every spawn ~3scodex mcp remove + codex mcp add were synchronous in writeMcpConfig. Moved to fire-and-forget (claude reads .acpxrc.json directly; codex reads its config lazily).

Other fixes

  • ApiSessionResponse now carries config through mapApiSession.
  • tests/presets/preset-e2e-nexus.test.tsrpc() sends NEXUS_API_KEY auth header when set and surfaces non-2xx bodies. Dropped the broken swarm-ops contribution assertion (the preset enforces a has_relation gate that a bare CLI contribute can't satisfy; seed data was removed in 6da5494 without updating this expectation).
  • src/tui/spawn-manager.test.tssetDefaultTimeout(30_000) for tests that do real git worktree I/O.
  • src/core/acpx-runtime.test.ts — regex now matches the new session-name format (grove-<role>-<counter>-<timestamp>), timeout bumped to 30s for cold npx fetches.

Validation

  • bun x tsc --noEmit — clean.
  • bun x biome check — clean on changed files (4 pre-existing warnings untouched).
  • bun test4878 pass, 0 fail (was 4873 pass / 5 fail before).
  • Live E2E: grove up → new session → review-loop → "write hello world in hello.txt" → claude coder calls mcp__grove__grove_cas_put + grove_submit_workroutedTo:["reviewer"] + handoffIds → IPC push → claude reviewer calls grove_submit_review + grove_done → TUI Handoffs panel shows 2 total, 0 pending, both delivered.

Requires

acpx >= 0.5.3npm install -g acpx@latest. The runtime throws a clear upgrade hint if older.

Test plan

  • Fresh grove up on a worktree with grove.json that has nexusUrl — verify the feed shows a coder contribution and Handoffs panel shows both coder→reviewer and reviewer→coder deliveries.
  • acpx --version < 0.5.3 — verify the runtime surfaces the upgrade hint and refuses to spawn.
  • NEXUS_URL=http://... NEXUS_API_KEY=... bun test tests/presets/preset-e2e-nexus.test.ts — all 14 pass.
  • bun test on main — 4878 pass / 0 fail.

End-to-end fix for the full review-loop path: Grove TUI → acpx(claude) →
grove MCP stdio → Nexus contribution store → TopologyRouter handoff →
reviewer IPC push → grove_submit_review → reviewer→coder handoff.

Validated locally: both coder→reviewer and reviewer→coder handoffs
delivered in the TUI Handoffs panel, hello.txt loop completes via native
`mcp__grove__grove_cas_put` / `grove_submit_work` / `grove_submit_review`
tool calls — no curl fallback, no SQLite split-brain.

Root causes fixed:

1. acpx DEFAULT_AGENT hardcoded to "codex"
   src/tui/main.ts + src/core/acpx-runtime.ts
   Every role was spawned as codex regardless of the launch-preview CLI
   mapping. Now AcpxRuntime.spawn() extracts the agent (claude/codex/
   gemini/pi/openclaw) from AgentConfig.command and passes it through to
   `acpx <agent>`. main.ts drops the hardcoded override.

2. Grove wrote .mcp.json but acpx reads .acpxrc.json
   src/tui/spawn-manager.ts
   acpx (even in v0.5.3) does not read claude-code's .mcp.json format;
   it expects an .acpxrc.json with mcpServers as an array and env as
   [{name,value}]. spawn-manager now writes both files on every spawn
   and protects .acpxrc.json from agent mutation.

3. acpx 0.3.1 had a broken claude-agent-acp adapter
   src/core/acpx-runtime.ts
   0.3.x pulled @zed-industries/claude-agent-acp@^0.21.0 which failed
   session/new with "Query closed before response received". 0.5.3
   switched to @agentclientprotocol/claude-agent-acp@^0.25.0. Added a
   startup version check that surfaces a clear upgrade hint.

4. grove MCP server threw on Nexus-mode startup
   src/mcp/serve.ts
   createLocalRuntime({parseContract: true}) tried to load the session
   config from local SQLite, which doesn't have the record in Nexus
   mode — throwing "Session has no stored config" and killing the MCP
   server. serve.ts now skips the local parse when GROVE_NEXUS_URL is
   set and reconstructs the contract from NexusSessionStore.getSession.

5. Nexus session record had no frozen contract
   src/server/routes/sessions.ts + src/tui/provider-shared.ts +
   src/mcp/serve.ts
   toSessionResponse() stripped the `config` field before returning, so
   the mirrored Nexus session only had `topology`. serve.ts now loads
   the authoritative GroveContract from the session record, giving real
   rate-limit enforcement in nexus mode instead of a minimal stub.

6. serve-http.ts was SQLite-only
   src/mcp/serve-http.ts
   The HTTP MCP transport ignored GROVE_NEXUS_URL entirely and wrote
   every contribution to the local grove.db. Bypassed handoff routing
   for any caller hitting :4015/mcp. Now mirrors serve.ts: Nexus stores
   + NexusHandoffStore + TopologyRouter when GROVE_NEXUS_URL is set.
   Reads GROVE_SESSION_ID from .grove/current-session.json when the
   env var is unavailable (serve-http.ts is spawned before sessions
   are created).

7. TUI didn't write current-session.json
   src/tui/screens/screen-manager.tsx
   Added a best-effort write on setSessionScope so serve-http.ts can
   pick up the session ID without restarting.

8. spawn-manager didn't pass GROVE_NEXUS_URL to agent MCP servers
   src/tui/spawn-manager.ts
   writeMcpConfig built mcpEnv from process.env but the TUI was often
   launched without GROVE_NEXUS_URL in env (nexusUrl lives in
   grove.json). Added a fallback that reads this.groveDir/grove.json
   when the env var is missing. Without this, the spawned grove MCP
   subprocess fell back to local SQLite and the TUI never saw
   contributions in the Nexus feed.

9. nexus-lifecycle GROVE_NEXUS_URL priority
   src/cli/nexus-lifecycle.ts
   Env var was last in the candidateUrls list so a healthy-looking but
   broken Nexus would always win. Moved to first position.

10. tmux-runtime wrong socket
    src/core/tmux-runtime.ts
    listSessions() and capture-pane hit the default tmux server; spawn
    uses `-L grove`. They were talking to different daemons — sessions
    created by grove were invisible to grove's own listSessions. Added
    `-L grove` to both commands.

11. codex mcp registration blocked every spawn ~3s
    src/tui/spawn-manager.ts
    `codex mcp remove` + `codex mcp add` in writeMcpConfig were synchronous
    and added ~3s per spawn, blocking the spawn chain on machines where
    codex is installed. Moved to fire-and-forget since the claude path
    reads .acpxrc.json directly and codex reads its config lazily.

Other fixes:
- src/tui/provider-shared.ts: ApiSessionResponse now includes `config`
- tests/presets/preset-e2e-nexus.test.ts: rpc() sends NEXUS_API_KEY
  auth header when set, surfaces non-2xx bodies. Dropped broken
  swarm-ops contribution assertion (preset enforces a has_relation
  gate that a bare CLI contribute can't satisfy; seed data was removed
  in 6da5494 without updating this expectation).
- src/tui/spawn-manager.test.ts: setDefaultTimeout(30_000) for spawn
  tests that do real git worktree I/O.
- src/core/acpx-runtime.test.ts: regex matches the new session-name
  format (grove-<role>-<counter>-<timestamp>), timeout bumped to 30s.

Test coverage: `bun test` → 4878 pass, 0 fail (was 4873 / 5 fail).
`bun x tsc --noEmit` clean. `bun x biome check` clean on changed files.

Requires: acpx >= 0.5.3 (`npm install -g acpx@latest`).
Ran 8 rounds of adversarial review against the previous commit's session
bootstrap / contract loading / spawn ordering paths. Each round surfaced
real correctness gaps; this commit lands the fixes for all of them.

Hardening summary by file:

src/mcp/serve.ts
- Fail closed on unreachable Nexus at startup instead of silently
  downgrading to local stores (would otherwise split-brain writes with
  the TUI).
- Retry Nexus session lookup with exponential backoff before declaring
  the record missing, to absorb the TUI → Nexus mirror race.
- Accept topology-only session records (legacy / grove_create_session
  tool) with a loud WARN by DEFAULT; opt into strict `config`-only mode
  via GROVE_MCP_STRICT_CONTRACT=1. Prevents backcompat breakage while
  giving ops a knob to tighten enforcement.
- Exit the process when a session ID is set but Nexus returns no record
  at all (vs a topology-only record); the old code path kept writing to
  orphan VFS paths.

src/mcp/serve-http.ts
- Lazily resolve session-scoped deps per incoming HTTP request instead
  of baking sessionId into stores at process boot. serve-http.ts is
  started before the interactive session exists, so the previous static
  scoping was always wrong for fresh clients.
- Invalidate MCP sessions whose bound grove sessionId no longer matches
  the current one so a grove session switch doesn't leave long-lived
  HTTP clients writing under the prior scope.
- Reject new mutating MCP initialization in Nexus bootstrap mode (no
  grove session yet) with 503 SESSION_NOT_READY. Bootstrap-scoped
  McpServers would capture unscoped stores for their entire lifetime.
- Differentiate "file missing" (valid pre-session state) from "file
  unreadable/corrupt" via SessionStateReadError, and only invalidate
  live sessions after a SUCCESSFUL read. Atomic writes on the TUI side
  (tmp + rename) and a read-error-tolerant handler keep concurrent
  session switches from tearing down in-flight requests.
- Update the current-session.json mtime cache only AFTER a successful
  parse; a torn-write read no longer poisons the cache.
- Mirror the strict/weak contract policy from serve.ts.

src/tui/spawn-manager.ts
- Serialize codex MCP registration through a per-session promise chain
  on the SpawnManager instance. Parallel role spawns in the same session
  previously raced remove+add on the single global "grove" entry.
- Use stable "grove" MCP name (not per-session) to avoid leaking stale
  entries and persisted secrets in ~/.codex/config.toml over time.
  Sweep any legacy `grove-*` entries on first spawn.
- Replace execSync shell strings with spawnSync argv arrays in the
  codex mcp add/remove path so paths with spaces / shell metacharacters
  can't break the command.
- Only cache successful registrations; a failed first attempt clears
  the cache so the next spawn retries.
- Probe `codex --version` once and skip registration entirely when
  codex is not installed (claude path reads .acpxrc.json directly).
- writeMcpConfig swallows codex-registration errors softly so claude
  roles aren't blocked, but spawn() re-calls ensureCodexMcpRegistered
  with the real error propagated when the role is specifically codex.

src/tui/nexus-provider.ts
- Expose `mode = "nexus" as const` as a public discriminator so callers
  can distinguish Nexus from local providers without peeking at
  protected fields.
- Await the Nexus session mirror in createSession and retry a few
  times; on final failure, archive the just-created local/server
  session so repeated retries don't accumulate orphan active records.
  Throws so the TUI can surface the error instead of returning a
  broken session id.

src/tui/screens/screen-manager.tsx
- Refuse the synthetic-UUID fallback on Nexus session-creation failures
  (uses the new provider.mode discriminator); the stdio MCP server now
  fails closed when the session record is missing, so a fabricated ID
  would immediately crash every spawned agent.
- Write .grove/current-session.json on BOTH new-session and resume
  paths so the HTTP MCP server's session scoping stays fresh across
  restarts and resumes.
- Atomic writes via temp-file + rename so concurrent readers in
  serve-http.ts never observe truncated JSON during a session switch.

src/cli/main.tui-dispatch.test.ts
- Pre-existing flake: bun cold-start of main.ts exceeds the hardcoded
  2s kill deadline on warm machines. Bumped inner timeout to 5s and the
  outer bun:test timeout to 15s so the race finishes cleanly.

Test coverage: `bun test` → 4878 pass, 0 fail. `bun x tsc --noEmit`
clean. `bun x biome check` clean on changed files.
Lands the safe subset of the sibling WIP that was in the worktree
alongside the Nexus/MCP fixes. Two themes:

1. Debug log hygiene
   - src/tui/debug-log.ts: add `ENABLED` gate keyed off GROVE_DEBUG=1
     so the unconditional `/tmp/grove-debug.log` writes in hot paths
     become no-ops in normal operation. Before: every NexusHandoffStore
     read/write and every NexusContributionStore put/list wrote a
     timestamped line. After: only when GROVE_DEBUG=1 is set.
   - src/nexus/nexus-handoff-store.ts: replace 6× inlined
     `appendFileSync("/tmp/grove-debug.log", ...)` blocks with calls
     to the shared `debugLog()` helper. Net -53 lines of boilerplate
     and one consistent gate instead of six unguarded writes.
   - src/nexus/nexus-contribution-store.ts: same cleanup; net -22
     lines. Reuses the existing `manifestPath` local rather than
     recomputing it inside the log.

2. contributeOperation handoff fan-out improvements
   - src/core/operations/contribute.ts: the `createMany` fast path
     already existed; the per-input fallback was serial and swallowed
     every error silently. Switch the fallback to `Promise.allSettled`
     so N downstream handoffs pay 1×RTT instead of N×RTT, and
     surface per-failure reasons via `console.warn` so operators can
     diagnose routing gaps instead of noticing handoffs missing from
     the UI without any log trail. The best-effort semantics are
     preserved: a failure does not abort the contribution that was
     already committed.
   - src/core/operations/contribute.test.ts: add a describe block
     exercising the serial (fallback) path with an injected faulty
     handoff store. Verifies (a) the contribution is still durably
     persisted when every handoff create throws and (b) a single
     `console.warn` fires for the failed batch.

3. Doc-only clarifications
   - src/local/sqlite-store.ts: extend the `putWithCowrite` JSDoc to
     spell out the duck-type capability contract used by
     contributeOperation's `writeAtomic` selector.
   - src/local/sqlite-handoff-store.ts: matching comment on
     `insertSync` pointing at the same selector rule.

Explicitly NOT included in this commit (separate issue to follow):
`src/core/operations/bounty.ts` and `src/core/operations/bounty.test.ts`.
Three codex adversarial-review rounds flagged the new compensation
pattern there (settle-before-pay corruption + post-commit reservation
rollback + post-commit claim release). Those fixes need a saga /
outbox / reconciler design decision, not a mechanical change, so
they stay in the worktree until the design is resolved.

Test coverage: `bun test` (with NEXUS env) → 4878 pass, 0 fail.
`bun x tsc --noEmit` clean. `bun x biome check` clean on changed files.
windoliver added a commit that referenced this pull request Apr 11, 2026
…ED ON #240)

**DO NOT MERGE** — this is a draft PR tracking in-progress work on bounty
operation compensation. Three HIGH/CRITICAL correctness bugs have been
flagged by adversarial review across three independent rounds; see #240
for the findings, design options, and recommended next steps.

## What this adds

- `createBountyOperation`: try/catch that voids the credit reservation
  when `bountyStore.createBounty()` throws, so escrow is returned on
  creation failure.
- `claimBountyOperation`: pre-flight check that only `open` bounties
  can be claimed, plus try/catch that releases the claim record when
  `bountyStore.claimBounty()` throws so other agents aren't blocked
  until the lease expires.
- `settleBountyOperation`: reordered to persist `completed → settled`
  state transitions BEFORE calling `creditsService.capture()`, with a
  comment explaining the intent (make the failure recoverable by retry).
- Test coverage for each compensation path using mock stores.
- Extract `MISSING_BOUNTY_STORE` constant for the repeated error
  message.

## Known bugs (see #240 for details)

1. **[CRITICAL]** Settlement now commits terminal `settled` state
   before `capture()` runs. A capture failure leaves the bounty paid
   according to the state machine but with no funds actually
   transferred, and there is no reconciler in the codebase to repair
   it. src/core/operations/bounty.ts:319-333

2. **[HIGH]** The `createBounty` catch voids the reservation on any
   error, including post-commit failures from the NexusBountyStore
   two-step write (bounty doc first, then status index). A late
   failure leaks an "open" bounty with no escrow backing it.
   src/core/operations/bounty.ts:156-176

3. **[HIGH]** Same shape on `claimBounty`: the compensation releases
   the claim even when the bounty state transition actually persisted,
   leaving the bounty stuck in `claimed` pointing at a released claim.
   src/core/operations/bounty.ts:258-269

## Design options for the fix

See #240 — three paths, roughly:
- **(a)** Full saga: `pending_settlement` state + reconciler + CAS
  confirmation on post-commit failures. ~200-500 lines, 1-2 days.
- **(b)** CAS confirmation only: re-read before compensating so we
  don't rollback committed state. Doesn't fix the settle-before-pay
  race. ~100-200 lines.
- **(c)** Revert the compensation refactor, keep only the cosmetic
  changes (constant extract + pre-flight claim check). Relies on
  reservation/lease expiry for recovery. ~50-100 lines.

The issue recommends starting with (c) to stop the bleed, then doing
(a) as a proper saga-log project with its own RFC.

## Why open this as a draft PR now

This code existed in an uncommitted worktree alongside the
`fix(tui,mcp,acpx)` work in PR #239. Rather than leave it in a
limbo state where it can be accidentally restaged on the next push,
this branch makes it visible, reviewable, and explicitly linked to
the design issue it depends on.

Do not merge until #240 has a resolved design and the code matches.
CI caught two lint errors that `bun x biome check` didn't surface when
I ran it against only the changed files in the previous commits.
Running the full `bun run check` the way CI does reproduced both.

- src/core/acpx-runtime.test.ts: previous auto-fix left the `test()`
  call with the 30s timeout arg on a broken trailing-comment line that
  biome's formatter rejected on the second pass. Rewrote to the
  canonical 3-arg form with a single blank line between the arrow
  body and the timeout.
- src/tui/spawn-manager.test.ts: the `setDefaultTimeout(30_000)` call
  I added in a prior round split the import block in two, which
  biome's `assist/source/organizeImports` rule flagged. Added a blank
  line between the call and the subsequent type imports so the
  import group is contiguous.

No behavior change. `bun run check` is clean now (16 pre-existing
warnings unrelated to my files). Full test suite still 4870 pass,
0 fail with NEXUS env set.
- contribute.ts: wrap parallel handoff create() in Promise.resolve().then()
  so a synchronous throw becomes a rejected promise instead of escaping
  .map() and bypassing allSettled. Without this, a sync throw from a
  HandoffStore extension would surface as an operation error after the
  contribution was already committed, releasing the idempotency slot and
  allowing duplicate contributions on retry.
- contribute.test.ts: regression test with a HandoffStore whose create()
  throws synchronously and no createMany (forces the fallback path).
- spawn-manager.ts: forward GROVE_DEBUG to spawned MCP agents via both
  .mcp.json/.acpxrc.json env and the codex mcp registration. debugLog()
  in NexusContributionStore/NexusHandoffStore reads the env var at module
  load inside the child grove-mcp process, which does not inherit env
  from the parent TUI shell — so without this passthrough turning on
  GROVE_DEBUG=1 in the parent never re-enabled agent-side traces.
@windoliver windoliver merged commit 33d26fc into main Apr 11, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant