fix(bounty): multi-store atomicity, saga settlement, reconciler framework #254

Merged
windoliver merged 14 commits into main from fix/bounty-atomicity-phase1
Apr 15, 2026

@windoliver
Owner

Summary

Closes #240. Addresses the three codex-flagged correctness bugs in bounty operations and adds infrastructure to prevent recurrence.

  • Saga-based settlement: new pending_settlement pivot state ensures settleBountyOperation never enters a terminal state before capture() confirms. Resumable from any intermediate state (pending_settlement, completed) by both the operation and the background reconciler.
  • SweepReconciler framework: pluggable sweep strategies run on a 60s timer in all three runtimes (HTTP server, stdio MCP, HTTP MCP). Ships with BountyIndexSweep (dual-write index repair), SettlementSweep (resume stalled settlements), and HandoffSweep (detect orphaned contributions).
  • Claim renewal: same agent can extend a bounty claim lease without reopening the bounty, preventing long-running work from getting stranded on lease expiry.
  • State machine hardening: claimed → completed bypass removed — all settlement must go through pending_settlement. validateBountyTransition enforced in NexusBountyStore before every CAS write.
  • Input safety: amount > 0, non-empty title, pre-flight status checks, frozen fulfillment CID on retry, orphaned claim release on pre-commit failure.
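The state machine hardening above can be sketched as a transition table with no claimed → completed edge. This is an illustrative sketch, not the code from bounty-logic.ts: the status set, the table contents, and the `assertTransition` name (a hypothetical stand-in for validateBountyTransition) are assumptions beyond what this description states.

```typescript
// Assumed status set; the real enum may contain more states.
type BountyStatus = "open" | "claimed" | "pending_settlement" | "completed" | "settled";

// Assumed transition table. claimed -> completed is deliberately absent, so
// settlement has to pass through the pending_settlement pivot.
// claimed -> claimed models the claim-rotation self-transition.
const ALLOWED: Record<BountyStatus, readonly BountyStatus[]> = {
  open: ["claimed"],
  claimed: ["claimed", "open", "pending_settlement"],
  pending_settlement: ["completed"],
  completed: ["settled"],
  settled: [],
};

// Hypothetical stand-in for validateBountyTransition: throws on an edge
// that is not listed in the table.
function assertTransition(from: BountyStatus, to: BountyStatus): void {
  if (!ALLOWED[from].includes(to)) {
    throw new Error(`invalid bounty transition: ${from} -> ${to}`);
  }
}
```

Calling a check like this immediately before every CAS write means a stale in-memory status is rejected before it can reach the store.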

What changed

| Area | Files | Change |
| --- | --- | --- |
| Saga settlement | bounty.ts, bounty-logic.ts, bounty-store.ts, both store impls | pending_settlement state, beginSettlement(), 3-step settle flow |
| Reconciler | sweep-reconciler.ts, bounty-index-sweep.ts, settlement-sweep.ts, handoff-sweep.ts | Framework + 3 strategies |
| Runtime wiring | server/serve.ts, mcp/serve.ts, mcp/serve-http.ts | Reconciler started + stopped in all entry points |
| Claim renewal | operations/bounty.ts | Same-agent renewal path in claimBountyOperation |
| Schema propagation | schemas.ts, mcp/tools/bounties.ts | pending_settlement added to all Zod enums |
| Tests | bounty.test.ts, sweep-reconciler.test.ts, failing-bounty-store.ts | 120 tests: acceptance criteria, validation matrix, conflict scenarios, sweep strategies |
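For orientation, the reconciler row can be pictured as a small class that runs pluggable strategies on a timer and skips a tick while a cycle is still in flight. This is a minimal sketch under assumptions: the `SweepStrategy` shape, the `SweepReconcilerSketch` name, and the choice to log and continue when one strategy fails are illustrative, not taken from sweep-reconciler.ts.

```typescript
// Hypothetical strategy shape; the real interface may differ.
interface SweepStrategy {
  readonly name: string;
  sweep(): Promise<void>;
}

class SweepReconcilerSketch {
  private timer: ReturnType<typeof setInterval> | undefined;
  private inFlight = false;
  private readonly strategies: readonly SweepStrategy[];
  private readonly intervalMs: number;

  constructor(strategies: readonly SweepStrategy[], intervalMs = 60_000) {
    this.strategies = strategies;
    this.intervalMs = intervalMs;
  }

  // Runs every strategy once. Returns false when a cycle is already running.
  async runCycle(): Promise<boolean> {
    if (this.inFlight) return false;
    this.inFlight = true;
    try {
      for (const strategy of this.strategies) {
        try {
          await strategy.sweep();
        } catch (err) {
          // Assumed policy: one failing strategy does not block the others.
          console.error(`sweep ${strategy.name} failed`, err);
        }
      }
      return true;
    } finally {
      this.inFlight = false;
    }
  }

  start(): void {
    if (this.timer !== undefined) return;
    this.timer = setInterval(() => {
      void this.runCycle();
    }, this.intervalMs);
  }

  stop(): void {
    if (this.timer !== undefined) clearInterval(this.timer);
    this.timer = undefined;
  }
}
```

An entry point would construct it with its strategies, call `start()` after store setup, and call `stop()` on shutdown.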

Adversarial review

6 rounds of Codex adversarial review. 12 findings fixed (1 critical, 8 high, 1 medium). Key fixes:

  • Frozen fulfillment CID prevents non-deterministic settlements on retry
  • Process-local bounty cache removed — mutations always read fresh from VFS
  • completed bounties recoverable (capture already happened, just advance to settled)
  • SettlementSweep hard-fails on escrowed bounties without CreditsService
  • State machine enforces pending_settlement as mandatory pivot
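How the frozen CID and the mandatory pivot interact on retry can be sketched as follows. This is a simplified in-memory model, not settleBountyOperation itself: `settleSketch`, the `Bounty` shape, and the `Capture` callback are hypothetical, and the real flow persists each step through the store with CAS writes.

```typescript
type Status = "claimed" | "pending_settlement" | "completed" | "settled";

interface Bounty {
  status: Status;
  fulfillmentCid?: string; // frozen once the saga pivot is written
}

// Hypothetical stand-in for the credits capture call.
type Capture = () => Promise<void>;

async function settleSketch(bounty: Bounty, cid: string, capture: Capture): Promise<Bounty> {
  if (bounty.status === "settled") return bounty;
  if (bounty.status === "claimed") {
    // Step 1: write the pivot and freeze the CID before any funds move.
    bounty.status = "pending_settlement";
    bounty.fulfillmentCid = cid;
  } else if (bounty.fulfillmentCid !== cid) {
    // Resuming with a different CID would make the settlement
    // non-deterministic (applied to every resumed state in this sketch).
    throw new Error("fulfillment CID is frozen after the saga pivot");
  }
  if (bounty.status === "pending_settlement") {
    // Step 2: capture. If this throws, the bounty stays retryable.
    await capture();
    bounty.status = "completed";
  }
  // Step 3: capture already happened, so only the final advance remains.
  bounty.status = "settled";
  return bounty;
}
```

A failed capture leaves the bounty in pending_settlement, so a later call with the same CID resumes at step 2 instead of starting over.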

Test plan

  • 120 tests pass across 4 test files (bun test)
  • Type check clean (bun run check — 2 pre-existing warnings only)
  • AC1: capture() throws after state transitions → bounty retryable from pending_settlement
  • AC2: createBounty() post-commit throw → reservation NOT voided
  • AC3: claimBounty() post-commit throw → claim NOT released (or released only on pre-commit)
  • Same-agent claim renewal succeeds, different-agent rejected
  • SettlementSweep resumes stalled pending_settlement bounties to settled
  • SettlementSweep handles completed bounties (post-capture, just needs settleBounty)
  • Sequential conflict tests: double-claim, double-settle, claim-after-settle
  • Non-escrowed bounty full lifecycle (no credits service)
  • 6 rounds adversarial review — no remaining HIGH/CRITICAL on bounty code
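The post-commit cases in AC2 and AC3 need a store that throws after the write has already landed. One way to inject that is a thin wrapper like the one below. This is a hypothetical sketch: `KeyValueStore`, `InMemoryStore`, `FailingStoreSketch`, and `FailMode` are invented for illustration, and the real FailingBountyStore in failing-bounty-store.ts is not shown in this description.

```typescript
// Hypothetical minimal store interface for the sketch.
interface KeyValueStore {
  put(key: string, value: string): Promise<void>;
  get(key: string): Promise<string | undefined>;
}

class InMemoryStore implements KeyValueStore {
  private readonly data = new Map<string, string>();
  async put(key: string, value: string): Promise<void> {
    this.data.set(key, value);
  }
  async get(key: string): Promise<string | undefined> {
    return this.data.get(key);
  }
}

type FailMode = "none" | "pre_commit" | "post_commit";

class FailingStoreSketch implements KeyValueStore {
  mode: FailMode = "none";
  private readonly inner: KeyValueStore;

  constructor(inner: KeyValueStore) {
    this.inner = inner;
  }

  async put(key: string, value: string): Promise<void> {
    // Pre-commit: fail before anything is written.
    if (this.mode === "pre_commit") throw new Error("injected pre-commit failure");
    await this.inner.put(key, value);
    // Post-commit: the write is durable, but the caller still sees a throw.
    if (this.mode === "post_commit") throw new Error("injected post-commit failure");
  }

  get(key: string): Promise<string | undefined> {
    return this.inner.get(key);
  }
}
```

With the mode set to `"post_commit"`, a test can assert that the operation under test does not run compensation for a write that actually committed.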

Follow-up

…work (#240)

Addresses the three codex-flagged correctness bugs in bounty operations
and adds infrastructure to prevent recurrence:

- Add pending_settlement saga pivot state so settleBountyOperation never
  enters a terminal state before capture() confirms
- Add pre-flight status checks in claim/settle to prevent wasted side effects
- Add input validation (amount > 0, non-empty title) at operation boundary
- Add LRU doc cache + ETag-forwarding in NexusBountyStore to reduce VFS
  round-trips in multi-transition flows
- Add SweepReconciler framework with pluggable strategies for periodic
  consistency repair
- Add BountyIndexSweep (dual-write index repair), SettlementSweep
  (resume stalled pending_settlement), HandoffSweep (detect orphans)
- Add FailingBountyStore test wrapper for partial-failure injection
- Add lazy eviction of expired reservations in InMemoryCreditsService

115 tests pass across 4 test files including all 3 acceptance criteria
from Issue #240.

1. Freeze fulfillment CID after saga pivot: when resuming a
   pending_settlement bounty, reject attempts to change the
   contribution CID. Prevents non-deterministic settlements.

2. Remove stale cache from transitionBounty: mutations always
   read fresh from VFS to get a valid ETag. Cache is still used
   for read-only getBounty() pre-flight checks.

3. Wire SweepReconciler into server startup: BountyIndexSweep
   and SettlementSweep now run on a 60s timer with graceful
   shutdown. Closes the "recovery not wired" gap.

1. Remove process-local bounty cache entirely: mutable objects must
   not be cached without cross-process invalidation. getBounty() now
   always reads fresh from VFS. Add validateBountyTransition() call
   in transitionBounty() to reject stale state before CAS write.

2. Extend settlement recovery to handle "completed" status: if capture
   succeeded and completeBounty committed but settleBounty failed, the
   operation and SettlementSweep can now resume from "completed" state.
   Prevents stranded post-capture bounties.

… 1 MEDIUM

1. [critical] SettlementSweep hard-fails when bounty has reservationId
   but no creditsService — prevents settling escrowed bounties without
   actually capturing funds.

2. [high] Remove claimed→completed from state machine — force all
   settlement through pending_settlement pivot. Update conformance
   tests and bounty-logic tests to use beginSettlement first.

3. [medium] BountyIndexSweep now calls repairIndex unconditionally for
   every bounty — cleans both missing current-status entries AND stale
   old-status markers.

1. Remove SettlementSweep from server startup: local runtime has no
   CreditsService, so the sweep would hard-fail on escrowed bounties.
   Only BountyIndexSweep is registered. Settlement sweep will be enabled
   when a production CreditsService is wired in.

2. Release orphaned claims on bounty transition failure: if claimBounty()
   fails, re-read the bounty and release the claim only if the bounty is
   still open (confirming the transition didn't commit). Post-commit
   failures keep the claim for consistency.
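The re-read-then-release rule in item 2 can be sketched like this. The two interfaces and the `claimWithCompensation` name are hypothetical simplifications, and rethrowing the original error after compensation is an assumption of the sketch.

```typescript
// Hypothetical minimal interfaces; the real stores have richer signatures.
interface BountyStoreLike {
  claimBounty(bountyId: string, claimId: string): Promise<void>;
  getStatus(bountyId: string): Promise<"open" | "claimed">;
}

interface ClaimStoreLike {
  release(claimId: string): Promise<void>;
}

async function claimWithCompensation(
  bounties: BountyStoreLike,
  claims: ClaimStoreLike,
  bountyId: string,
  claimId: string,
): Promise<void> {
  try {
    await bounties.claimBounty(bountyId, claimId);
  } catch (err) {
    // The throw alone does not say whether the write committed, so re-read.
    const status = await bounties.getStatus(bountyId);
    if (status === "open") {
      // Pre-commit failure: the claim is orphaned, so release it.
      await claims.release(claimId);
    }
    // Post-commit failure: keep the claim so it stays consistent with the bounty.
    throw err;
  }
}
```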
1. Add pending_settlement to Zod schemas in core/schemas.ts and
   mcp/tools/bounties.ts — prevents parsers from rejecting bounties
   in the new pivot state.

2. Re-enable SettlementSweep in server: it safely recovers non-escrowed
   bounties (no reservationId). Escrowed bounties log an error and wait
   for CreditsService. Update doc comment in bounty.ts lifecycle.

SettlementSweep: completed bounties have already captured — skip the
creditsService requirement and just advance to settled. Only
pending_settlement bounties need the capture step.
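
The resulting decision per stalled bounty can be written as a pure function. This is a sketch under assumptions: `StalledBounty`, `SweepAction`, and `planSettlementSweep` are hypothetical names, and the real SettlementSweep performs the store and credits calls itself.

```typescript
interface StalledBounty {
  status: "pending_settlement" | "completed";
  reservationId?: string; // present only for escrowed bounties
}

type SweepAction = "settle" | "capture_then_settle" | "error_missing_credits";

function planSettlementSweep(bounty: StalledBounty, hasCreditsService: boolean): SweepAction {
  // completed: capture already succeeded, so no credits service is needed.
  if (bounty.status === "completed") return "settle";
  // pending_settlement without a reservation: nothing was escrowed.
  if (bounty.reservationId === undefined) return "settle";
  // Escrowed and not yet captured: capturing requires the credits service.
  return hasCreditsService ? "capture_then_settle" : "error_missing_credits";
}
```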

Remaining findings (out of scope for #240):
- Claim renewal/heartbeat path: pre-existing design gap, not introduced
  by this branch. Tracked separately.
- Nexus MCP sweep wiring: requires architectural changes to MCP server
  startup. Tracked as follow-up integration work.

1. Same-agent claim renewal: claimBountyOperation now allows the
   current claim holder to extend their lease without reopening the
   bounty. Different agents are still rejected. Prevents long-running
   bounties from getting stranded when the claim lease expires.

2. Wire SweepReconciler into both MCP entry points:
   - serve.ts (stdio): starts BountyIndexSweep + SettlementSweep
     after store setup, stops on shutdown
   - serve-http.ts (HTTP): starts at process level using zone-scoped
     Nexus bounty store (not session-scoped), stops on shutdown

   The reconciler now runs in all three runtimes that can create
   bounties: HTTP server, stdio MCP, and HTTP MCP.

1. [high] Claim renewal with expired lease: detect if existing claim
   is expired and create a fresh claim ID instead of reusing the stale
   one. Rebinds the bounty to the new claim atomically.

2. [high] Remove SettlementSweep from MCP runtimes: no CreditsService
   available, escrowed bounties would fail every cycle. Only
   BountyIndexSweep registered. Settlement recovery deferred to #253.

3. [medium] BountyIndexSweep now detection-based: queries status-filtered
   lists to find actual drift, only calls repairIndex when missing.
   No more unconditional rewrite of every healthy bounty each cycle.

1. [high] Claim rebind after lease expiry: allow claimed→claimed
   self-transition so expired claim IDs get rotated to fresh ones.
   The bounty record is atomically rebound to the new claim.

2. [high] Re-enable SettlementSweep in MCP runtimes: completed
   bounties (already captured) can settle without CreditsService.
   Only pending_settlement+reservationId cases log errors.

3. [medium] repairIndex version-aware: re-reads with ETag before
   deleting stale markers. Skips cleanup if a concurrent transition
   changed the bounty between read and delete.

1. [high] Claim renewal checks lease validity (not just status): only
   reuse claimId if both status=active AND leaseExpiresAt > now.

2. [high] Compensation on rotated-claim rebind failure: release the
   orphaned new claim if bountyStore.claimBounty throws.

3. [medium] Remove sweep reconciler from stdio MCP: per-agent processes
   must not run zone-wide sweeps (N×load, CAS conflicts). Sweeps run
   only in HTTP server + HTTP MCP (singleton processes).
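The renewal rule from item 1 reduces to a three-way decision. The `Claim` shape, the epoch-millisecond lease field, and the `decideRenewal` name are assumptions made for this sketch and are not taken from operations/bounty.ts.

```typescript
interface Claim {
  claimId: string;
  agentId: string;
  status: "active" | "released" | "expired";
  leaseExpiresAt: number; // epoch milliseconds (assumed representation)
}

type RenewalDecision =
  | { kind: "reject" } // a different agent holds the claim
  | { kind: "extend"; claimId: string } // reuse the existing claim ID
  | { kind: "rotate" }; // create a fresh claim and rebind the bounty

function decideRenewal(existing: Claim, agentId: string, now: number): RenewalDecision {
  if (existing.agentId !== agentId) return { kind: "reject" };
  // Status alone is not enough: the lease must also still be in the future.
  const leaseValid = existing.status === "active" && existing.leaseExpiresAt > now;
  return leaseValid ? { kind: "extend", claimId: existing.claimId } : { kind: "rotate" };
}
```

The `rotate` branch is where the compensation from item 2 applies: if rebinding the bounty to the new claim fails, the new claim has to be released.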
1. [high] Rebind compensation re-reads bounty: only releases the new
   claim if the bounty didn't commit the rebind (post-commit safety).

2. [medium] Serialized sweep cycles: in-flight guard prevents
   overlapping async cycles from contending with each other.

3. [medium] BountyIndexSweep stale-marker detection: known limitation —
   listBounties(status) filters stale entries before the sweep sees
   them. Full fix requires a raw index listing API (store-layer change).
   repairIndex handles cleanup when triggered by other paths.

All three "persistent" findings from the review loop are now fixed using
existing Nexus VFS operations — no Nexus changes needed.

1. repairIndex race: check exists() before each stale marker delete,
   re-read the authoritative document right before deleting to confirm
   the bounty hasn't transitioned TO that status concurrently.

2. BountyIndexSweep stale-marker detection: new listIndexStatuses()
   method on NexusBountyStore checks which status index entries actually
   exist using client.exists(). Sweep now detects both missing current
   entries AND stale old-status entries in a single pass.

3. Added listIndexStatuses to BountyStore interface (optional) and
   FailingBountyStore wrapper.
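
Given the list of statuses that currently have an index entry for a bounty, the single-pass drift check in item 2 could look roughly like this. `IndexDrift` and `detectIndexDrift` are hypothetical names; the real sweep obtains the list from listIndexStatuses() and hands repair to repairIndex.

```typescript
interface IndexDrift {
  missingCurrent: boolean; // no index entry for the authoritative status
  stale: string[]; // index entries left over from earlier statuses
}

function detectIndexDrift(currentStatus: string, indexedStatuses: readonly string[]): IndexDrift {
  return {
    missingCurrent: !indexedStatuses.includes(currentStatus),
    stale: indexedStatuses.filter((status) => status !== currentStatus),
  };
}

// A bounty needs repair when either kind of drift is present.
function needsRepair(drift: IndexDrift): boolean {
  return drift.missingCurrent || drift.stale.length > 0;
}
```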
@windoliver merged commit b876ed2 into main on Apr 15, 2026
1 check passed

Development

Successfully merging this pull request may close these issues.

bounty multi-store atomicity + saga (absorbs #227): compensation bugs, handoff reconciler, settle-before-pay
