
fix(api): Store is the sync boundary, not a pointer leak #27

Merged
arcaven merged 1 commit into main from fix/store-sync-contract
Apr 19, 2026

@arcaven arcaven commented Apr 19, 2026

Summary

  • Closes the data race that go test -race caught in CI run 24617740013: the team controller's evaluateHealth was mutating a *Session while the daemon's handleGet was marshalling the same object.
  • Fixes the class of bug, not the specific pair. Store.{Get,List}* now return value copies; mutations go through Store.{UpdateSession,UpdateTeam}(key, func(*T) error), which take the write lock. Pointers to internal objects never escape.
  • Audits every writer: session.Manager.Create/ReapDead, team.Controller.evaluateHealth/applyRestartPolicy/restartSession/InitiateShift/shift handlers, api.Manifest.Apply (silent cousin race), daemon.handleScale (another silent cousin).
  • Adds .claude/rules/go.md Rule 12 — never return internal pointers from a locked accessor, with examples.
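
A minimal sketch of the contract described above, under stated assumptions: the Session fields, the GetSession signature, ErrNotFound, NewStore and PutSession are hypothetical stand-ins, not the repository's actual definitions. Only the UpdateSession(key, func(*Session) error) shape comes from this PR.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Session is a hypothetical, trimmed-down stand-in for the real type.
type Session struct {
	ID          string
	HealthState string
}

// ErrNotFound is an assumed sentinel error for this sketch.
var ErrNotFound = errors.New("session not found")

// Store owns the sessions; callers never receive a pointer into the map.
type Store struct {
	mu       sync.RWMutex
	sessions map[string]*Session
}

func NewStore() *Store { return &Store{sessions: make(map[string]*Session)} }

// PutSession stores a private copy of sess (hypothetical helper).
func (s *Store) PutSession(sess Session) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.sessions[sess.ID] = &sess
}

// GetSession returns a value copy taken under the read lock.
func (s *Store) GetSession(id string) (Session, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	sess, ok := s.sessions[id]
	if !ok {
		return Session{}, false
	}
	return *sess, true
}

// UpdateSession runs fn against the live object while holding the write lock.
func (s *Store) UpdateSession(id string, fn func(*Session) error) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	sess, ok := s.sessions[id]
	if !ok {
		return ErrNotFound
	}
	return fn(sess)
}

func main() {
	store := NewStore()
	store.PutSession(Session{ID: "s1", HealthState: "healthy"})

	// Writing to a snapshot touches only the caller's copy.
	snap, _ := store.GetSession("s1")
	snap.HealthState = "scribbled"

	// Real mutations go through the closure, under the write lock.
	err := store.UpdateSession("s1", func(sess *Session) error {
		sess.HealthState = "degraded"
		return nil
	})

	cur, _ := store.GetSession("s1")
	fmt.Println(err, snap.HealthState, cur.HealthState)
}
```

The closure must not call back into the Store, since sync.RWMutex is not reentrant and a nested Get or Update would deadlock.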

See orc finding-032-store-sync-contract-leak for the class-of-bug writeup + recommendations.

Refs: aae-orc-ogto

Test plan

  • just test-race — all 17 packages pass
  • just lint — 0 issues
  • just fmt — clean
  • TestHandleApplyRuntimePreflight (the original failure) green under -race
  • All shift/health lifecycle tests green under -race

CI run 24617740013 caught a data race in TestHandleApplyRuntimePreflight:
team.Controller.evaluateHealth was mutating sess.HealthState through a
pointer returned by Store.ListSessions while daemon.handleGet was
json.Marshal'ing the same *Session. The mutex protected the map; it
did not protect the values.
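
The aliasing behind that race can be shown in a few lines. This is an illustrative reconstruction, not the pre-fix code: leakyStore and the Session fields are hypothetical, and the example stays single-goroutine so it demonstrates the leak without itself racing.

```go
package main

import (
	"fmt"
	"sync"
)

// Session is a hypothetical, trimmed-down stand-in for the real type.
type Session struct {
	ID          string
	HealthState string
}

// leakyStore sketches the old contract: the lock guards the map only.
type leakyStore struct {
	mu       sync.RWMutex
	sessions map[string]*Session
}

// ListSessions holds the read lock while walking the map, but the pointers
// it returns outlive the lock and alias the store's own objects.
func (s *leakyStore) ListSessions() []*Session {
	s.mu.RLock()
	defer s.mu.RUnlock()
	out := make([]*Session, 0, len(s.sessions))
	for _, sess := range s.sessions {
		out = append(out, sess)
	}
	return out
}

func main() {
	store := &leakyStore{sessions: map[string]*Session{
		"s1": {ID: "s1", HealthState: "healthy"},
	}}

	// A caller in the role of evaluateHealth: this write happens with no
	// lock held, yet it lands in memory the store still owns.
	for _, sess := range store.ListSessions() {
		sess.HealthState = "degraded"
	}

	// A concurrent reader of the same pointer (handleGet marshalling it,
	// in the CI failure) would race with the write above.
	fmt.Println(store.sessions["s1"].HealthState)
}
```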

Fix the contract, not the one pair:

- Store.{Get,List}* return value copies (with cloned slices for Team
  and Runtime). Pointers to internal objects never escape.
- New Store.{UpdateSession,UpdateTeam}(key, func) take the write lock
  and mutate the live object in place. UpdateSessionHeartbeat stays as
  a convenience helper.
- session.Manager.Create commits PaneID+State=Running via UpdateSession
  after pane spawn (was mutating the caller's pointer, which used to
  alias the store's live object).
- session.Manager.ReapDead routes the Crashed+PaneID="" mutation
  through UpdateSession.
- team.Controller.evaluateHealth opens an UpdateSession closure per
  session so the entire health evaluation runs under the write lock.
- team.Controller.applyRestartPolicy and restartSession commit State
  transitions through UpdateSession.
- team.Controller.InitiateShift / reconcileShift / shiftLaunch /
  shiftDrain route Team.Generation and Team.Shift mutations through
  UpdateTeam.
- api.Manifest.Apply routes the "update roles on existing team" path
  through UpdateTeam (another silent race site).
- daemon.handleScale routes the Roles[i].Replicas = N mutation through
  UpdateTeam.
- Tests that mutated LastHeartbeat on snapshot results now call
  UpdateSession; tests that mutated Team.Roles[i].Replicas call
  UpdateTeam; tests that read team.Shift state fetch fresh via
  store.GetTeam.
- .claude/rules/go.md — new Rule 12: never return internal pointers
  from a locked accessor, with correct/incorrect examples. References
  orc finding-032.
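
A struct copy alone is not enough for Team, because a copied slice header still shares its backing array with the original. The sketch below shows why the accessors clone slices and how a handleScale-style change could go through UpdateTeam. The Role and Team fields, the clone helper and the error text are assumptions made for illustration; only the UpdateTeam(key, func(*Team) error) shape is taken from this PR.

```go
package main

import (
	"fmt"
	"sync"
)

// Role and Team are hypothetical, trimmed-down stand-ins for the real types.
type Role struct {
	Name     string
	Replicas int
}

type Team struct {
	Name       string
	Generation int
	Roles      []Role
}

// clone copies the struct and gives the copy its own Roles backing array.
func (t Team) clone() Team {
	t.Roles = append([]Role(nil), t.Roles...)
	return t
}

type Store struct {
	mu    sync.RWMutex
	teams map[string]*Team
}

// GetTeam returns a deep-enough copy: no slice is shared with the store.
func (s *Store) GetTeam(name string) (Team, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	t, ok := s.teams[name]
	if !ok {
		return Team{}, false
	}
	return t.clone(), true
}

// UpdateTeam runs fn against the live team while holding the write lock.
func (s *Store) UpdateTeam(name string, fn func(*Team) error) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	t, ok := s.teams[name]
	if !ok {
		return fmt.Errorf("team %q not found", name)
	}
	return fn(t)
}

func main() {
	store := &Store{teams: map[string]*Team{
		"core": {Name: "core", Roles: []Role{{Name: "worker", Replicas: 1}}},
	}}

	// Scribbling on a snapshot's slice does not reach the store.
	snap, _ := store.GetTeam("core")
	snap.Roles[0].Replicas = 99

	// A scale request in the style of handleScale: mutate under the lock.
	err := store.UpdateTeam("core", func(t *Team) error {
		for i := range t.Roles {
			if t.Roles[i].Name == "worker" {
				t.Roles[i].Replicas = 3
				return nil
			}
		}
		return fmt.Errorf("role %q not found", "worker")
	})

	cur, _ := store.GetTeam("core")
	fmt.Println(err, snap.Roles[0].Replicas, cur.Roles[0].Replicas)
}
```

The same reasoning explains the test changes listed above: a test that writes to a snapshot no longer affects the store, so it has to call UpdateTeam and then re-read through store.GetTeam.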

All packages pass go test -race. golangci-lint clean.

Refs: aae-orc-ogto, orc finding-032
@arcaven arcaven merged commit b8f1f4b into main Apr 19, 2026
7 checks passed
@arcaven arcaven added labels Apr 19, 2026: type.bug (broken behavior: something doesn't work as designed), type.regression (previously worked, now broken; latent or new), agent.worker (PR created by a Claude Code worker agent), area.store (api.Store)