
Fresh 1.13.1 install: AC startup race → triple auto-reset → ghost agent identities (head-N, re1-N) + Restart button leaves AC dead #579


Description

@realproject7


Context

Builds on #577 (--config flag fix, shipped in 1.13.1). The argparse error is gone — AC now starts and binds 8300. But on a clean install (rm -rf ~/.quadwork ~/.npm/_npx/*/*/quadwork, then npx quadwork@latest start → create project via dashboard), three layered issues surface that compound into a broken chat experience.

Symptom 1 — startup race triggers triple auto-reset

Server log on first project creation:

QuadWork server listening on http://127.0.0.1:8400
[health] AC health monitor started (30s interval)
Cloning into '/Users/<user>/Projects/<project>'...
Preparing worktree (checking out 'worktree-head')
Preparing worktree (checking out 'worktree-re1')
Preparing worktree (checking out 'worktree-re2')
Preparing worktree (checking out 'worktree-dev')
[snapshot] <project> history fetch returned 502; skipping snapshot
[snapshot] <project> history fetch returned 502; skipping snapshot
[health] AC for <project> on port 8300 is down (failure 1/3) — auto-restarting
[snapshot] <project> history fetch returned 502; skipping snapshot
[health] AC for <project> auto-restarted (PID: <n>)
[agentchattr] <project> auto-reset 4 agent(s) after AC restart
[agentchattr] <project> auto-reset 4 agent(s) after AC restart
[agentchattr] <project> auto-reset 4 agent(s) after AC restart
[health] AC for <project> recovered (port 8300 alive)

What happens (timeline):

  1. Worktrees clone successfully.
  2. Dashboard fetches AC history → 502 (AC not bound yet).
  3. Health monitor's first 30s tick sees port 8300 down → auto-restarts AC.
  4. auto-reset 4 agent(s) fires three times in a row. Each reset re-registers all four agents via /api/register against AC.
  5. AC eventually recovers.

The triple-fire of auto-reset is the trigger for Symptom 2.
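The cascade hinges on steps 2–3: nothing gates the first health check on AC actually binding the port. A minimal sketch of such a readiness gate, assuming a hypothetical `waitForAgentChattrReady` helper (the probe is injected here so the loop is testable; in QuadWork it would be a TCP connect or HTTP GET against 127.0.0.1:8300):

```javascript
// Sketch of the missing readiness gate (hypothetical helper; the real spawn
// sites are bin/quadwork.js and server/index.js's spawnChattr). `probe` is
// injected so the loop can run without a live server; in production it would
// check whether AC has bound the port.
async function waitForAgentChattrReady(port, timeoutMs, probe, intervalMs = 250) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await probe(port)) return true;            // AC is up: safe to fetch history
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  return false; // caller surfaces an error instead of letting health-check auto-restart
}

// Simulated AC that only answers after a few probes, mimicking the
// template-load / MCP-server / FastAPI-lifespan delay described above.
function makeSlowProbe(readyAfter) {
  let calls = 0;
  return async () => ++calls >= readyAfter;
}
```

With this in place, the dashboard's history fetch and the health monitor would only run once the gate resolves, so the false-down detection in step 3 never happens.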

Symptom 2 — ghost agent identity proliferation (most user-visible)

After the cascade above, AC's registry has accumulated a long list of suffixed identities. Evidence from one session's agentchattr.log:

  • 597 total 409 Conflict heartbeat responses across the log
  • Distinct ghost identities observed: head-2 … head-13, dev-1 … dev-10, re1-1 … re1-13, re2-2 … re2-10
  • Per-identity failed-heartbeat counts: 13–18 for each ghost slot

This becomes user-visible in the AGENTCHATTR primary chat: a single user prompt (@head ping dev, re1, and re2 to check if they're online) produced ~25 reply messages, most from suffixed ghost agents searching for mentions addressed to their own suffix and finding none. Excerpt (paraphrased, names preserved):

head-13: I checked #general for @head-2 mentions and found none. The only current message is addressed to @Head, so I am not taking action on it as @head-2.

head-7: I attempted to send as head-3, but AgentChattr reports that session as stale and will not let me reclaim it because head-3 is already claimed.

re1-12: @user I re-read #general for @re1-3. The only @re1-3 occurrence is the prior status note saying no direct actionable mention was found.

The agents are confused about their own identity, repeatedly searching for mentions that don't exist, and answering as ghost slots. The user-prompt-to-noise ratio is roughly 1:25.

Likely mechanism

  • AC's slot system suffixes a registration when the canonical name is already claimed (head-2 if head is taken).
  • Quadwork's auto-restart path tries to deregister the old slot and re-register with force: true to claim slot 1 (per code comments around server/index.js:380-388, 417-419 and #478).
  • During the triple-fire cascade, the deregister-then-register sequence isn't reliably claiming slot 1 — every retry pushes the suffix counter up by one.
  • The agents keep heartbeating as their previously-acquired ghost slot, so old suffixed identities stay alive in AC's registry until their 120s crash timeout (per [idle-fix] increased crash timeout to 120s (#502)).
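The suffix walk can be illustrated with a toy registry. This is not AC's real code, just a model of the claim/suffix behaviour described above: if the canonical slot stays claimed (a live heartbeat, or a deregister that targeted the wrong slot), every registration retry mints a fresh ghost:

```javascript
// Toy model of AC's suffixing slot system -- purely illustrative.
class SlotRegistry {
  constructor() {
    this.claimed = new Set();
    this.counters = new Map(); // canonical name -> last suffix handed out
  }
  register(name) {
    if (!this.claimed.has(name)) {
      this.claimed.add(name);
      return name; // canonical slot was free
    }
    // Canonical slot taken: hand out the next suffixed identity (head-2, head-3, ...)
    const n = (this.counters.get(name) || 1) + 1;
    this.counters.set(name, n);
    const slot = `${name}-${n}`;
    this.claimed.add(slot);
    return slot;
  }
  deregister(slot) {
    this.claimed.delete(slot);
  }
}

const reg = new SlotRegistry();
reg.register('head'); // the stuck claim the restart path fails to clear
const ghosts = [];
for (let i = 0; i < 3; i++) {
  // Each auto-reset retries registration; nothing ever frees 'head',
  // so each retry walks the counter upward.
  ghosts.push(reg.register('head'));
}
// ghosts → ['head-2', 'head-3', 'head-4']
```

Three auto-reset firings times four agents is exactly the shape of the `head-2 … head-13`-style accumulation in the log.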

Symptom 3 — Dashboard "Restart server and agents" leaves AC dead

After Symptom 2 manifested, clicking the Restart button (server controls) produces:

[agentchattr] <project> auto-reset 4 agent(s) after AC restart
[agentchattr] <project> auto-reset 4 agent(s) after AC restart
[agentchattr] <project> auto-reset 4 agent(s) after AC restart
[health] AC for <project> recovered (port 8300 alive)
[#565] Agent head: AC not reachable on port 8300 after 30s. Spawning without chat integration.
[#565] Agent re1:  AC not reachable on port 8300 after 30s. Spawning without chat integration.
[#565] Agent head: AC not reachable after 60s — health monitor will restart agent when AC recovers.
[#565] Agent re2:  AC not reachable on port 8300 after 30s. Spawning without chat integration.
[#565] Agent re1:  AC not reachable after 60s — health monitor will restart agent when AC recovers.
[#565] Agent dev:  AC not reachable on port 8300 after 30s. Spawning without chat integration.
[#565] Agent re2:  AC not reachable after 60s — health monitor will restart agent when AC recovers.
[#565] Agent dev:  AC not reachable after 60s — health monitor will restart agent when AC recovers.

End state observed via lsof -iTCP:8300 and ps aux | grep run.py:

  • Nothing listening on 8300
  • No run.py Python process
  • AC log ends with: INFO: Shutting down / Waiting for application shutdown / Application shutdown complete / Finished server process [<pid>]

So the Restart action kills AC successfully but never re-spawns it. Agents are reset and try to register against a dead port → 30s timeout → fall back to no-chat → 60s deferred wait also fails.
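For reference, the sequence the restart handler presumably should perform looks like the sketch below. Function names (`killProcessOnPort`, `spawnChattr`, etc.) are taken from this report, not verified against QuadWork's source, and all steps are injected so the ordering can be exercised without real processes:

```javascript
// Hypothetical shape of a correct restart sequence. Symptom 3 suggests the
// real handler runs the first step but never reaches spawnChattr.
async function restartAgentChattr(port, { killProcessOnPort, waitForPortFree, spawnChattr, waitForReady }) {
  await killProcessOnPort(port);         // works today: "Finished server process" is logged
  await waitForPortFree(port);           // don't spawn into a still-bound socket
  const child = await spawnChattr(port); // the step that appears to be skipped
  if (!(await waitForReady(port))) {
    // Surface the failure instead of letting agents time out against a dead port.
    throw new Error(`AC did not bind ${port} after restart`);
  }
  return child;
}

// Stubs that record call order, for checking the sequencing.
const calls = [];
const stub = (name, ret) => async () => { calls.push(name); return ret; };
```

A final readiness wait here would also convert the current silent failure into a loggable error rather than the `[#565]` timeout flood.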

Suspected root causes

These three symptoms compound from issues already noted as out-of-scope follow-ups in #577 — but together they make the chat unusable on a fresh install, so they're worth promoting to first-class bugs:

  1. No waitForAgentChattrReady after spawn (RC2 in #577). Both bin/quadwork.js (around the ok("AgentChattr started") line) and server/index.js's spawnChattr declare success the moment spawn() returns a PID, but AC takes several seconds to actually bind 8300 (templates load, MCP servers start, the FastAPI lifespan runs). The dashboard's history-snapshot fetch and the health monitor both check before that's done → 502 + false-down → an unnecessary auto-restart. A simple await waitForAgentChattrReady(8300, 30000) after spawn would prevent the entire cascade.

  2. Auto-reset triggered from multiple paths. [agentchattr] auto-reset 4 agent(s) firing three times for one logical AC restart suggests duplicate triggers — possibly the health-monitor restart endpoint's chained reset (#447) plus the startup migrations ([bridge-migrate] restarted AC, [ghost-fix] patched registry.py) all firing resets independently. Worth deduping.

  3. Restart endpoint's spawn-new step is failing or skipped. The kill half works (Finished server process is logged); the spawn-new half doesn't appear to run, or it runs and the new AC dies silently. Since [#573] Separate install wizard from server launch (#574-576) added stdio capture, an AC startup failure should show up in agentchattr.log, but the log just ends with shutdown messages and no new startup banner, so the spawn step likely never executed. Worth tracing the /api/agentchattr/<id>/restart handler in server/routes.js and verifying it actually calls spawnChattr after killProcessOnPort.

  4. Pin checkout silently failing (RC5 in #577, still not addressed). The AC clone's HEAD is at main (currently 0440f5d), not the pinned commit 3e71d42; the pinned branch doesn't exist locally on this install. The console.warn at bin/quadwork.js:272-273 is buried mid-spinner during install. Now that --config is fixed, HEAD happens to work, so this isn't breaking, but it's worth surfacing post-spinner so users know they're not on the pinned version.

  5. Legacy ~/.quadwork/agentchattr/ clone still created alongside the per-project clone (RC6 in #577). Cosmetic cruft, but confusing during diagnosis.

Environment

  • macOS 12.x (Darwin 21.6.0), Intel x86_64
  • Node 24.15.0, Python 3.14.4 (python.org installer), gh 2.91.0, codex 0.125.0, claude 2.1.119
  • Quadwork 1.13.1 (npx cache ~/.npm/_npx/<hash>/)
  • AC clone HEAD: 0440f5d (chore: bump version to 0.4.0); pin 3e71d42 not checked out
  • Fresh install: ~/.quadwork, all ~/.npm/_npx/*/node_modules/quadwork, and project worktrees all removed before npx quadwork@latest start

Reproduction

  1. rm -rf ~/.quadwork ~/.npm/_npx/*/node_modules/quadwork && rm -rf <project>-{head,dev,re1,re2} worktrees
  2. npx quadwork@latest start
  3. Dashboard → create project, configure agents, save
  4. Watch server log → observe triple auto-reset and agentchattr.log showing two startup banners interleaved
  5. In chat, send a single @head ping dev, re1, re2 message → observe ghost-suffixed agents replying
  6. Click Restart in dashboard server controls → observe AC log ends at Finished server process with no new startup banner; lsof -iTCP:8300 empty; [#565] flood resumes

Suggested fix order (smallest blast radius first)

  1. Add waitForAgentChattrReady after every spawn() of AC (wizard init, server spawnChattr, restart endpoint). This single change probably eliminates Symptoms 1 and 2 by preventing the false-down detection that triggers the auto-reset cascade.
  2. Audit the /api/agentchattr/<id>/restart endpoint to confirm it actually invokes spawnChattr after killProcessOnPort and waitForPortFree. (Symptom 3.)
  3. Dedupe auto-reset triggers — at most one should fire per AC restart event.
  4. Carry over the secondary recommendations from #577 (loud pin-failure warning, legacy clone cleanup).
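Fix 3 could be as small as routing every trigger path through a per-project in-flight guard. A sketch, where `requestAutoReset` and the trigger wiring are hypothetical names, not QuadWork's actual API:

```javascript
// Collapse concurrent auto-reset triggers (health-monitor restart, startup
// migrations, bridge-migrate path) into one reset per AC restart event.
const resetInFlight = new Map(); // projectId -> Promise for the in-progress reset

function requestAutoReset(projectId, doReset) {
  if (resetInFlight.has(projectId)) {
    // A reset is already running for this project: piggyback on it
    // instead of firing a duplicate "auto-reset 4 agent(s)".
    return resetInFlight.get(projectId);
  }
  const p = Promise.resolve()
    .then(() => doReset(projectId))
    .finally(() => resetInFlight.delete(projectId)); // allow the next genuine restart
  resetInFlight.set(projectId, p);
  return p;
}
```

All three trigger paths call `requestAutoReset` instead of resetting directly; concurrent callers share the same promise, so one logical restart produces exactly one reset (and one log line).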
