
Fresh 1.13.1 install: AC startup race → triple auto-reset → ghost agent identities (head-N, re1-N) + Restart button leaves AC dead #579


Description

@realproject7


Context

Builds on #577 (--config flag fix, shipped in 1.13.1). The argparse error is gone — AC now starts and binds 8300. But on a clean install (rm -rf ~/.quadwork ~/.npm/_npx/*/*/quadwork, then npx quadwork@latest start → create project via dashboard), three layered issues surface that compound into a broken chat experience.

Symptom 1 — startup race triggers triple auto-reset

Server log on first project creation:

QuadWork server listening on http://127.0.0.1:8400
[health] AC health monitor started (30s interval)
Cloning into '/Users/<user>/Projects/<project>'...
Preparing worktree (checking out 'worktree-head')
Preparing worktree (checking out 'worktree-re1')
Preparing worktree (checking out 'worktree-re2')
Preparing worktree (checking out 'worktree-dev')
[snapshot] <project> history fetch returned 502; skipping snapshot
[snapshot] <project> history fetch returned 502; skipping snapshot
[health] AC for <project> on port 8300 is down (failure 1/3) — auto-restarting
[snapshot] <project> history fetch returned 502; skipping snapshot
[health] AC for <project> auto-restarted (PID: <n>)
[agentchattr] <project> auto-reset 4 agent(s) after AC restart
[agentchattr] <project> auto-reset 4 agent(s) after AC restart
[agentchattr] <project> auto-reset 4 agent(s) after AC restart
[health] AC for <project> recovered (port 8300 alive)

What happens (timeline):

  1. Worktrees clone successfully.
  2. Dashboard fetches AC history → 502 (AC not bound yet).
  3. Health monitor's first 30s tick sees port 8300 down → auto-restarts AC.
  4. auto-reset 4 agent(s) fires three times in a row. Each reset re-registers all four agents via /api/register against AC.
  5. AC eventually recovers.

The triple-fire of auto-reset is the trigger for Symptom 2.
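The cascade hinges on steps 2–3: nothing gates the first health check on AC actually binding the port. A minimal sketch of such a readiness gate, assuming a hypothetical `waitForAgentChattrReady` helper (the probe is injected here so the loop is testable; in QuadWork it would be a TCP connect or HTTP GET against 127.0.0.1:8300):

```javascript
// Sketch of the missing readiness gate (hypothetical helper; the real spawn
// sites are bin/quadwork.js and server/index.js's spawnChattr). `probe` is
// injected so the loop can run without a live server; in production it would
// check whether AC has bound the port.
async function waitForAgentChattrReady(port, timeoutMs, probe, intervalMs = 250) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await probe(port)) return true;            // AC is up: safe to fetch history
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  return false; // caller surfaces an error instead of letting health-check auto-restart
}

// Simulated AC that only answers after a few probes, mimicking the
// template-load / MCP-server / FastAPI-lifespan delay described above.
function makeSlowProbe(readyAfter) {
  let calls = 0;
  return async () => ++calls >= readyAfter;
}
```

With this in place, the dashboard's history fetch and the health monitor would only run once the gate resolves, so the false-down detection in step 3 never happens.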

Symptom 2 — ghost agent identity proliferation (most user-visible)

After the cascade above, AC's registry has accumulated a long list of suffixed identities. Evidence from one session's agentchattr.log:

  • 597 total 409 Conflict heartbeat responses across the log
  • Distinct ghost identities observed: head-2 … head-13, dev-1 … dev-10, re1-1 … re1-13, re2-2 … re2-10
  • Per-identity failed-heartbeat counts: 13–18 for each ghost slot

This becomes user-visible in the AGENTCHATTR primary chat: a single user prompt (@head ping dev, re1, and re2 to check if they're online) produced ~25 reply messages, most from suffixed ghost agents searching for mentions addressed to their own suffix and finding none. Excerpt (paraphrased, names preserved):

head-13: I checked #general for @head-2 mentions and found none. The only current message is addressed to @Head, so I am not taking action on it as @head-2.

head-7: I attempted to send as head-3, but AgentChattr reports that session as stale and will not let me reclaim it because head-3 is already claimed.

re1-12: @user I re-read #general for @re1-3. The only @re1-3 occurrence is the prior status note saying no direct actionable mention was found.

The agents are confused about their own identity, repeatedly searching for mentions that don't exist, and answering as ghost slots. The user-prompt-to-noise ratio is roughly 1:25.

Likely mechanism

  • AC's slot system suffixes a registration when the canonical name is already claimed (head-2 if head is taken).
  • Quadwork's auto-restart path tries to deregister the old slot and re-register with force: true to claim slot 1 (per code comments around server/index.js:380-388, 417-419 and #478).
  • During the triple-fire cascade, the deregister-then-register sequence isn't reliably claiming slot 1 — every retry pushes the suffix counter up by one.
  • The agents keep heartbeating as their previously-acquired ghost slot, so old suffixed identities stay alive in AC's registry until their 120s crash timeout (per [idle-fix] increased crash timeout to 120s (#502)).
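The suffix walk can be illustrated with a toy registry. This is not AC's real code, just a model of the claim/suffix behaviour described above: if the canonical slot stays claimed (a live heartbeat, or a deregister that targeted the wrong slot), every registration retry mints a fresh ghost:

```javascript
// Toy model of AC's suffixing slot system -- purely illustrative.
class SlotRegistry {
  constructor() {
    this.claimed = new Set();
    this.counters = new Map(); // canonical name -> last suffix handed out
  }
  register(name) {
    if (!this.claimed.has(name)) {
      this.claimed.add(name);
      return name; // canonical slot was free
    }
    // Canonical slot taken: hand out the next suffixed identity (head-2, head-3, ...)
    const n = (this.counters.get(name) || 1) + 1;
    this.counters.set(name, n);
    const slot = `${name}-${n}`;
    this.claimed.add(slot);
    return slot;
  }
  deregister(slot) {
    this.claimed.delete(slot);
  }
}

const reg = new SlotRegistry();
reg.register('head'); // the stuck claim the restart path fails to clear
const ghosts = [];
for (let i = 0; i < 3; i++) {
  // Each auto-reset retries registration; nothing ever frees 'head',
  // so each retry walks the counter upward.
  ghosts.push(reg.register('head'));
}
// ghosts → ['head-2', 'head-3', 'head-4']
```

Three auto-reset firings times four agents is exactly the shape of the `head-2 … head-13`-style accumulation in the log.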

Symptom 3 — Dashboard "Restart server and agents" leaves AC dead

After Symptom 2 manifested, clicking the Restart button (server controls) produces:

[agentchattr] <project> auto-reset 4 agent(s) after AC restart
[agentchattr] <project> auto-reset 4 agent(s) after AC restart
[agentchattr] <project> auto-reset 4 agent(s) after AC restart
[health] AC for <project> recovered (port 8300 alive)
[#565] Agent head: AC not reachable on port 8300 after 30s. Spawning without chat integration.
[#565] Agent re1:  AC not reachable on port 8300 after 30s. Spawning without chat integration.
[#565] Agent head: AC not reachable after 60s — health monitor will restart agent when AC recovers.
[#565] Agent re2:  AC not reachable on port 8300 after 30s. Spawning without chat integration.
[#565] Agent re1:  AC not reachable after 60s — health monitor will restart agent when AC recovers.
[#565] Agent dev:  AC not reachable on port 8300 after 30s. Spawning without chat integration.
[#565] Agent re2:  AC not reachable after 60s — health monitor will restart agent when AC recovers.
[#565] Agent dev:  AC not reachable after 60s — health monitor will restart agent when AC recovers.

End state observed via lsof -iTCP:8300 and ps aux | grep run.py:

  • Nothing listening on 8300
  • No run.py Python process
  • AC log ends with: INFO: Shutting down / Waiting for application shutdown / Application shutdown complete / Finished server process [<pid>]

So the Restart action kills AC successfully but never re-spawns it. Agents are reset and try to register against a dead port → 30s timeout → fall back to no-chat → 60s deferred wait also fails.
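For reference, the sequence the restart handler presumably should perform looks like the sketch below. Function names (`killProcessOnPort`, `spawnChattr`, etc.) are taken from this report, not verified against QuadWork's source, and all steps are injected so the ordering can be exercised without real processes:

```javascript
// Hypothetical shape of a correct restart sequence. Symptom 3 suggests the
// real handler runs the first step but never reaches spawnChattr.
async function restartAgentChattr(port, { killProcessOnPort, waitForPortFree, spawnChattr, waitForReady }) {
  await killProcessOnPort(port);         // works today: "Finished server process" is logged
  await waitForPortFree(port);           // don't spawn into a still-bound socket
  const child = await spawnChattr(port); // the step that appears to be skipped
  if (!(await waitForReady(port))) {
    // Surface the failure instead of letting agents time out against a dead port.
    throw new Error(`AC did not bind ${port} after restart`);
  }
  return child;
}

// Stubs that record call order, for checking the sequencing.
const calls = [];
const stub = (name, ret) => async () => { calls.push(name); return ret; };
```

A final readiness wait here would also convert the current silent failure into a loggable error rather than the `[#565]` timeout flood.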

Suspected root causes

These three symptoms compound from issues already noted as out-of-scope follow-ups in #577 — but together they make the chat unusable on a fresh install, so they're worth promoting to first-class bugs:

  1. No waitForAgentChattrReady after spawn (RC2 in #577). Both bin/quadwork.js (around the ok("AgentChattr started") line) and server/index.js's spawnChattr declare success the moment spawn() returns a PID, but AC takes several seconds to actually bind 8300 (templates load, MCP servers start, the FastAPI lifespan runs). The dashboard's history-snapshot fetch and the health monitor both check before that's done → 502 + false-down → an unnecessary auto-restart. A simple await waitForAgentChattrReady(8300, 30000) after spawn would prevent the entire cascade.

  2. Auto-reset triggered from multiple paths. [agentchattr] auto-reset 4 agent(s) firing three times for one logical AC restart suggests duplicate triggers — possibly the health-monitor restart endpoint's chained reset (#447) plus the startup migrations ([bridge-migrate] restarted AC, [ghost-fix] patched registry.py) all firing resets independently. Worth deduping.

  3. Restart endpoint's spawn-new step is failing or skipped. The kill half works (Finished server process is logged); the spawn-new half doesn't appear to run, or it runs and the new AC dies silently. Since [#573] Separate install wizard from server launch (#574-576) added stdio capture, an AC startup failure should show up in agentchattr.log, but the log just ends with shutdown messages and no new startup banner, so the spawn step likely never executed. Worth tracing the /api/agentchattr/<id>/restart handler in server/routes.js and verifying it actually calls spawnChattr after killProcessOnPort.

  4. Pin checkout silently failing (RC5 in #577, still not addressed). The AC clone's HEAD is at main (currently 0440f5d), not the pinned commit 3e71d42; the pinned branch doesn't exist locally on this install. The console.warn at bin/quadwork.js:272-273 is buried mid-spinner during install. Now that --config is fixed, HEAD happens to work, so this isn't breaking, but it's worth surfacing post-spinner so users know they're not on the pinned version.

  5. Legacy ~/.quadwork/agentchattr/ clone still created alongside the per-project clone (RC6 in #577). Cosmetic cruft, but confusing during diagnosis.

Environment

  • macOS 12.x (Darwin 21.6.0), Intel x86_64
  • Node 24.15.0, Python 3.14.4 (python.org installer), gh 2.91.0, codex 0.125.0, claude 2.1.119
  • Quadwork 1.13.1 (npx cache ~/.npm/_npx/<hash>/)
  • AC clone HEAD: 0440f5d (chore: bump version to 0.4.0); pin 3e71d42 not checked out
  • Fresh install: ~/.quadwork, all ~/.npm/_npx/*/node_modules/quadwork, and project worktrees all removed before npx quadwork@latest start

Reproduction

  1. rm -rf ~/.quadwork ~/.npm/_npx/*/node_modules/quadwork && rm -rf <project>-{head,dev,re1,re2} worktrees
  2. npx quadwork@latest start
  3. Dashboard → create project, configure agents, save
  4. Watch server log → observe triple auto-reset and agentchattr.log showing two startup banners interleaved
  5. In chat, send a single @head ping dev, re1, re2 message → observe ghost-suffixed agents replying
  6. Click Restart in dashboard server controls → observe AC log ends at Finished server process with no new startup banner; lsof -iTCP:8300 empty; [#565] flood resumes

Suggested fix order (smallest blast radius first)

  1. Add waitForAgentChattrReady after every spawn() of AC (wizard init, server spawnChattr, restart endpoint). This single change probably eliminates Symptoms 1 and 2 by preventing the false-down detection that triggers the auto-reset cascade.
  2. Audit the /api/agentchattr/<id>/restart endpoint to confirm it actually invokes spawnChattr after killProcessOnPort and waitForPortFree. (Symptom 3.)
  3. Dedupe auto-reset triggers — at most one should fire per AC restart event.
  4. Carry over the secondary recommendations from #577 (loud pin-failure warning, legacy clone cleanup).
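Fix 3 could be as small as routing every trigger path through a per-project in-flight guard. A sketch, where `requestAutoReset` and the trigger wiring are hypothetical names, not QuadWork's actual API:

```javascript
// Collapse concurrent auto-reset triggers (health-monitor restart, startup
// migrations, bridge-migrate path) into one reset per AC restart event.
const resetInFlight = new Map(); // projectId -> Promise for the in-progress reset

function requestAutoReset(projectId, doReset) {
  if (resetInFlight.has(projectId)) {
    // A reset is already running for this project: piggyback on it
    // instead of firing a duplicate "auto-reset 4 agent(s)".
    return resetInFlight.get(projectId);
  }
  const p = Promise.resolve()
    .then(() => doReset(projectId))
    .finally(() => resetInFlight.delete(projectId)); // allow the next genuine restart
  resetInFlight.set(projectId, p);
  return p;
}
```

All three trigger paths call `requestAutoReset` instead of resetting directly; concurrent callers share the same promise, so one logical restart produces exactly one reset (and one log line).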
