Fresh 1.13.1 install: AC startup race → triple auto-reset cascade → ghost agent identities + Restart button leaves AC dead
Context
Builds on #577 (--config flag fix, shipped in 1.13.1). The argparse error is gone — AC now starts and binds 8300. But on a clean install (rm -rf ~/.quadwork ~/.npm/_npx/*/*/quadwork, then npx quadwork@latest start → create project via dashboard), three layered issues surface that compound into a broken chat experience.

Symptom 1 — startup race triggers triple auto-reset

Server log on first project creation:
QuadWork server listening on http://127.0.0.1:8400
[health] AC health monitor started (30s interval)
Cloning into '/Users/<user>/Projects/<project>'...
Preparing worktree (checking out 'worktree-head')
Preparing worktree (checking out 'worktree-re1')
Preparing worktree (checking out 'worktree-re2')
Preparing worktree (checking out 'worktree-dev')
[snapshot] <project> history fetch returned 502; skipping snapshot
[snapshot] <project> history fetch returned 502; skipping snapshot
[health] AC for <project> on port 8300 is down (failure 1/3) — auto-restarting
[snapshot] <project> history fetch returned 502; skipping snapshot
[health] AC for <project> auto-restarted (PID: <n>)
[agentchattr] <project> auto-reset 4 agent(s) after AC restart
[agentchattr] <project> auto-reset 4 agent(s) after AC restart
[agentchattr] <project> auto-reset 4 agent(s) after AC restart
[health] AC for <project> recovered (port 8300 alive)
What happens (timeline):
Worktrees clone successfully.
Dashboard fetches AC history → 502 (AC not bound yet).
Health monitor's first 30s tick sees port 8300 down → auto-restarts AC.
auto-reset 4 agent(s) fires three times in a row. Each reset re-registers all four agents via /api/register against AC.
AC eventually recovers.
The triple-fire of auto-reset is the trigger for Symptom 2.
Symptom 2 — ghost agent identity proliferation (most user-visible)

After the cascade above, AC's registry has accumulated a long list of suffixed identities. Evidence from one session's agentchattr.log:

409 Conflict heartbeat responses throughout the log
Suffixed identities head-2 … head-13, dev-1 … dev-10, re1-1 … re1-13, re2-2 … re2-10
Per-identity failed-heartbeat counts: 13–18 for each ghost slot
This becomes user-visible in the AGENTCHATTR primary chat: a single user prompt (@head ping dev, re1, and re2 to check if they're online) produced ~25 reply messages, most from suffixed ghost agents searching for mentions addressed to their own suffix and finding none. Excerpt (paraphrased, names preserved):
head-13: I checked #general for @head-2 mentions and found none. The only current message is addressed to @Head, so I am not taking action on it as @head-2.
head-7: I attempted to send as head-3, but AgentChattr reports that session as stale and will not let me reclaim it because head-3 is already claimed.
re1-12: @user I re-read #general for @re1-3. The only @re1-3 occurrence is the prior status note saying no direct actionable mention was found.
The agents are confused about their own identity, repeatedly searching for mentions that don't exist, and answering as ghost slots. The user-prompt-to-noise ratio is roughly 1:25.
Likely mechanism
AC's slot system suffixes a registration when the canonical name is already claimed (head-2 if head is taken).
Quadwork's auto-restart path tries to deregister the old slot and re-register with force: true to claim slot 1 (per code comments around server/index.js:380-388, 417-419 and #478).
During the triple-fire cascade, the deregister-then-register sequence isn't reliably claiming slot 1 — every retry pushes the suffix counter up by one.
The agents keep heartbeating as their previously-acquired ghost slot, so old suffixed identities stay alive in AC's registry until their 120s crash timeout (per [idle-fix] increased crash timeout to 120s (#502)).
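The suffix walk is easy to see in a toy model of the slot behavior described above. This is an assumption about AC's registry semantics inferred from the logs, not its actual code; makeRegistry is invented for illustration:

```javascript
// Toy model of AC's slot suffixing. Assumption: AC hands out name-N when the
// canonical name is still claimed. If the deregister half of the restart path
// fails to free the old slot, every retry walks the suffix counter up by one,
// matching the head-2 … head-13 ladder seen in agentchattr.log.
function makeRegistry() {
  const claimed = new Set();
  return {
    register(name) {
      if (!claimed.has(name)) {
        claimed.add(name);
        return name; // canonical slot free: normal case
      }
      let n = 2;
      while (claimed.has(`${name}-${n}`)) n++; // find next free suffix
      const slot = `${name}-${n}`;
      claimed.add(slot);
      return slot;
    },
    deregister(slot) {
      claimed.delete(slot);
    },
  };
}
```

Under this model, three register retries against a registry where head was never freed yield head-2, head-3, head-4; with three auto-resets per restart across four agents, a double-digit suffix ladder accumulates quickly.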
Symptom 3 — Dashboard "Restart server and agents" leaves AC dead
After Symptom 2 manifested, clicking the Restart button (server controls) produces:
[agentchattr] <project> auto-reset 4 agent(s) after AC restart
[agentchattr] <project> auto-reset 4 agent(s) after AC restart
[agentchattr] <project> auto-reset 4 agent(s) after AC restart
[health] AC for <project> recovered (port 8300 alive)
[#565] Agent head: AC not reachable on port 8300 after 30s. Spawning without chat integration.
[#565] Agent re1: AC not reachable on port 8300 after 30s. Spawning without chat integration.
[#565] Agent head: AC not reachable after 60s — health monitor will restart agent when AC recovers.
[#565] Agent re2: AC not reachable on port 8300 after 30s. Spawning without chat integration.
[#565] Agent re1: AC not reachable after 60s — health monitor will restart agent when AC recovers.
[#565] Agent dev: AC not reachable on port 8300 after 30s. Spawning without chat integration.
[#565] Agent re2: AC not reachable after 60s — health monitor will restart agent when AC recovers.
[#565] Agent dev: AC not reachable after 60s — health monitor will restart agent when AC recovers.
End state observed via lsof -iTCP:8300 and ps aux | grep run.py:
Nothing listening on 8300
No run.py Python process
AC log ends with: INFO: Shutting down / Waiting for application shutdown / Application shutdown complete / Finished server process [<pid>]
So the Restart action kills AC successfully but never re-spawns it. Agents are reset and try to register against a dead port → 30s timeout → fall back to no-chat → 60s deferred wait also fails.
Suspected root causes
These three symptoms compound from issues already noted as out-of-scope follow-ups in #577 — but together they make the chat unusable on a fresh install, so they're worth promoting to first-class bugs:
No waitForAgentChattrReady after spawn (RC2 in #577). Both bin/quadwork.js (around the ok("AgentChattr started") line) and server/index.js's spawnChattr declare success the moment spawn() returns a PID. AC takes several seconds to actually bind 8300 (templates load, MCP servers start, FastAPI lifespan). The dashboard's history-snapshot fetch and the health monitor both check before that's done → 502 + false-down → auto-restart triggered unnecessarily. A simple await waitForAgentChattrReady(8300, 30000) after spawn would prevent the entire cascade.
Auto-reset triggered from multiple paths. [agentchattr] auto-reset 4 agent(s) firing three times for one logical AC restart suggests duplicate triggers — possibly the health-monitor restart endpoint's chained reset (#447) plus the startup migrations ([bridge-migrate] restarted AC, [ghost-fix] patched registry.py) all firing resets independently. Worth deduping.
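One way to dedupe is a per-project in-flight guard so that concurrent triggers collapse into a single reset. This is a sketch; resetAgentsOnce and the callback shape are invented for illustration, not quadwork's actual API:

```javascript
// Per-project in-flight guard: whichever trigger wins runs the reset,
// every concurrent trigger for the same project becomes a no-op.
const resetsInFlight = new Set();

async function resetAgentsOnce(projectId, doReset) {
  if (resetsInFlight.has(projectId)) return false; // a reset is already running
  resetsInFlight.add(projectId);
  try {
    await doReset(projectId);
    return true;
  } finally {
    resetsInFlight.delete(projectId); // allow the next genuine restart event
  }
}
```

The losing paths simply skip, since the winning path's reset already re-registers all four agents.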
Restart endpoint's spawn-new step is failing or skipped. The kill half works (Finished server process logged); the spawn-new half doesn't appear to run, or it runs and the new AC dies silently. Since the install-wizard split (#574-576) added stdio capture, an AC startup failure should show in agentchattr.log — but the log just ends with shutdown messages, no new startup banner. So the spawn step likely never executed. Worth tracing the /api/agentchattr/<id>/restart handler in server/routes.js and verifying it actually calls spawnChattr after killProcessOnPort.
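The audit amounts to checking that the handler has roughly this shape, with the spawn half actually awaited and success gated on readiness. The names are the ones the report cites (killProcessOnPort, waitForPortFree, spawnChattr); the dependency-injected form and waitForReady are for illustration only, not quadwork's real signatures:

```javascript
// Expected ordering for a restart handler: kill, wait for the port to free,
// respawn, and only report success once the new process has bound the port.
// The suspected bug is that everything after the kill step never runs.
async function restartAgentChattr({ port, killProcessOnPort, waitForPortFree, spawnChattr, waitForReady }) {
  await killProcessOnPort(port);         // kill half: observed to work
  await waitForPortFree(port);           // avoid EADDRINUSE on respawn
  const child = await spawnChattr(port); // the step that appears to be skipped
  await waitForReady(port);              // gate success on the port actually binding
  return child;
}
```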
Pin checkout silently failing (RC5 in #577, still not addressed). AC clone HEAD is at main (currently 0440f5d), not the pinned commit 3e71d42. The pinned branch doesn't exist locally on this install. The console.warn at bin/quadwork.js:272-273 is buried mid-spinner during install. Now that --config is fixed, HEAD works, so this isn't breaking — but it's worth surfacing post-spinner so users know they're not on the pinned version.
Legacy ~/.quadwork/agentchattr/ clone still created alongside the per-project clone (RC6 in #577). Cosmetic cruft, but confusing during diagnosis.

Environment

quadwork 1.13.1 via npx (cache under ~/.npm/_npx/<hash>/)
AC clone HEAD at 0440f5d (chore: bump version to 0.4.0); pin 3e71d42 not checked out
Clean slate: ~/.quadwork, all ~/.npm/_npx/*/node_modules/quadwork, and project worktrees all removed before npx quadwork@latest start

Reproduction

rm -rf ~/.quadwork ~/.npm/_npx/*/node_modules/quadwork && rm -rf <project>-{head,dev,re1,re2} worktrees
npx quadwork@latest start
Dashboard → create project, configure agents, save
Watch server log → observe triple auto-reset and agentchattr.log showing two startup banners interleaved
In chat, send a single @head ping dev, re1, re2 message → observe ghost-suffixed agents replying
Click Restart in dashboard server controls → observe AC log ends at Finished server process with no new startup banner; lsof -iTCP:8300 empty; [#565] flood resumes
Suggested fix order (smallest blast radius first)
Add waitForAgentChattrReady after every spawn() of AC (wizard init, server spawnChattr, restart endpoint). This single change probably eliminates Symptoms 1 and 2 by preventing the false-down detection that triggers the auto-reset cascade.
Audit the /api/agentchattr/<id>/restart endpoint to confirm it actually invokes spawnChattr after killProcessOnPort and waitForPortFree. (Symptom 3.)
Dedupe auto-reset triggers — at most one should fire per AC restart event.