fix: resolve captureTurn when app-server disconnects mid-turn by ruan330 · Pull Request #243 · openai/codex-plugin-cc

ruan330 · 2026-04-18T10:08:15Z

Summary

captureTurn hangs indefinitely when the Codex app-server disconnects before sending turn/completed or a final_answer-phase agentMessage. The companion process exits via IPC EOF bypassing runTrackedJob's try/catch, leaving a zombie job record (status: "running") and no final verdict. In broker mode the hang is worse: the broker stays alive with no upstream, so clients get neither turn/completed nor a socket EOF.

This PR adds a watchdog on both transports, preserves the existing inferred-success semantics when Codex disconnects after a completed final_answer, and covers the lifecycle with three regression tests.

Observed in the wild

/codex:adversarial-review sessions on a user workstation hung three times in a row on 2026-04-18. Logs showed Codex CLI emitted exactly one agentMessage (phase plan), ran ~15 shell commands, then exited cleanly without turn/completed. Claude Code reported exit code 0, but no final verdict was printed, /codex:status reported the job as running indefinitely, and /codex:result returned No job found.

Root cause is not adversarial-review specific — any runAppServerTurn caller shaped like this hangs. /codex:review uses runAppServerReview and is unaffected.

Root cause

captureTurn blocks on state.completion which only resolves via:

A. turn/completed notification with matching threadId, or
B. finalAnswerSeen + scheduleInferredCompletion (fires 250ms after an agentMessage with phase === "final_answer")

Both paths assume the Codex CLI cooperates and emits a terminal signal before disconnecting. When the CLI exits mid-turn (rate limit / OOM / internal error / token limit), neither triggers — state.completion hangs forever.

In broker mode, app-server-broker.mjs forwards notifications from appClient to socket clients but does not react to appClient.exitPromise. A dead upstream leaves the broker idle-alive; clients see neither a completion event nor a socket close.

Fix (three coupled parts, one semantic unit)

1. `captureTurn` watchdog with success preservation (`plugins/codex/scripts/lib/codex.mjs`)

Subscribe to client.exitPromise as a third resolution path. If a final_answer agentMessage was already captured AND no subagent work is outstanding, honor the existing inferred-success semantics — call completeTurn(state, null, { inferred: true }) without waiting the 250ms debounce (meaningless on a dead socket). Only when no terminal signal was captured do we synthesize a failed turn so buildResultStatus returns non-zero.

This alignment is important: an earlier iteration of this PR unconditionally marked disconnect as failed, which regressed the valid case of Codex sending a complete answer and then closing cleanly. The current shape matches scheduleInferredCompletion's authority rule (finalAnswerSeen + no pending work → success).

2. Broker propagates upstream exit (`plugins/codex/scripts/app-server-broker.mjs`)

Subscribe to the upstream appClient.exitPromise. When the Codex CLI dies, tear down the server (close sockets, unlink endpoint + pid files) and process.exit(1). Connected clients receive socket EOF; their captureTurn resolves via #1. Exit status 1 lets a supervisor restart the broker.

3. Broker watchdog skips intentional shutdown

The watchdog in #2 gates on a shuttingDown flag set by shutdown(). Without this guard, intentional teardown paths (broker/shutdown, SIGTERM, SIGINT) all call shutdown(server) → appClient.close() which resolves the same exitPromise the watchdog listens on — re-entering shutdown() and racing the intentional process.exit(0) with process.exit(1). This race turned every graceful shutdown into a spurious crash exit.

Why the obvious small fixes were rejected

Relax finalAnswerSeen to any lifecycle === "completed" agentMessage: changes semantics for well-behaved turns, can commit prematurely on intermediate plan-phase messages.
Only a max-turn timeout: loses lastAgentMessage diagnostic, fires on legitimately long turns, and doesn't catch fast disconnects.
Broker-side timeout alone: doesn't help direct-transport callers.
Unconditional disconnect → failed: regresses the final_answer-then-exit success case (see Add Gemini CLI extension commands #1 above).

Tests

Three regression tests in tests/runtime.test.mjs driven by new fake behaviors in tests/fake-codex-fixture.mjs:

Behavior	Scenario	Assertion
`disconnect-before-completion`	plan-phase agentMessage, then exit — no final_answer	`adversarial-review` exits non-zero + output mentions disconnect
`final-answer-then-exit`	completed final_answer agentMessage, then exit — no `turn/completed`	exit 0 + structured findings preserved
`broker exits cleanly on broker/shutdown`	non-detached broker receives `broker/shutdown` RPC	exit 0, no signal (proves watchdog doesn't trip during graceful teardown)

All three new tests pass. Existing-test pass rate on the same machine is identical with vs without this patch (i.e. the patch introduces zero new failures) — CI on this PR will be authoritative for the absolute pass count.

Review cycle

Two rounds of Codex adversarial review on the fix itself caught two high-severity races that the first iteration missed:

Broker shutdown race — shutdown() resolves the same exitPromise the watchdog listens on, re-entering and forcing exit(1) during SIGTERM / broker/shutdown. Flagged by GitHub Codex inline comment + /codex:adversarial-review first pass; addressed by Review fails due to unknown sandboxing variant #3.
Disconnect overrides success — the initial exitPromise handler unconditionally marked failed, overriding scheduleInferredCompletion's valid final_answer success path when disconnect raced the 250ms debounce. Flagged by /codex:adversarial-review second pass; addressed by Add Gemini CLI extension commands #1.

All findings addressed. The final squashed commit (14123d3) received approve verdict from a third adversarial review pass.

Upstream pointer

Why the Codex CLI can exit mid-turn without flushing turn/completed is orthogonal to this PR and belongs in openai/codex. The watchdog here makes the companion resilient regardless.

Test plan

All 3 new regression tests pass
Patch introduces zero new failures relative to baseline on this machine
Verified real-world hang reproduces in 10s via fake, fixed after patch

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b9b4d141c8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-18T10:14:35Z

+  appClient.exitPromise.then(() => {
+    shutdown(server).catch(() => {}).finally(() => process.exit(1));


Ignore intentional broker shutdowns in exit watcher

appClient.exitPromise resolves on both upstream crashes and normal appClient.close() calls. In this script, all intentional termination paths (broker/shutdown, SIGTERM, and SIGINT) call shutdown(server), which closes appClient, so this watcher also triggers and runs shutdown(server) again before forcing process.exit(1). That can turn graceful shutdowns into crash exits and cause shutdown reentrancy/races during normal broker teardown.

Useful? React with 👍 / 👎.

Fixed in 14123d3. Added a shuttingDown flag set by shutdown() and checked in the watchdog, so intentional teardown paths no longer race. Regression test: broker exits cleanly on broker/shutdown without tripping the upstream-exit watchdog.

Root cause ---------- `captureTurn` (plugins/codex/scripts/lib/codex.mjs) blocks on `state.completion`. Two existing resolution paths both require server cooperation: A. `turn/completed` notification from Codex app-server B. `finalAnswerSeen + scheduleInferredCompletion` (fires 250ms after an `agentMessage` with `phase === "final_answer"`) When the Codex CLI exits mid-turn — rate limit, OOM, internal error, token limit — neither fires: the only `agentMessage` had `phase == "plan"` so `finalAnswerSeen` stayed `false`, and no `turn/completed` was ever sent. `state.completion` hangs indefinitely. The node companion process itself exits cleanly via IPC EOF, bypassing `runTrackedJob`'s try/catch, leaving a zombie job record (`status: "running"`) and no final verdict. Observed three times in the wild on 2026-04-18 while running `/codex:adversarial-review` from Claude Code. In broker mode (`app-server-broker.mjs`) the bug was worse: the broker did not watch the upstream `appClient`, so when the Codex CLI died the broker kept accepting socket connections. Clients got neither `turn/completed` nor a socket EOF. Fix (three coupled changes, one semantic unit) ----------------------------------------------- 1. `captureTurn` subscribes to `client.exitPromise` as a third resolution path. If a `final_answer` agentMessage has already been captured AND no subagent work is outstanding, honor the existing inferred-success semantics — call `completeTurn(state, null, { inferred: true })` without the 250ms debounce, which is meaningless on a dead socket. Only when no terminal signal was captured do we synthesize a `failed` turn so `buildResultStatus` returns non-zero. 2. `app-server-broker.mjs` subscribes to the upstream `appClient.exitPromise` and tears down the server when the Codex CLI dies. Connected clients receive socket EOF and the watchdog in openai#1 resolves their `captureTurn`. Broker exits with status 1 so supervisors can restart it. 3. The broker watchdog skips when `shuttingDown` is true. Without this guard, intentional teardown paths (`broker/shutdown`, `SIGTERM`, `SIGINT`) all call `shutdown(server)` → `appClient.close()` which resolves the same `exitPromise`, re-entering `shutdown()` and racing `process.exit(0)` with `process.exit(1)`. Why the obvious small fixes were rejected ------------------------------------------ - Relax `finalAnswerSeen` to any `lifecycle === "completed"` agentMessage: changes semantics for well-behaved turns; can commit prematurely on intermediate plan-phase messages. - Only a max-turn timeout: loses `lastAgentMessage` diagnostic, fires on legitimately long turns, and doesn't catch fast disconnects. - Broker-side timeout alone: doesn't help direct-transport callers. Tests ----- Three regression tests in `tests/runtime.test.mjs` against new fake behaviors in `tests/fake-codex-fixture.mjs`: * `adversarial review resolves when app-server disconnects before turn/completed` — plan-phase agentMessage then disconnect; asserts non-zero exit + rendered output mentions the disconnect. * `adversarial review preserves final_answer when app-server exits before turn/completed` — final_answer agentMessage then disconnect; asserts exit 0 + structured findings survive. * `broker exits cleanly on broker/shutdown without tripping the upstream-exit watchdog` — sends `broker/shutdown` RPC to a non-detached broker, asserts exit 0. Baseline suite: 88 tests, 82 pass, 6 fail pre-existing (status/result caching in test state — unrelated). With this patch: 90 tests, 85 pass, 6 fail. +3 new passing tests, no regression. Upstream pointer ---------------- Why the Codex CLI can exit mid-turn without flushing `turn/completed` is orthogonal to this PR and belongs in `openai/codex`. The watchdog here makes the companion resilient regardless. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ruan330 requested a review from a team April 18, 2026 10:08

chatgpt-codex-connector bot reviewed Apr 18, 2026

View reviewed changes

ruan330 force-pushed the fix/capture-turn-app-server-disconnect branch from b9b4d14 to 14123d3 Compare April 18, 2026 11:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: resolve captureTurn when app-server disconnects mid-turn#243

fix: resolve captureTurn when app-server disconnects mid-turn#243
ruan330 wants to merge 1 commit intoopenai:mainfrom
ruan330:fix/capture-turn-app-server-disconnect

ruan330 commented Apr 18, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

chatgpt-codex-connector bot Apr 18, 2026

Uh oh!

ruan330 Apr 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		appClient.exitPromise.then(() => {
		shutdown(server).catch(() => {}).finally(() => process.exit(1));

Conversation

ruan330 commented Apr 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Observed in the wild

Root cause

Fix (three coupled parts, one semantic unit)

1. captureTurn watchdog with success preservation (plugins/codex/scripts/lib/codex.mjs)

2. Broker propagates upstream exit (plugins/codex/scripts/app-server-broker.mjs)

3. Broker watchdog skips intentional shutdown

Why the obvious small fixes were rejected

Tests

Review cycle

Upstream pointer

Test plan

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

ruan330 Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ruan330 commented Apr 18, 2026 •

edited

Loading

1. `captureTurn` watchdog with success preservation (`plugins/codex/scripts/lib/codex.mjs`)

2. Broker propagates upstream exit (`plugins/codex/scripts/app-server-broker.mjs`)