Skip to content

fix: resolve captureTurn when app-server disconnects mid-turn#243

Open
ruan330 wants to merge 1 commit intoopenai:mainfrom
ruan330:fix/capture-turn-app-server-disconnect
Open

fix: resolve captureTurn when app-server disconnects mid-turn#243
ruan330 wants to merge 1 commit intoopenai:mainfrom
ruan330:fix/capture-turn-app-server-disconnect

Conversation

@ruan330
Copy link
Copy Markdown

@ruan330 ruan330 commented Apr 18, 2026

Summary

captureTurn hangs indefinitely when the Codex app-server disconnects before sending turn/completed or a final_answer-phase agentMessage. The companion process exits via IPC EOF bypassing runTrackedJob's try/catch, leaving a zombie job record (status: "running") and no final verdict. In broker mode the hang is worse: the broker stays alive with no upstream, so clients get neither turn/completed nor a socket EOF.

This PR adds a watchdog on both transports, preserves the existing inferred-success semantics when Codex disconnects after a completed final_answer, and covers the lifecycle with three regression tests.

Observed in the wild

/codex:adversarial-review sessions on a user workstation hung three times in a row on 2026-04-18. Logs showed Codex CLI emitted exactly one agentMessage (phase plan), ran ~15 shell commands, then exited cleanly without turn/completed. Claude Code reported exit code 0, but no final verdict was printed, /codex:status reported the job as running indefinitely, and /codex:result returned No job found.

Root cause is not adversarial-review specific — any runAppServerTurn caller shaped like this hangs. /codex:review uses runAppServerReview and is unaffected.

Root cause

captureTurn blocks on state.completion which only resolves via:

A. turn/completed notification with matching threadId, or
B. finalAnswerSeen + scheduleInferredCompletion (fires 250ms after an agentMessage with phase === "final_answer")

Both paths assume the Codex CLI cooperates and emits a terminal signal before disconnecting. When the CLI exits mid-turn (rate limit / OOM / internal error / token limit), neither triggers — state.completion hangs forever.

In broker mode, app-server-broker.mjs forwards notifications from appClient to socket clients but does not react to appClient.exitPromise. A dead upstream leaves the broker idle-alive; clients see neither a completion event nor a socket close.

Fix (three coupled parts, one semantic unit)

1. captureTurn watchdog with success preservation (plugins/codex/scripts/lib/codex.mjs)

Subscribe to client.exitPromise as a third resolution path. If a final_answer agentMessage was already captured AND no subagent work is outstanding, honor the existing inferred-success semantics — call completeTurn(state, null, { inferred: true }) without waiting the 250ms debounce (meaningless on a dead socket). Only when no terminal signal was captured do we synthesize a failed turn so buildResultStatus returns non-zero.

This alignment is important: an earlier iteration of this PR unconditionally marked disconnect as failed, which regressed the valid case of Codex sending a complete answer and then closing cleanly. The current shape matches scheduleInferredCompletion's authority rule (finalAnswerSeen + no pending work → success).

2. Broker propagates upstream exit (plugins/codex/scripts/app-server-broker.mjs)

Subscribe to the upstream appClient.exitPromise. When the Codex CLI dies, tear down the server (close sockets, unlink endpoint + pid files) and process.exit(1). Connected clients receive socket EOF; their captureTurn resolves via #1. Exit status 1 lets a supervisor restart the broker.

3. Broker watchdog skips intentional shutdown

The watchdog in #2 gates on a shuttingDown flag set by shutdown(). Without this guard, intentional teardown paths (broker/shutdown, SIGTERM, SIGINT) all call shutdown(server)appClient.close() which resolves the same exitPromise the watchdog listens on — re-entering shutdown() and racing the intentional process.exit(0) with process.exit(1). This race turned every graceful shutdown into a spurious crash exit.

Why the obvious small fixes were rejected

  • Relax finalAnswerSeen to any lifecycle === "completed" agentMessage: changes semantics for well-behaved turns, can commit prematurely on intermediate plan-phase messages.
  • Only a max-turn timeout: loses lastAgentMessage diagnostic, fires on legitimately long turns, and doesn't catch fast disconnects.
  • Broker-side timeout alone: doesn't help direct-transport callers.
  • Unconditional disconnect → failed: regresses the final_answer-then-exit success case (see Add Gemini CLI extension commands #1 above).

Tests

Three regression tests in tests/runtime.test.mjs driven by new fake behaviors in tests/fake-codex-fixture.mjs:

Behavior Scenario Assertion
disconnect-before-completion plan-phase agentMessage, then exit — no final_answer adversarial-review exits non-zero + output mentions disconnect
final-answer-then-exit completed final_answer agentMessage, then exit — no turn/completed exit 0 + structured findings preserved
broker exits cleanly on broker/shutdown non-detached broker receives broker/shutdown RPC exit 0, no signal (proves watchdog doesn't trip during graceful teardown)

All three new tests pass. Existing-test pass rate on the same machine is identical with vs without this patch (i.e. the patch introduces zero new failures) — CI on this PR will be authoritative for the absolute pass count.

Review cycle

Two rounds of Codex adversarial review on the fix itself caught two high-severity races that the first iteration missed:

  1. Broker shutdown raceshutdown() resolves the same exitPromise the watchdog listens on, re-entering and forcing exit(1) during SIGTERM / broker/shutdown. Flagged by GitHub Codex inline comment + /codex:adversarial-review first pass; addressed by Review fails due to unknown sandboxing variant #3.
  2. Disconnect overrides success — the initial exitPromise handler unconditionally marked failed, overriding scheduleInferredCompletion's valid final_answer success path when disconnect raced the 250ms debounce. Flagged by /codex:adversarial-review second pass; addressed by Add Gemini CLI extension commands #1.

All findings addressed. The final squashed commit (14123d3) received approve verdict from a third adversarial review pass.

Upstream pointer

Why the Codex CLI can exit mid-turn without flushing turn/completed is orthogonal to this PR and belongs in openai/codex. The watchdog here makes the companion resilient regardless.

Test plan

  • All 3 new regression tests pass
  • Patch introduces zero new failures relative to baseline on this machine
  • Verified real-world hang reproduces in 10s via fake, fixed after patch

@ruan330 ruan330 requested a review from a team April 18, 2026 10:08
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b9b4d141c8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +125 to +126
appClient.exitPromise.then(() => {
shutdown(server).catch(() => {}).finally(() => process.exit(1));
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Ignore intentional broker shutdowns in exit watcher

appClient.exitPromise resolves on both upstream crashes and normal appClient.close() calls. In this script, all intentional termination paths (broker/shutdown, SIGTERM, and SIGINT) call shutdown(server), which closes appClient, so this watcher also triggers and runs shutdown(server) again before forcing process.exit(1). That can turn graceful shutdowns into crash exits and cause shutdown reentrancy/races during normal broker teardown.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 14123d3. Added a shuttingDown flag set by shutdown() and checked in the watchdog, so intentional teardown paths no longer race. Regression test: broker exits cleanly on broker/shutdown without tripping the upstream-exit watchdog.

Root cause
----------
`captureTurn` (plugins/codex/scripts/lib/codex.mjs) blocks on
`state.completion`. Two existing resolution paths both require server
cooperation:

  A. `turn/completed` notification from Codex app-server
  B. `finalAnswerSeen + scheduleInferredCompletion` (fires 250ms after
     an `agentMessage` with `phase === "final_answer"`)

When the Codex CLI exits mid-turn — rate limit, OOM, internal error,
token limit — neither fires: the only `agentMessage` had `phase ==
"plan"` so `finalAnswerSeen` stayed `false`, and no `turn/completed`
was ever sent. `state.completion` hangs indefinitely. The node
companion process itself exits cleanly via IPC EOF, bypassing
`runTrackedJob`'s try/catch, leaving a zombie job record
(`status: "running"`) and no final verdict. Observed three times in
the wild on 2026-04-18 while running `/codex:adversarial-review` from
Claude Code.

In broker mode (`app-server-broker.mjs`) the bug was worse: the broker
did not watch the upstream `appClient`, so when the Codex CLI died the
broker kept accepting socket connections. Clients got neither
`turn/completed` nor a socket EOF.

Fix (three coupled changes, one semantic unit)
-----------------------------------------------
1. `captureTurn` subscribes to `client.exitPromise` as a third
   resolution path. If a `final_answer` agentMessage has already been
   captured AND no subagent work is outstanding, honor the existing
   inferred-success semantics — call `completeTurn(state, null, {
   inferred: true })` without the 250ms debounce, which is meaningless
   on a dead socket. Only when no terminal signal was captured do we
   synthesize a `failed` turn so `buildResultStatus` returns non-zero.

2. `app-server-broker.mjs` subscribes to the upstream
   `appClient.exitPromise` and tears down the server when the Codex
   CLI dies. Connected clients receive socket EOF and the watchdog in
   openai#1 resolves their `captureTurn`. Broker exits with status 1 so
   supervisors can restart it.

3. The broker watchdog skips when `shuttingDown` is true. Without this
   guard, intentional teardown paths (`broker/shutdown`, `SIGTERM`,
   `SIGINT`) all call `shutdown(server)` → `appClient.close()` which
   resolves the same `exitPromise`, re-entering `shutdown()` and
   racing `process.exit(0)` with `process.exit(1)`.

Why the obvious small fixes were rejected
------------------------------------------
  - Relax `finalAnswerSeen` to any `lifecycle === "completed"`
    agentMessage: changes semantics for well-behaved turns; can commit
    prematurely on intermediate plan-phase messages.
  - Only a max-turn timeout: loses `lastAgentMessage` diagnostic, fires
    on legitimately long turns, and doesn't catch fast disconnects.
  - Broker-side timeout alone: doesn't help direct-transport callers.

Tests
-----
Three regression tests in `tests/runtime.test.mjs` against new fake
behaviors in `tests/fake-codex-fixture.mjs`:

  * `adversarial review resolves when app-server disconnects before
    turn/completed` — plan-phase agentMessage then disconnect; asserts
    non-zero exit + rendered output mentions the disconnect.
  * `adversarial review preserves final_answer when app-server exits
    before turn/completed` — final_answer agentMessage then
    disconnect; asserts exit 0 + structured findings survive.
  * `broker exits cleanly on broker/shutdown without tripping the
    upstream-exit watchdog` — sends `broker/shutdown` RPC to a
    non-detached broker, asserts exit 0.

Baseline suite: 88 tests, 82 pass, 6 fail pre-existing (status/result
caching in test state — unrelated). With this patch: 90 tests, 85 pass,
6 fail. +3 new passing tests, no regression.

Upstream pointer
----------------
Why the Codex CLI can exit mid-turn without flushing `turn/completed`
is orthogonal to this PR and belongs in `openai/codex`. The watchdog
here makes the companion resilient regardless.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ruan330 ruan330 force-pushed the fix/capture-turn-app-server-disconnect branch from b9b4d14 to 14123d3 Compare April 18, 2026 11:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant