Skip to content

Improve Codex multi-agent planning, question flow, and runtime DX#16

Open
ratley wants to merge 19 commits intomasterfrom
codex/release-orca-0.2.25
Open

Improve Codex multi-agent planning, question flow, and runtime DX#16
ratley wants to merge 19 commits intomasterfrom
codex/release-orca-0.2.25

Conversation

@ratley
Copy link
Copy Markdown
Collaborator

@ratley ratley commented Mar 17, 2026

Summary

  • update Orca to the newer Codex app-server/client surface and improve Codex binary selection/runtime diagnostics
  • add end-to-end clarification-question handling: persist pending questions, emit onQuestion, show them in orca status, and resume the same live run from orca answer
  • improve planning/review/execution prompt guidance for multi-agent-aware Codex runs and tighten the README/docs around the current Codex-only flow

Testing

  • bun test src
  • npm run typecheck:tsc
  • npm run build
  • live smoke for waiting_for_answer flow

Dependency

@github-actions
Copy link
Copy Markdown

Review findings

BUG: Cancelled run status reverts to waiting_for_answer

src/agents/codex/session.ts:825-828

if (currentRun.overallStatus === "cancelled") {
  rejectUserInputRequest(request.requestId, `Run ${interactionContext.runId} was cancelled while waiting for input.`);
  await clearPendingQuestion(request.requestId, "waiting_for_answer");
  return;
}

When a run is cancelled while waiting for user input, the code correctly detects the "cancelled" status and rejects the Codex request — but then calls clearPendingQuestion(request.requestId, "waiting_for_answer"), which writes overallStatus: "waiting_for_answer" back to the store, overwriting the "cancelled" status.

clearPendingQuestion (line 764-772) unconditionally sets overallStatus from its second argument:

await interactionContext.store.updateRun(interactionContext.runId, {
  overallStatus,          // ← "waiting_for_answer" overwrites "cancelled"
  pendingQuestion: undefined,
});

Failure scenario: User runs orca cancel --run <id> while a run is in waiting_for_answer. The store briefly shows "cancelled", but the question-flow handler immediately reverts it to "waiting_for_answer". The run appears stuck waiting for input instead of being cancelled. Subsequent orca status will show the wrong state.

Fix: Either skip clearPendingQuestion entirely when cancelled (just clear pendingQuestion without changing overallStatus), or guard clearPendingQuestion against downgrading a terminal status like "cancelled".

@github-actions
Copy link
Copy Markdown

BUG

  • src/core/question-flow.ts:127 accepts the documented {"answers": ...} form without verifying that every pending question ID is present and valid. orca answer runs that parser before writing answer.txt (src/cli/commands/answer.ts:104), so a typo or partial payload like {"answers":{"backedn":{"answers":["bun"]}}} is reported as success and then forwarded to Codex (src/agents/codex/session.ts:847). That sends an answer set that is missing the real question IDs, which can leave the clarification unresolved even though Orca has already treated the prompt as answered.
  • src/core/question-flow.ts:96 makes single-question answers that start with { go through the JSON path instead of being treated as plain text. That breaks legitimate non-interactive answers that are JSON/config/code snippets, e.g. orca answer <run-id> '{"useMigration":true}': the parser now either throws a JSON parse error or complains that the pending question ID is missing. Single-question answers used to accept arbitrary text, so this is a real regression for current CLI workflows.

@github-actions
Copy link
Copy Markdown

BUG

  • src/agents/codex/session.ts:760, src/agents/codex/session.ts:850, src/core/planner.ts:290 — planning/review turns now share the same interactive question handler, but clearPendingQuestion() can only restore overallStatus to "running". If Codex asks a clarification question during the planning gate or task-graph review, answering it makes orca status report a live run as Mode: plan / Overall Status: running until planning finishes. That is a real run-state regression against the documented lifecycle, and any polling/automation that waits for planning to finish will see execution as started before a task graph exists.
  • src/agents/codex/codex-path.ts:146, src/agents/codex/codex-path.ts:170 — the new binary chooser does not actually search all installed codex CLIs. It returns the first codex found on PATH, then compares only that path plus three hard-coded locations. A newer binary later on PATH (for example ~/.local/bin, pnpm, or another managed install) is never considered, so Orca can still launch the older incompatible CLI even though this PR claims it will auto-pick the newest available binary.

RISK

  • src/core/question-flow.ts:14, src/cli/commands/answer.ts:59, src/cli/commands/answer.ts:110isSecret is preserved from Codex prompts but never honored. When a current requestUserInput question is marked secret, Orca still collects it with plain input() and writes the response verbatim to <run>/answer.txt. That is a reproducible plaintext secret leak for API keys/passwords on the new question-flow path.

@github-actions
Copy link
Copy Markdown

BUG

  • src/core/question-flow.ts:45, src/core/question-flow.ts:76, src/cli/commands/status.ts:68: orca status now shows pending questions, but it never prints each question's id. That breaks the README-documented multi-question answer flow (orca answer <run-id> '{"answers":{...}}'), because users cannot discover which keys to send in the JSON payload from the CLI. Any run that asks 2+ questions will be effectively unanswerable unless the user has an onQuestion hook or reads status.json by hand.

RISK

  • src/agents/codex/session.ts:501, docs/codex-app-server.md:953: loadCodexListedSkills() now hard-requires skill.path from skills/list and silently drops entries without it. The new bundled app-server docs for skills/list only show name, description, enabled, interface, and dependencies, so if codex-client 0.1.5 follows that surface, Orca will stop loading every app-server-discovered skill, including codex.perCwdExtraUserRoots skills.

@github-actions
Copy link
Copy Markdown

BUG

  • src/agents/codex/session.ts:1136-1153: executeTask now treats any turn.status === "completed" as task success and returns { outcome: "done" } without parsing the assistant's required terminal JSON marker. A Codex turn can complete successfully at the protocol level while still reporting {"outcome":"failed","error":"..."} in its message body, so task failures will be misclassified as successes, retries will be skipped, and Orca can finalize a run as completed with unfinished work.

RISK

  • src/agents/codex/session.ts:992-1047 and 1058-1059: the new clarification waiter never stops when app-server clears a pending tool/requestUserInput via serverRequest:resolved. The bundled protocol doc explicitly says this notification is emitted when a request is cleared by turn completion or interruption before the user answers (docs/codex-app-server.md:888-890), but Orca only clears persisted state; the polling loop keeps waiting for answer.txt/IPC input forever because it only exits on cancellation or a parsed answer. In that case the live orca run can hang even though Codex has already moved on or finished the turn.

@github-actions
Copy link
Copy Markdown

BUG

  • src/agents/codex/session.ts:479 and src/agents/codex/session.ts:395: codex.perCwdExtraUserRoots no longer delivers usable skill instructions to Codex. The app-server docs show skills/list can return metadata-only entries without a path (docs/codex-app-server.md:940), and this code falls back to inlining only description/interface/dependencies text when that happens. Because turns now send only a plain text blob instead of skill input items, shared skills discovered only through skills/list never get their real SKILL.md body loaded. Users following the new README.md:208 guidance for app-server-visible extra skill roots will get a name/description stub instead of the actual workflow.

BUG

  • src/core/codex-config.ts:71 and src/agents/codex/session.ts:867: an explicit project opt-out cannot disable multi-agent-aware prompts once ~/.codex/config.toml has [features].multi_agent = true. isCodexMultiAgentActive() only checks for config.codex.multiAgent === true; any explicit false still falls through to the global file and returns true. That contradicts the documented contract in README.md:191, which says the global flag is only used when codex.multiAgent is not set in Orca config, and it means users cannot force single-agent planning/review/execution prompts per project after enabling multi-agent globally once.

RISK

  • src/agents/codex/session.ts:745 and src/agents/codex/session.ts:964: the new secret-answer flow persists the IPC socket path and bearer token into run state so orca answer can read them back. In any project-local or otherwise shared runsDir setup, any local process that can read that run directory can reuse the token and inject an arbitrary answer into the live run, so the new “secret” path only avoids writing the answer to answer.txt; it does not actually protect the prompt from other local readers.

@github-actions
Copy link
Copy Markdown

BUG

  • src/agents/codex/session.ts:400 and src/agents/codex/session.ts:599 remove skill input items entirely and fall back to metadata-only text when skills/list does not return a readable path. The new app-server reference explicitly documents that the skill item is what injects full skill instructions (docs/codex-app-server.md:909), and its skills/list example omits path entirely (docs/codex-app-server.md:953). That means any app-server-visible skill that Orca cannot read directly from disk now degrades to a short description/interface blob instead of the actual SKILL.md, so those skills silently stop influencing planning/execution for real users.

BUG

  • src/agents/codex/session.ts:1081 and src/agents/codex/session.ts:1145 can resurrect a cancelled run. When a run is cancelled while waiting for tool/requestUserInput, the loop rejects the pending request, but the unconditional serverRequest:resolved handler always calls clearPendingQuestion(..., resumedOverallStatus), which rewrites overallStatus back to planning/running. The app-server docs say serverRequest/resolved is emitted when a user-input request is answered or cleared (docs/codex-app-server.md:888), so cancelling a clarification prompt can leave status.json showing an active run again instead of staying cancelled.

@github-actions
Copy link
Copy Markdown

BUG

  • src/agents/codex/session.ts:152, src/agents/codex/session.ts:574: the new fallback success check now requires a completed fileChange item for any task that looks like it edits files. That is stricter than the execution contract in the same prompt, which explicitly allows arbitrary shell-based edits. A normal successful turn that edits files via commandExecution (cat > file, sed -i, scripts, etc.) and even passes validation will still be converted to outcome: "failed" if Codex forgets the trailing JSON marker, because no fileChange item was recorded. This turns a recoverable response-format miss into a hard task failure on real file-edit runs.
  • src/core/task-runner.ts:219, src/cli/commands/run.ts:524, src/cli/commands/run.ts:613: runs with review.execution.enabled = false now fire completion side effects twice. run.ts passes deferCompletion: reviewConfig.enabled, so with review disabled the task runner already marks the run completed, emits onComplete, and writes the session summary. After that, runCommandHandler still unconditionally emits onComplete again and rewrites the summary for every non-failed/non-cancelled run. Any onComplete hook that posts notifications, opens PRs, or triggers CI will now execute twice in that configuration.

@github-actions
Copy link
Copy Markdown

BUG

  • src/agents/codex/session.ts:1515 and src/agents/codex/session.ts:1529 start a brand-new Codex thread before the clarification pass and again before the real task execution. That breaks the persistent-per-run session contract documented in README.md and in src/agents/codex/session.ts:1140, because dependent tasks no longer inherit prior turn context. A concrete failure case is a multi-step run where task 1 makes a design choice or gathers nuanced clarification and task 2 says to continue with that choice; task 2 now runs in a fresh thread and only gets whatever was squeezed into clarificationContext, so real run decisions can be lost.

RISK

  • src/agents/codex/session.ts:1184 forces every non-execution turn into collaborationMode: { mode: "plan" } whenever Orca is running interactively, and src/cli/commands/run.ts:541 sends the post-execution reviewer through that path via runPrompt(prompt, "review"). The review prompt explicitly asks Codex to inspect changes and apply fixes, but this code now routes it through the plan preset instead of a normal/reviewer turn. In the review.execution.onFindings: "auto_fix" flow, that can leave the reviewer unable to perform the fixes it is being asked to make, so the loop stalls on findings that Orca previously auto-corrected.

Change summary: this PR updates Orca to the newer Codex surface, adds persisted clarification-question handling with orca answer/orca status, and reworks runtime diagnostics plus multi-agent/review prompt guidance.

@github-actions
Copy link
Copy Markdown

RISK

  • src/agents/codex/session.ts:1190, src/agents/codex/session.ts:1218, src/core/planner.ts:290, src/cli/commands/run.ts:486: normal orca plan/orca run execution now always routes through the new collaboration-mode/question-flow surface whenever a live run context exists, and there is no compatibility fallback if the selected Codex binary does not support it. A user pinned to an older Codex via ORCA_CODEX_PATH (or only having an older app-server build installed) will now fail during planning/consultation or the pre-execution clarification pass before any task runs, instead of degrading to the previous non-question-flow behavior. This is a real regression for existing Orca setups that intentionally override the binary path.

@github-actions
Copy link
Copy Markdown

BUG

  • src/core/planner.ts:299 and src/agents/codex/session.ts:1668: planning-time clarification answers are not preserved across the planning pipeline. runPlanner() now enables request_user_input during decidePlanningNeed(), but the adapter wrappers are explicitly stateless and create a fresh Codex thread for each call. If the gate asks a blocking question and returns needsPlan=true, the later planSpec() call starts from the original spec in a new thread, so the user's answer is no longer in context. The concrete failure mode is an ambiguous spec that asks for a choice during the planning gate (for example runtime/framework), then either re-asks that same question or generates a task graph that ignores the answer because the actual planner never sees it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant