fix: file-based kill signal IPC and resumable threads #4

Merged. Kevin7Qi merged 5 commits into main from fix/kill-signal-ipc on Mar 2, 2026.

@Kevin7Qi (Owner) commented on Mar 2, 2026

Summary

  • File-based kill signal IPC: kill writes a signal file to ~/.codex-collab/kill-signals/<threadId>, and executeTurn polls for it via Promise.race — covering both the turn/start request and completion phases. This replaces the previous approach of directly archiving threads on kill.
  • PID file tracking: Processes write PID files to ~/.codex-collab/pids/, enabling jobs to detect stale "running" threads when the owning process dies without cleanup (e.g. SIGKILL, crash).
  • Kill no longer archives: kill just interrupts the thread (status becomes "interrupted"), making it resumable via --resume. delete now handles permanent cleanup and also stops running threads before archiving.
  • Windows process cleanup: close() now awaits process exit on Windows (matching Unix behavior), preventing dangling promises from keeping the event loop alive.
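
The file-based signal and the two Promise.race stages could look roughly like the sketch below. It is not the PR's code: `writeKillSignal`, `watchForKillSignal`, `executeTurnSketch`, the polling interval, and the shape of `KillSignalError` are all assumptions made for illustration.

```typescript
import { existsSync, mkdirSync, writeFileSync } from "node:fs";
import { join } from "node:path";

// Hypothetical error type: the PR mentions KillSignalError but does not show its shape.
class KillSignalError extends Error {
  constructor(threadId: string) {
    super(`kill signal received for thread ${threadId}`);
    this.name = "KillSignalError";
  }
}

// Kill side: drop an empty marker file named after the thread.
function writeKillSignal(signalsDir: string, threadId: string): string {
  mkdirSync(signalsDir, { recursive: true });
  const signalPath = join(signalsDir, threadId);
  writeFileSync(signalPath, "");
  return signalPath;
}

// Runner side: a promise that rejects once the marker file appears.
function watchForKillSignal(signalPath: string, threadId: string, intervalMs: number) {
  let timer: ReturnType<typeof setInterval> | undefined;
  const killed = new Promise<never>((_resolve, reject) => {
    timer = setInterval(() => {
      if (existsSync(signalPath)) {
        clearInterval(timer);
        reject(new KillSignalError(threadId));
      }
    }, intervalMs);
  });
  killed.catch(() => {}); // avoid an unhandled rejection if nothing is racing on it
  return { killed, stop: () => clearInterval(timer) };
}

// One watcher is raced against both phases of the turn.
async function executeTurnSketch<T>(
  startTurn: () => Promise<void>,
  waitForCompletion: () => Promise<T>,
  signalPath: string,
  threadId: string,
  intervalMs = 250, // assumed interval; the PR does not state one
): Promise<T> {
  const watcher = watchForKillSignal(signalPath, threadId, intervalMs);
  try {
    await Promise.race([startTurn(), watcher.killed]); // stage 1: the turn/start request
    return await Promise.race([waitForCompletion(), watcher.killed]); // stage 2: completion
  } finally {
    watcher.stop(); // always stop polling so the interval cannot keep the process alive
  }
}
```

Starting the watcher before `startTurn()` is what lets a kill land even while the initial request is still in flight.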

Changes

| File | What changed |
| --- | --- |
| src/turns.ts | Kill signal polling via createKillSignalAwaiter + two Promise.race stages; KillSignalError handling; stale signal cleanup |
| src/cli.ts | PID file helpers; SIGTERM handler; cmdKill rewritten (signal-based, no archive); cmdDelete stops running threads; cmdJobs detects stale processes |
| src/threads.ts | Thread ID validation at registration for filename safety |
| src/config.ts | Added killSignalsDir and pidsDir config getters |
| src/types.ts | Replaced "killed" with "interrupted" in status union |
| src/protocol.ts | Windows close() awaits process exit with 3s timeout |
| src/protocol.test.ts | Removed Windows afterEach delay workaround |
| src/turns.test.ts | 20 tests covering kill signal races, error propagation, thread status |
| src/cli.test.ts | Accept exit code 143 in health test (SIGTERM handler) |
| SKILL.md | Updated kill command description and sandbox instructions |

Test plan

  • All 103 unit tests pass (bun test)
  • E2E: kill running task → status becomes "interrupted" → resume picks up
  • E2E: delete running task → process stops, thread removed from jobs
  • E2E: delete interrupted task → archives on server, removed from jobs
  • Windows compatibility verified (process cleanup, filename safety)

Kevin7Qi added 2 commits March 2, 2026 16:34
…mable

Rework the kill command to use file-based signal IPC instead of
directly archiving threads. The running process now polls for a
signal file via Promise.race, enabling cross-process kill detection
even during slow turn/start requests.

Key changes:
- Kill signal polling starts before turn/start with two Promise.race
  stages covering both the request and completion phases
- PID file tracking (~/.codex-collab/pids/) enables stale "running"
  detection in `jobs` when the owning process dies without cleanup
- `kill` no longer archives threads — it just interrupts, making
  threads resumable. `delete` handles permanent cleanup and now
  also stops running threads before archiving.
- Status uses "interrupted" instead of the removed "killed" terminal
  state, with guard against killing non-running threads
- SIGTERM handler added alongside existing SIGINT for graceful shutdown
- Thread IDs validated at registration for Windows filename safety
- Comprehensive error handling: EPERM-aware process checks,
  try-catch in setInterval polling, warning logs for all catch blocks
- 20 tests covering kill signal races, error propagation, and
  thread status updates
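
Because thread IDs become file names under the signal and PID directories, validation at registration might resemble the following sketch. The exact rules the PR enforces are not shown; `isFilenameSafeThreadId` and the checks below are assumptions based on general Windows file-name restrictions.

```typescript
// Characters Windows forbids in file names, plus path separators and control characters.
const UNSAFE_CHARS = /[<>:"/\\|?*\x00-\x1f]/;
// Device names Windows reserves, with or without an extension.
const RESERVED_NAMES = /^(con|prn|aux|nul|com[1-9]|lpt[1-9])(\..*)?$/i;

// Hypothetical validator: returns true only if the ID can be used verbatim as a file name.
function isFilenameSafeThreadId(id: string): boolean {
  if (id.length === 0 || id.length > 255) return false;
  if (id === "." || id === "..") return false; // would resolve to a directory
  if (UNSAFE_CHARS.test(id)) return false;
  if (RESERVED_NAMES.test(id)) return false;
  if (/[. ]$/.test(id)) return false; // Windows strips trailing dots and spaces
  return true;
}
```

Rejecting path separators also keeps an ID such as `../x` from escaping the target directory on any platform.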

close() on Windows was returning immediately after taskkill without
awaiting proc.exited or readLoop, leaving dangling promises that kept
the event loop alive. This caused background tasks to stay "running"
long after their output had completed.

Now awaits process exit with a 3s timeout on Windows (matching the
Unix path). Removes the 1s afterEach delay workaround from protocol
tests since close() properly cleans up on all platforms.
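
A bounded wait of this kind can be sketched as a race between the exit promise and a timer. `waitForExit` is a hypothetical helper, and `exited` stands in for whatever promise the real close() awaits (the commit mentions proc.exited).

```typescript
// Resolves to "exited" if the process finishes in time, otherwise "timeout".
async function waitForExit(
  exited: Promise<unknown>,
  timeoutMs = 3000, // the 3s bound mentioned in the commit message
): Promise<"exited" | "timeout"> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<"timeout">((resolve) => {
    timer = setTimeout(() => resolve("timeout"), timeoutMs);
  });
  try {
    return await Promise.race([exited.then(() => "exited" as const), timeout]);
  } finally {
    clearTimeout(timer); // a pending timer would itself keep the event loop alive
  }
}
```

Clearing the timer in `finally` matters here: the original bug was about leftover work keeping the event loop alive, and an uncleared timeout would reintroduce a shorter version of the same problem.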

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5a20ee3c8e

Comment thread src/cli.ts Outdated
Comment thread src/turns.ts Outdated
Kevin7Qi added 3 commits March 2, 2026 17:33
1. isProcessAlive: return true (assume alive) when PID file is
   missing instead of false. Prevents cmdJobs from marking live
   threads as interrupted when PID file write failed or predates
   the PID tracking feature.

2. Stale signal cleanup: use timestamp comparison instead of
   unconditional unlink. Only removes signal files older than the
   current process, preserving fresh kill requests that arrive
   between process start and poll start.
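
The timestamp comparison in item 2 can be sketched as follows. `isStaleSignal` and `clearStaleSignal` are hypothetical names, and deriving the process start time from `process.uptime()` is an assumption about how "older than the current process" is measured.

```typescript
import { statSync, unlinkSync } from "node:fs";

// A signal is stale if it was last modified before this process started.
function isStaleSignal(signalMtimeMs: number, processStartMs: number): boolean {
  return signalMtimeMs < processStartMs;
}

// Removes the signal file only when it predates the process; returns true if it was removed.
function clearStaleSignal(
  signalPath: string,
  processStartMs: number = Date.now() - process.uptime() * 1000,
): boolean {
  let mtimeMs: number;
  try {
    mtimeMs = statSync(signalPath).mtimeMs;
  } catch {
    return false; // no signal file (or not readable): nothing to clean up
  }
  if (!isStaleSignal(mtimeMs, processStartMs)) return false; // fresh kill request: keep it
  unlinkSync(signalPath);
  return true;
}
```

A signal written after the process started is left in place, so the poller picks it up on its first check instead of losing it.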

1. Kill signal poll: keep retrying on transient errors instead of
   clearing the interval and leaving a never-settling promise that
   silently disables kill monitoring for the rest of the turn.

2. isProcessAlive: log warning when PID file contains invalid content
   so corrupted files leave a diagnostic trail.

3. Add boundary test: fresh signal file (current mtime) is preserved
   at turn start and triggers interrupted status.

4. Fix stale signal test: backdate relative to process.uptime() instead
   of fixed 60s offset to avoid flakiness when process has been running
   longer than 60s (CI, watch mode).
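
Taken together, the isProcessAlive behaviour described in these commits (assume alive when the PID file is missing, warn on invalid content, treat EPERM as alive) might look like the sketch below. `writePidFile` and the file layout are assumptions, as is returning true for invalid content; the commits only say that a warning is logged in that case.

```typescript
import { mkdirSync, readFileSync, writeFileSync } from "node:fs";
import { join } from "node:path";

// Hypothetical helper: record the current process as the owner of a thread.
function writePidFile(pidsDir: string, threadId: string): string {
  mkdirSync(pidsDir, { recursive: true });
  const pidFile = join(pidsDir, threadId);
  writeFileSync(pidFile, String(process.pid));
  return pidFile;
}

function isProcessAlive(pidFile: string): boolean {
  let raw: string;
  try {
    raw = readFileSync(pidFile, "utf8");
  } catch {
    // Missing (or unreadable) PID file: assume alive so a live thread is
    // never marked as interrupted just because tracking data is absent.
    return true;
  }
  const pid = Number(raw.trim());
  if (!Number.isInteger(pid) || pid <= 0) {
    console.warn(`invalid PID file content in ${pidFile}`);
    return true; // assumption: stay conservative when the file is corrupted
  }
  try {
    process.kill(pid, 0); // signal 0 only checks for existence, it sends nothing
    return true;
  } catch (err) {
    // EPERM means the process exists but belongs to another user.
    return (err as { code?: string }).code === "EPERM";
  }
}
```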

When both kill mechanisms fail (signal file write + server interrupt),
keep the thread status as "running" so the user can retry. Previously
the status was unconditionally set to "interrupted", which blocked
subsequent kill attempts via the "already interrupted" guard.
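
The status rule in this commit reduces to "interrupted if at least one mechanism succeeded, otherwise still running". A minimal sketch, with `attemptKill` and its two callback parameters as hypothetical stand-ins for the real signal-file write and server interrupt:

```typescript
type KillOutcome = "interrupted" | "running";

// Tries both kill mechanisms and reports which status the thread should be stored with.
async function attemptKill(
  writeSignalFile: () => Promise<void>,
  interruptOnServer: () => Promise<void>,
): Promise<KillOutcome> {
  const results = await Promise.allSettled([writeSignalFile(), interruptOnServer()]);
  const anySucceeded = results.some((r) => r.status === "fulfilled");
  // If both failed, leave the thread "running" so a later kill is not
  // rejected by the "already interrupted" guard.
  return anySucceeded ? "interrupted" : "running";
}
```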

@Kevin7Qi merged commit d0996a6 into main on Mar 2, 2026
2 checks passed
@Kevin7Qi deleted the fix/kill-signal-ipc branch on March 2, 2026 10:04