-
Notifications
You must be signed in to change notification settings - Fork 0
fix: harden agentty session lifecycle #10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
pinion05
wants to merge
1
commit into
main
Choose a base branch
from
fix-agentty-session-reliability
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
67 changes: 67 additions & 0 deletions
67
docs/plans/2026-03-06-agentty-reliability-implementation.md
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,67 @@ | ||
| # Agentty Reliability Implementation Plan | ||
|
|
||
| > **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task. | ||
|
|
||
| **Goal:** Fix the highest-priority reliability and CLI contract issues in `agentty` without broad refactoring. | ||
|
|
||
| **Architecture:** Tighten the session startup contract so `start` preserves argv boundaries and only reports success once the PTY is ready. Add cross-process state serialization so concurrent CLI invocations do not lose session records. Validate `attach` targets and make `kill` fail when the session has not actually exited. | ||
|
|
||
| **Tech Stack:** TypeScript, Vitest, node-pty, execa | ||
|
|
||
| --- | ||
|
|
||
| ### Task 1: Preserve argv boundaries and delay `start` success until PTY readiness | ||
|
|
||
| **Files:** | ||
| - Modify: `src/index.ts` | ||
| - Modify: `src/sessionRuntime.ts` | ||
| - Modify: `src/worker.ts` | ||
| - Modify: `src/ipc.ts` | ||
| - Test: `tests/sessionRuntime.start.test.ts` | ||
| - Test: `tests/e2e.start-argv.test.ts` | ||
|
|
||
| **Steps:** | ||
| 1. Write a failing test that proves `start` preserves quoted and empty argv entries. | ||
| 2. Run the targeted test and confirm it fails for the expected reason. | ||
| 3. Write a failing test that proves `startSession()` returns the PTY pid rather than the worker pid. | ||
| 4. Run the targeted test and confirm it fails. | ||
| 5. Implement the minimal `file + args[]` start contract and a worker readiness handshake. | ||
| 6. Re-run the targeted tests until green. | ||
|
|
||
| ### Task 2: Prevent session state loss across concurrent CLI invocations | ||
|
|
||
| **Files:** | ||
| - Modify: `src/state.ts` | ||
| - Test: `tests/e2e.concurrent-start.test.ts` | ||
|
|
||
| **Steps:** | ||
| 1. Write a failing concurrency test that starts multiple sessions in parallel and asserts all session records remain present. | ||
| 2. Run the targeted test and confirm it fails. | ||
| 3. Add minimal cross-process state serialization around shared state mutations. | ||
| 4. Re-run the targeted test until green. | ||
|
|
||
| ### Task 3: Tighten `attach` validation and `kill` semantics | ||
|
|
||
| **Files:** | ||
| - Modify: `src/resolveSession.ts` | ||
| - Modify: `src/sessionRuntime.ts` | ||
| - Modify: `tests/attach.test.ts` | ||
| - Modify: `tests/kill.test.ts` | ||
|
|
||
| **Steps:** | ||
| 1. Write failing tests for attaching nonexistent or exited sessions. | ||
| 2. Run the targeted test and confirm it fails. | ||
| 3. Write a failing unit test that proves `killSession()` should reject when exit confirmation never arrives. | ||
| 4. Run the targeted test and confirm it fails. | ||
| 5. Implement minimal validation for `attach` and make `kill` timeout explicit. | ||
| 6. Re-run the targeted tests until green. | ||
|
|
||
| ### Task 4: Verify the full suite | ||
|
|
||
| **Files:** | ||
| - Test: `tests/*.test.ts` | ||
|
|
||
| **Steps:** | ||
| 1. Run the full test suite in the same environment used for real `agentty` socket access. | ||
| 2. Confirm exit code `0` and zero failing tests. | ||
| 3. Review diffs for unintended changes before reporting completion. | ||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -15,8 +15,10 @@ import { | |
|
|
||
| export interface StartSessionInput { | ||
| command: string; | ||
| args?: string[]; | ||
| cwd: string; | ||
| name?: string; | ||
| displayCommand?: string; | ||
| } | ||
|
|
||
| export interface SessionMetadata { | ||
|
|
@@ -37,8 +39,8 @@ const WORKER_ENTRY_PATH = path.resolve(__dirname, '../dist/worker.js'); | |
| const LOGS_DIR = 'logs'; | ||
| const KILL_WAIT_TIMEOUT_MS = 3_000; | ||
| const KILL_WAIT_INTERVAL_MS = 50; | ||
| const SOCKET_READY_TIMEOUT_MS = 1_000; | ||
| const SOCKET_READY_POLL_INTERVAL_MS = 50; | ||
| const START_READY_TIMEOUT_MS = 10_000; | ||
| const START_READY_POLL_INTERVAL_MS = 50; | ||
|
|
||
| function isUnavailableIpcError(error: unknown): boolean { | ||
| const code = (error as NodeJS.ErrnoException)?.code; | ||
|
|
@@ -85,26 +87,28 @@ async function markSessionExited(sessionId: string, exitCode: number | null): Pr | |
| } | ||
| } | ||
|
|
||
| async function waitForExited(sessionId: string): Promise<void> { | ||
| async function waitForExited(sessionId: string): Promise<boolean> { | ||
| const deadline = Date.now() + KILL_WAIT_TIMEOUT_MS; | ||
|
|
||
| while (Date.now() < deadline) { | ||
| const session = await readSessionById(sessionId); | ||
|
|
||
| if (!session || session.status === 'exited') { | ||
| return; | ||
| return true; | ||
| } | ||
|
|
||
| await new Promise((resolve) => setTimeout(resolve, KILL_WAIT_INTERVAL_MS)); | ||
| } | ||
|
|
||
| return false; | ||
| } | ||
|
|
||
| function getWorkerLogPath(sessionId: string): string { | ||
| return path.join(getStateRoot(), LOGS_DIR, `${sessionId}.log`); | ||
| } | ||
|
|
||
| async function waitForSocketReady(socketPath: string, didWorkerExit: () => boolean): Promise<boolean> { | ||
| const deadline = Date.now() + SOCKET_READY_TIMEOUT_MS; | ||
| const deadline = Date.now() + START_READY_TIMEOUT_MS; | ||
|
|
||
| while (Date.now() < deadline) { | ||
| try { | ||
|
|
@@ -118,7 +122,7 @@ async function waitForSocketReady(socketPath: string, didWorkerExit: () => boole | |
| break; | ||
| } | ||
|
|
||
| await new Promise((resolve) => setTimeout(resolve, SOCKET_READY_POLL_INTERVAL_MS)); | ||
| await new Promise((resolve) => setTimeout(resolve, START_READY_POLL_INTERVAL_MS)); | ||
| } | ||
|
|
||
| try { | ||
|
|
@@ -129,7 +133,37 @@ async function waitForSocketReady(socketPath: string, didWorkerExit: () => boole | |
| } | ||
| } | ||
|
|
||
| export async function startSession({ command, cwd, name }: StartSessionInput): Promise<SessionMetadata> { | ||
| async function waitForSessionReady( | ||
| sessionId: string, | ||
| workerPid: number, | ||
| didWorkerExit: () => boolean, | ||
| ): Promise<SessionMetadata | null> { | ||
| const deadline = Date.now() + START_READY_TIMEOUT_MS; | ||
|
|
||
| while (Date.now() < deadline) { | ||
| const session = await readSessionById(sessionId); | ||
|
|
||
| if ( | ||
| session && | ||
| session.status === 'running' && | ||
| typeof session.pid === 'number' && | ||
| typeof session.workerPid === 'number' && | ||
| session.pid !== workerPid | ||
| ) { | ||
| return session as SessionMetadata; | ||
| } | ||
|
|
||
| if (didWorkerExit()) { | ||
| break; | ||
| } | ||
|
|
||
| await new Promise((resolve) => setTimeout(resolve, START_READY_POLL_INTERVAL_MS)); | ||
| } | ||
|
|
||
| return null; | ||
| } | ||
|
|
||
| export async function startSession({ command, args, cwd, name, displayCommand }: StartSessionInput): Promise<SessionMetadata> { | ||
| const trimmedCommand = command.trim(); | ||
|
|
||
| if (!trimmedCommand) { | ||
|
|
@@ -149,9 +183,11 @@ export async function startSession({ command, cwd, name }: StartSessionInput): P | |
| const workerSpec = { | ||
| id: sessionId, | ||
| command: trimmedCommand, | ||
| ...(args ? { args } : {}), | ||
| cwd, | ||
| socketPath, | ||
| startedAt: now, | ||
| ...(displayCommand ? { displayCommand } : {}), | ||
| ...(name ? { name } : {}), | ||
| }; | ||
|
|
||
|
|
@@ -195,7 +231,7 @@ export async function startSession({ command, cwd, name }: StartSessionInput): P | |
| id: sessionId, | ||
| pid: child.pid, | ||
| workerPid: child.pid, | ||
| command: trimmedCommand, | ||
| command: displayCommand ?? trimmedCommand, | ||
| cwd, | ||
| startedAt: now, | ||
| lastActiveAt: now, | ||
|
|
@@ -229,13 +265,29 @@ export async function startSession({ command, cwd, name }: StartSessionInput): P | |
| await markSessionExited(sessionId, workerExitCode); | ||
|
|
||
| throw new Error( | ||
| `session worker failed to start (socket was not created within ${SOCKET_READY_TIMEOUT_MS}ms): ${socketPath}. Check worker log: ${logFilePath}`, | ||
| `session worker failed to start (socket was not created within ${START_READY_TIMEOUT_MS}ms): ${socketPath}. Check worker log: ${logFilePath}`, | ||
| ); | ||
| } | ||
|
|
||
| const readySession = await waitForSessionReady(sessionId, child.pid, () => workerExited); | ||
|
|
||
| if (!readySession) { | ||
| try { | ||
| process.kill(child.pid, 'SIGTERM'); | ||
| } catch { | ||
| // ignore cleanup errors | ||
| } | ||
|
|
||
| await markSessionExited(sessionId, workerExitCode); | ||
|
|
||
| throw new Error( | ||
| `session worker failed to become ready within ${START_READY_TIMEOUT_MS}ms: ${socketPath}. Check worker log: ${logFilePath}`, | ||
| ); | ||
|
Comment on lines
+272
to
285
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 1. Orphan pty on lock failure If session persistence during worker startup fails (e.g., state lock timeout) or the parent SIGTERMs the worker during the new readiness wait, the worker can exit before installing SIGTERM/SIGINT handlers and without calling requestKill(), potentially leaving the PTY process running orphaned. Agent Prompt
|
||
| } | ||
|
|
||
| child.unref(); | ||
|
|
||
| return session; | ||
| return readySession; | ||
| } | ||
|
|
||
| export async function sendText(sessionId: string, payload: string): Promise<void> { | ||
|
|
@@ -301,7 +353,11 @@ export async function killSession(sessionId: string): Promise<void> { | |
| await requestIpc(session.socketPath, { | ||
| method: 'kill', | ||
| }); | ||
| await waitForExited(sessionId); | ||
| const exited = await waitForExited(sessionId); | ||
|
|
||
| if (!exited) { | ||
| throw new Error(`session did not exit within ${KILL_WAIT_TIMEOUT_MS}ms: ${sessionId}`); | ||
| } | ||
| } catch (error) { | ||
| if (isUnavailableIpcError(error)) { | ||
| await markSessionExited(sessionId, null); | ||
|
|
||
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix heading level increment.
The static analysis tool flagged that heading levels should increment by one level at a time. The document jumps from
#(h1) to###(h3), skipping##(h2).📝 Proposed fix
Apply the same change to Tasks 2, 3, and 4 (lines 31, 43, 59).
📝 Committable suggestion
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)
[warning] 13-13: Heading levels should only increment by one level at a time
Expected: h2; Actual: h3
(MD001, heading-increment)
🤖 Prompt for AI Agents