67 changes: 67 additions & 0 deletions docs/plans/2026-03-06-agentty-reliability-implementation.md
@@ -0,0 +1,67 @@
# Agentty Reliability Implementation Plan

> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

**Goal:** Fix the highest-priority reliability and CLI contract issues in `agentty` without broad refactoring.

**Architecture:** Tighten the session startup contract so `start` preserves argv boundaries and only reports success once the PTY is ready. Add cross-process state serialization so concurrent CLI invocations do not lose session records. Validate `attach` targets and make `kill` fail when the session has not actually exited.

**Tech Stack:** TypeScript, Vitest, node-pty, execa

---

### Task 1: Preserve argv boundaries and delay `start` success until PTY readiness

⚠️ Potential issue | 🟡 Minor

Fix heading level increment.

The static analysis tool flagged that heading levels should increment by one level at a time. The document jumps from # (h1) to ### (h3), skipping ## (h2).

📝 Proposed fix
-### Task 1: Preserve argv boundaries and delay `start` success until PTY readiness
+## Task 1: Preserve argv boundaries and delay `start` success until PTY readiness

Apply the same change to Tasks 2, 3, and 4 (lines 31, 43, 59).

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-### Task 1: Preserve argv boundaries and delay `start` success until PTY readiness
+## Task 1: Preserve argv boundaries and delay `start` success until PTY readiness
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 13-13: Heading levels should only increment by one level at a time
Expected: h2; Actual: h3

(MD001, heading-increment)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/plans/2026-03-06-agentty-reliability-implementation.md` at line 13, The
document jumps heading levels from H1 to H3; change the "### Task 1: Preserve
argv boundaries and delay `start` success until PTY readiness" heading to H2
(i.e., "## Task 1...") to restore proper incremental heading levels, and apply
the same change to the headings for Task 2, Task 3, and Task 4 (use the exact
heading texts "Task 2:", "Task 3:", "Task 4:" in the file) so each task heading
is H2 and the outline increments by one level at a time.


**Files:**
- Modify: `src/index.ts`
- Modify: `src/sessionRuntime.ts`
- Modify: `src/worker.ts`
- Modify: `src/ipc.ts`
- Test: `tests/sessionRuntime.start.test.ts`
- Test: `tests/e2e.start-argv.test.ts`

**Steps:**
1. Write a failing test that proves `start` preserves quoted and empty argv entries.
2. Run the targeted test and confirm it fails for the expected reason.
3. Write a failing test that proves `startSession()` returns the PTY pid rather than the worker pid.
4. Run the targeted test and confirm it fails.
5. Implement the minimal `file + args[]` start contract and a worker readiness handshake.
6. Re-run the targeted tests until green.
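
As an illustration of the argv contract in Task 1, the sketch below contrasts a `file + args[]` split with a join-and-resplit approach that loses quoted and empty entries. The names `StartSpec`, `toStartSpec`, and `lossyRoundTrip` are hypothetical and not taken from the PR; only the idea of passing the executable and its arguments separately comes from the plan.

```typescript
// Hypothetical shape: only the "file + args[]" idea comes from the plan.
interface StartSpec {
  file: string;
  args: string[];
}

// Keep every argv entry as its own element instead of joining into one string.
function toStartSpec(commandParts: string[]): StartSpec {
  const file = commandParts[0];
  if (file === undefined) {
    throw new Error('missing command');
  }
  return { file, args: commandParts.slice(1) };
}

// For contrast: join with spaces, then split on whitespace again.
// Entries containing spaces are broken apart and empty entries vanish.
function lossyRoundTrip(commandParts: string[]): string[] {
  return commandParts
    .join(' ')
    .split(/\s+/)
    .filter((part) => part.length > 0);
}

const exampleArgv = ['echo', 'hello world', ''];
const exampleSpec = toStartSpec(exampleArgv);

console.log(exampleSpec.args.length); // 2
console.log(lossyRoundTrip(exampleArgv)); // [ 'echo', 'hello', 'world' ]
```

A failing test for step 1 could assert that the arguments received by the spawned process equal `exampleSpec.args` element by element, including the trailing empty string.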

### Task 2: Prevent session state loss across concurrent CLI invocations

**Files:**
- Modify: `src/state.ts`
- Test: `tests/e2e.concurrent-start.test.ts`

**Steps:**
1. Write a failing concurrency test that starts multiple sessions in parallel and asserts all session records remain present.
2. Run the targeted test and confirm it fails.
3. Add minimal cross-process state serialization around shared state mutations.
4. Re-run the targeted test until green.
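
One way to serialize shared state mutations across processes is an advisory lock built on an atomic filesystem operation. The sketch below is a minimal illustration under that assumption, not the PR's implementation in `src/state.ts`: `withFileLock`, `appendRecord`, and `demo` are hypothetical names, and the state file is simplified to a JSON array of strings.

```typescript
import * as fs from 'node:fs';
import * as os from 'node:os';
import * as path from 'node:path';

// Hypothetical helper: acquire a lock directory, run fn, always release.
async function withFileLock<T>(
  lockPath: string,
  fn: () => Promise<T>,
  timeoutMs = 2_000,
  retryMs = 25,
): Promise<T> {
  const deadline = Date.now() + timeoutMs;

  // mkdir fails with EEXIST if the directory exists, so only one caller wins.
  for (;;) {
    try {
      fs.mkdirSync(lockPath);
      break;
    } catch (error) {
      if ((error as NodeJS.ErrnoException).code !== 'EEXIST') throw error;
      if (Date.now() >= deadline) throw new Error(`lock timeout: ${lockPath}`);
      await new Promise((resolve) => setTimeout(resolve, retryMs));
    }
  }

  try {
    return await fn();
  } finally {
    fs.rmdirSync(lockPath);
  }
}

// Read-modify-write of the state file, performed entirely under the lock.
async function appendRecord(statePath: string, record: string): Promise<void> {
  await withFileLock(`${statePath}.lock`, async () => {
    const current: string[] = fs.existsSync(statePath)
      ? (JSON.parse(fs.readFileSync(statePath, 'utf8')) as string[])
      : [];
    // Yield so that overlapping writers would interleave here without the lock.
    await new Promise((resolve) => setTimeout(resolve, 5));
    current.push(record);
    fs.writeFileSync(statePath, JSON.stringify(current));
  });
}

// Four overlapping writers against a temporary state file.
async function demo(): Promise<string[]> {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'lock-sketch-'));
  const statePath = path.join(dir, 'state.json');
  await Promise.all(['a', 'b', 'c', 'd'].map((r) => appendRecord(statePath, r)));
  return JSON.parse(fs.readFileSync(statePath, 'utf8')) as string[];
}
```

This sketch does not handle stale locks: if a lock holder crashes, the lock directory stays behind and later callers time out. A real implementation would need a recovery policy for that case.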

### Task 3: Tighten `attach` validation and `kill` semantics

**Files:**
- Modify: `src/resolveSession.ts`
- Modify: `src/sessionRuntime.ts`
- Modify: `tests/attach.test.ts`
- Modify: `tests/kill.test.ts`

**Steps:**
1. Write failing tests for attaching nonexistent or exited sessions.
2. Run the targeted test and confirm it fails.
3. Write a failing unit test that proves `killSession()` should reject when exit confirmation never arrives.
4. Run the targeted test and confirm it fails.
5. Implement minimal validation for `attach` and make `kill` timeout explicit.
6. Re-run the targeted tests until green.
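
The explicit `kill` timeout in step 5 can be pictured as a polling helper that reports whether the condition was met, plus a caller that turns a `false` result into a rejection. The sketch below assumes hypothetical names (`waitUntil`, `confirmExit`) and an injected `hasExited` check in place of the real state lookup.

```typescript
// Poll `check` until it resolves to true or the deadline passes.
// Returns true on success and false on timeout; it never throws on timeout.
async function waitUntil(
  check: () => Promise<boolean>,
  timeoutMs: number,
  intervalMs: number,
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;

  while (Date.now() < deadline) {
    if (await check()) {
      return true;
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }

  return false;
}

// Hypothetical wrapper: reject when exit confirmation never arrives,
// instead of returning as if the session had exited.
async function confirmExit(
  sessionId: string,
  hasExited: () => Promise<boolean>,
  timeoutMs = 3_000,
): Promise<void> {
  const exited = await waitUntil(hasExited, timeoutMs, 50);

  if (!exited) {
    throw new Error(`session did not exit within ${timeoutMs}ms: ${sessionId}`);
  }
}
```

In a unit test for step 3, `hasExited` can be a stub that always resolves to `false`, with a short `timeoutMs` so the rejection is observed quickly.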

### Task 4: Verify the full suite

**Files:**
- Test: `tests/*.test.ts`

**Steps:**
1. Run the full test suite in the same environment used for real `agentty` socket access.
2. Confirm exit code `0` and zero failing tests.
3. Review diffs for unintended changes before reporting completion.
22 changes: 19 additions & 3 deletions src/index.ts
@@ -16,8 +16,10 @@ interface GetOptionResult extends SessionOptionResult {

interface StartOptionResult {
command: string;
args?: string[];
cwd: string;
name?: string;
displayCommand: string;
}

export interface CliIo {
@@ -125,6 +127,18 @@ function parseGetOptions(args: string[]): GetOptionResult {
};
}

function formatCommandForDisplay(commandParts: string[]): string {
return commandParts
.map((part) => {
if (part.length === 0 || /[\s"'\\]/.test(part)) {
return JSON.stringify(part);
}

return part;
})
.join(' ');
}

function parseStartOptions(args: string[]): StartOptionResult {
let cwd = process.cwd();
let name: string | undefined;
@@ -177,9 +191,11 @@ function parseStartOptions(args: string[]): StartOptionResult {
}

return {
command: commandParts.join(' '),
command: commandParts[0],
args: commandParts.length > 1 ? commandParts.slice(1) : undefined,
cwd,
name,
displayCommand: formatCommandForDisplay(commandParts),
};
}

@@ -262,8 +278,8 @@ export async function runCli(argv: string[] = process.argv.slice(2), io: CliIo =
}

if (command === 'start') {
const { command: startCommand, cwd, name } = parseStartOptions(argv.slice(1));
const session = await startSession({ command: startCommand, cwd, name });
const { command: startCommand, args, cwd, name, displayCommand } = parseStartOptions(argv.slice(1));
const session = await startSession({ command: startCommand, args, cwd, name, displayCommand });

io.stdout(session.id);
return;
11 changes: 10 additions & 1 deletion src/resolveSession.ts
@@ -1,6 +1,15 @@
import { readActiveSessionId, writeActiveSessionId } from './state';
import { readActiveSessionId, readSessionById, writeActiveSessionId } from './state';

async function ensureRunningSession(sessionId: string): Promise<void> {
const session = await readSessionById(sessionId);

if (!session || session.status !== 'running') {
throw new Error(`session is not running: ${sessionId}`);
}
}

export async function attachSession(id: string): Promise<void> {
await ensureRunningSession(id);
await writeActiveSessionId(id);
}

78 changes: 67 additions & 11 deletions src/sessionRuntime.ts
@@ -15,8 +15,10 @@ import {

export interface StartSessionInput {
command: string;
args?: string[];
cwd: string;
name?: string;
displayCommand?: string;
}

export interface SessionMetadata {
@@ -37,8 +39,8 @@ const WORKER_ENTRY_PATH = path.resolve(__dirname, '../dist/worker.js');
const LOGS_DIR = 'logs';
const KILL_WAIT_TIMEOUT_MS = 3_000;
const KILL_WAIT_INTERVAL_MS = 50;
const SOCKET_READY_TIMEOUT_MS = 1_000;
const SOCKET_READY_POLL_INTERVAL_MS = 50;
const START_READY_TIMEOUT_MS = 10_000;
const START_READY_POLL_INTERVAL_MS = 50;

function isUnavailableIpcError(error: unknown): boolean {
const code = (error as NodeJS.ErrnoException)?.code;
@@ -85,26 +87,28 @@ async function markSessionExited(sessionId: string, exitCode: number | null): Pr
}
}

async function waitForExited(sessionId: string): Promise<void> {
async function waitForExited(sessionId: string): Promise<boolean> {
const deadline = Date.now() + KILL_WAIT_TIMEOUT_MS;

while (Date.now() < deadline) {
const session = await readSessionById(sessionId);

if (!session || session.status === 'exited') {
return;
return true;
}

await new Promise((resolve) => setTimeout(resolve, KILL_WAIT_INTERVAL_MS));
}

return false;
}

function getWorkerLogPath(sessionId: string): string {
return path.join(getStateRoot(), LOGS_DIR, `${sessionId}.log`);
}

async function waitForSocketReady(socketPath: string, didWorkerExit: () => boolean): Promise<boolean> {
const deadline = Date.now() + SOCKET_READY_TIMEOUT_MS;
const deadline = Date.now() + START_READY_TIMEOUT_MS;

while (Date.now() < deadline) {
try {
@@ -118,7 +122,7 @@ async function waitForSocketReady(socketPath: string, didWorkerExit: () => boole
break;
}

await new Promise((resolve) => setTimeout(resolve, SOCKET_READY_POLL_INTERVAL_MS));
await new Promise((resolve) => setTimeout(resolve, START_READY_POLL_INTERVAL_MS));
}

try {
@@ -129,7 +133,37 @@ async function waitForSocketReady(socketPath: string, didWorkerExit: () => boole
}
}

export async function startSession({ command, cwd, name }: StartSessionInput): Promise<SessionMetadata> {
async function waitForSessionReady(
sessionId: string,
workerPid: number,
didWorkerExit: () => boolean,
): Promise<SessionMetadata | null> {
const deadline = Date.now() + START_READY_TIMEOUT_MS;

while (Date.now() < deadline) {
const session = await readSessionById(sessionId);

if (
session &&
session.status === 'running' &&
typeof session.pid === 'number' &&
typeof session.workerPid === 'number' &&
session.pid !== workerPid
) {
return session as SessionMetadata;
}

if (didWorkerExit()) {
break;
}

await new Promise((resolve) => setTimeout(resolve, START_READY_POLL_INTERVAL_MS));
}

return null;
}

export async function startSession({ command, args, cwd, name, displayCommand }: StartSessionInput): Promise<SessionMetadata> {
const trimmedCommand = command.trim();

if (!trimmedCommand) {
@@ -149,9 +183,11 @@ export async function startSession({ command, cwd, name }: StartSessionInput): P
const workerSpec = {
id: sessionId,
command: trimmedCommand,
...(args ? { args } : {}),
cwd,
socketPath,
startedAt: now,
...(displayCommand ? { displayCommand } : {}),
...(name ? { name } : {}),
};

@@ -195,7 +231,7 @@ export async function startSession({ command, cwd, name }: StartSessionInput): P
id: sessionId,
pid: child.pid,
workerPid: child.pid,
command: trimmedCommand,
command: displayCommand ?? trimmedCommand,
cwd,
startedAt: now,
lastActiveAt: now,
@@ -229,13 +265,29 @@ export async function startSession({ command, cwd, name }: StartSessionInput): P
await markSessionExited(sessionId, workerExitCode);

throw new Error(
`session worker failed to start (socket was not created within ${SOCKET_READY_TIMEOUT_MS}ms): ${socketPath}. Check worker log: ${logFilePath}`,
`session worker failed to start (socket was not created within ${START_READY_TIMEOUT_MS}ms): ${socketPath}. Check worker log: ${logFilePath}`,
);
}

const readySession = await waitForSessionReady(sessionId, child.pid, () => workerExited);

if (!readySession) {
try {
process.kill(child.pid, 'SIGTERM');
} catch {
// ignore cleanup errors
}

await markSessionExited(sessionId, workerExitCode);

throw new Error(
`session worker failed to become ready within ${START_READY_TIMEOUT_MS}ms: ${socketPath}. Check worker log: ${logFilePath}`,
);
Comment on lines +272 to 285

Action required

1. Orphan pty on lock failure 🐞 Bug ⛯ Reliability

If session persistence during worker startup fails (e.g., state lock timeout) or the parent SIGTERMs
the worker during the new readiness wait, the worker can exit before installing SIGTERM/SIGINT
handlers and without calling requestKill(), potentially leaving the PTY process running orphaned.
Agent Prompt
## Issue description
The worker can spawn a PTY and then fail/exit before it installs SIGTERM handlers or calls `requestKill()`. With the new state-lock timeouts and the new `startSession()` readiness timeout that SIGTERMs the worker, this can leave orphan PTY processes.

## Issue Context
- Worker spawns PTY before awaiting `persistRunning()`.
- State persistence now can fail due to `withStateLock()` timeout.
- Worker top-level catch exits without killing PTY.

## Fix Focus Areas
- src/worker.ts[330-408]
- src/state.ts[79-125]
- src/sessionRuntime.ts[272-286]
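
One possible shape for the remediation described in this comment is to wrap every post-spawn startup step so that any failure kills the PTY before the error propagates. The sketch below is an assumption about how that could look, not the worker's actual code: `Killable` and `startGuarded` are hypothetical names, and `Killable` only mirrors the `kill(signal?)` method that node-pty's `IPty` exposes.

```typescript
// Hypothetical minimal view of a spawned PTY (node-pty's IPty has kill()).
interface Killable {
  kill(signal?: string): void;
}

// Spawn the PTY, then run the remaining startup steps (for example persisting
// the running state). If any step throws, kill the PTY before rethrowing so
// it cannot outlive the worker.
async function startGuarded<T extends Killable>(
  spawnPty: () => T,
  afterSpawn: (pty: T) => Promise<void>,
): Promise<T> {
  const pty = spawnPty();

  try {
    await afterSpawn(pty);
    return pty;
  } catch (error) {
    try {
      pty.kill();
    } catch {
      // Best-effort cleanup: the PTY may already be gone.
    }
    throw error;
  }
}
```

This covers failures thrown inside the worker, such as a state lock timeout. A SIGTERM from the parent that arrives before signal handlers are installed would still need the handlers to be registered before the PTY is spawned.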


}

child.unref();

return session;
return readySession;
}

export async function sendText(sessionId: string, payload: string): Promise<void> {
@@ -301,7 +353,11 @@ export async function killSession(sessionId: string): Promise<void> {
await requestIpc(session.socketPath, {
method: 'kill',
});
await waitForExited(sessionId);
const exited = await waitForExited(sessionId);

if (!exited) {
throw new Error(`session did not exit within ${KILL_WAIT_TIMEOUT_MS}ms: ${sessionId}`);
}
} catch (error) {
if (isUnavailableIpcError(error)) {
await markSessionExited(sessionId, null);