
feat: SQLite worker coordination + output persistence (#89)#94

Merged
dean0x merged 14 commits into main from feat/worker-coordination-89
Mar 18, 2026

Conversation


@dean0x dean0x commented Mar 17, 2026

Summary

  • Add workers table (migration v9) for cross-process worker coordination — tracks which workers exist across all processes sharing the same SQLite DB
  • Wire SQLiteOutputRepository into the live capture path via periodic 500ms flushes from ProcessConnector, enabling cross-process output visibility through TaskManager.getLogs()
  • Replace the 30-minute staleness heuristic in RecoveryManager with definitive PID-based crash detection (process.kill(pid, 0))
  • Use DB-based global worker count in ResourceMonitor.canSpawnWorker() to prevent cross-process over-spawning (settling workers still used for resource projections only)
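The PID-based crash detection above can be sketched as follows. This is a minimal illustration, not the PR's exact code; it folds in the EPERM handling that a later commit in this thread adds (a live process owned by another user raises EPERM, which must still count as alive):

```typescript
// Probe process liveness with signal 0: no signal is delivered, the kernel
// only performs the existence/permission check.
function isProcessAlive(pid: number): boolean {
  try {
    process.kill(pid, 0); // throws if the process does not exist
    return true;
  } catch (error) {
    // EPERM: the process exists but we lack permission to signal it — alive.
    // ESRCH (and anything else): no such process — dead.
    return (error as { code?: string }).code === "EPERM";
  }
}
```

Unlike the 30-minute staleness heuristic, this answers definitively at recovery time whether the recorded owner PID still exists.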

Key design decisions

  • WorkerRepository interface (DIP) — follows established pattern (TaskRepo, DependencyRepo, ScheduleRepo, CheckpointRepo); services depend on interface, not concrete Database
  • Plain INSERT for register() — UNIQUE violation on task_id is a real coordination error (another process owns this task), not something to silently overwrite
  • Post-spawn INSERT — spawn first (inside existing spawnLock mutex), then INSERT; can't hold sync transaction across async spawn
  • Periodic save(), not per-chunk append() — OutputRepository.append() does O(n) read-modify-write per call; periodic save() with INSERT OR REPLACE is O(1)
  • Required dependencies (no optional guards) — v1.0 breaking change; WorkerRepository and OutputRepository are required in all constructors
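The append-vs-save trade-off can be illustrated with an in-memory stand-in for the SQLite row (names and shapes here are illustrative, not the repository's actual API; `INSERT OR REPLACE` maps to a single keyed overwrite):

```typescript
type TaskOutput = { stdout: string[]; stderr: string[] };

class SnapshotOutputStore {
  // Stand-in for the SQLite output table keyed by task_id.
  private rows = new Map<string, TaskOutput>();

  // The O(n) pattern the PR avoids: read the stored row, concatenate, rewrite.
  // Cost grows with accumulated output on every call.
  append(taskId: string, line: string): void {
    const existing = this.rows.get(taskId) ?? { stdout: [], stderr: [] };
    this.rows.set(taskId, { ...existing, stdout: [...existing.stdout, line] });
  }

  // The O(1) pattern the PR uses: overwrite the row with the current
  // in-memory snapshot, once per flush interval rather than per chunk.
  save(taskId: string, snapshot: TaskOutput): void {
    this.rows.set(taskId, snapshot);
  }

  get(taskId: string): TaskOutput | undefined {
    return this.rows.get(taskId);
  }
}
```

With periodic `save()`, write cost is bounded by the flush interval rather than by output volume.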

Edge cases addressed

  • Settling workers double-counting (split max-workers check vs resource projections)
  • Shutdown race (stopFlushing before kill prevents post-DB-close writes)
  • UNIQUE constraint on register detects cross-process task conflicts
  • In-memory buffer freed after final flush (prevents memory leak)
  • PID recycling acknowledged as theoretical, not actionable

Files changed (23 files, +1826 / -261)

Area | Files
--- | ---
Domain types | domain.ts, interfaces.ts
New implementation | worker-repository.ts
Migration | database.ts (v9)
Worker lifecycle | event-driven-worker-pool.ts
Resource checks | resource-monitor.ts
Recovery | recovery-manager.ts
Output persistence | process-connector.ts, task-manager.ts
Bootstrap wiring | bootstrap.ts
Tests (13 files) | New + updated unit/integration tests

Test plan

  • npm run build — clean compilation
  • npx biome check src/ tests/ — no lint issues
  • npm run test:repositories — WorkerRepository tests pass (12 tests)
  • npm run test:implementations — WorkerPool tests pass
  • npm run test:services — RecoveryManager, ProcessConnector, TaskManager tests pass
  • npm run test:integration — Cross-process coordination tests pass
  • npm run test:core — No regressions
  • Snyk code scan — 0 issues on new/modified files
  • All 1,278 tests passing across all groups

Add cross-process worker coordination via a `workers` table and wire
output persistence through the existing SQLiteOutputRepository.

Core changes:
- WorkerRegistration type + WorkerRepository interface (DIP pattern)
- Migration v9: workers table with FK to tasks, indexes on owner_pid/pid
- SQLiteWorkerRepository: plain INSERT (not REPLACE) for UNIQUE safety
- WorkerPool: register on spawn, unregister on kill/completion
- ResourceMonitor: DB-based global count for max workers check
  (settling workers still used for resource projections only)
- RecoveryManager: PID-based crash detection replaces 30-min heuristic
- ProcessConnector: periodic 500ms flush + final flush on exit + clear
- TaskManager.getLogs(): in-memory → DB fallback for cross-process reads
- Bootstrap: wire WorkerRepository + OutputRepository as required deps

Edge cases addressed:
- Settling workers double-counting (split max-workers vs projections)
- Shutdown race (stopFlushing before kill prevents post-close DB writes)
- UNIQUE constraint on register detects cross-process task conflicts
- In-memory buffer freed after final flush (prevents memory leak)
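The in-memory → DB fallback in TaskManager.getLogs() might look roughly like this. A sketch only: the real method returns a Result and applies tail slicing; the point is the precedence rule — the in-memory buffer (same process, complete output) wins, and the DB snapshot (another process's most recent flush) is the fallback:

```typescript
interface OutputLines {
  stdout: string[];
  stderr: string[];
}

function getLogs(
  inMemory: OutputLines | undefined,
  readFromDb: () => OutputLines | undefined,
): OutputLines {
  if (inMemory && (inMemory.stdout.length > 0 || inMemory.stderr.length > 0)) {
    return inMemory; // same-process: full, up-to-the-chunk buffer
  }
  // cross-process: best-available snapshot, at most one flush interval stale
  return readFromDb() ?? { stdout: [], stderr: [] };
}
```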

greptile-apps bot commented Mar 17, 2026

Confidence Score: 4/5

  • Safe to merge — all previously flagged critical issues are addressed, the design is sound, and 1,278 tests pass.
  • The implementation is thorough and well-tested. All five concerns raised in prior review threads (terminal-status guard, EPERM handling, unregister result checking, buffer-clear on flush failure, RUNNING task re-queue logic) are correctly resolved. The one point deducted reflects the deleteByOwnerPid dead-code surface area in the interface and the misleading inline comment about DB count timing, both of which are minor but worth addressing before the v1.0 interface is considered stable.
  • src/core/interfaces.ts (deleteByOwnerPid has no production caller) and src/implementations/resource-monitor.ts (misleading comment on line 88)

Important Files Changed

Filename Overview
src/implementations/worker-repository.ts New SQLite-backed repository for cross-process worker coordination. Uses prepared statements, Zod validation at the DB boundary, and plain INSERT (no REPLACE) for register() so UNIQUE violations surface as explicit coordination errors. Well-structured; minor concern that deleteByOwnerPid has no production caller.
src/services/recovery-manager.ts Replaces the 30-minute staleness heuristic with two-phase PID-based crash detection. Previous review concerns (terminal-status guard, EPERM handling, unregister result checking) are all addressed in this revision.
src/services/process-connector.ts Adds periodic output flushing (configurable interval), backpressure guard for in-flight flushes, and a prepareForKill() path. The .finally() pattern correctly frees the in-memory buffer regardless of flush outcome, addressing the previous review concern.
src/implementations/resource-monitor.ts Switches the max-workers cap from in-memory settling count to global DB count, enabling cross-process over-spawn prevention. Settling timestamps are correctly kept for CPU/memory projection. Comment on line 88 is slightly misleading about INSERT timing.
src/implementations/event-driven-worker-pool.ts Wires WorkerRepository registration into the spawn path and introduces cleanupWorkerState() shared by kill() and handleWorkerCompletion(). Registration failure correctly kills the just-spawned process and cleans in-memory state before connect() is called, avoiding orphaned intervals.
src/services/task-manager.ts Adds cross-process output visibility by falling back to the DB when the in-memory buffer is empty. The fallback logic is correct (in-memory has the full buffer; DB is the most recent flush snapshot). The totalSize discrepancy when tail slicing is applied to DB output was flagged in a prior review thread.
src/implementations/database.ts Migration v9 creates the workers table with UNIQUE(task_id) and ON DELETE CASCADE from tasks. Indexes on owner_pid and pid are appropriate for the recovery and monitoring queries.
src/bootstrap.ts Correctly wires SQLiteWorkerRepository and SQLiteOutputRepository as singletons, passing them through to SystemResourceMonitor, EventDrivenWorkerPool, TaskManagerService, and RecoveryManager.
src/core/interfaces.ts Adds WorkerRepository interface following the established repository DIP pattern. Minor concern: deleteByOwnerPid is included but has no production call site, adding unnecessary surface area to the interface.
src/core/configuration.ts Adds outputFlushIntervalMs (min 500ms, max 30s, default 5000ms) with OUTPUT_FLUSH_INTERVAL_MS env override. Note: the PR description describes "500ms flushes" but the actual default is 5s.
tests/unit/implementations/worker-repository.test.ts 12 unit tests covering register, unregister, find variants, getGlobalCount, and deleteByOwnerPid against a real in-process SQLite DB. Good coverage of the UNIQUE violation error path.
tests/integration/worker-pool-management.test.ts New integration tests cover register/unregister lifecycle on spawn and completion, global count tracking, and output persistence via ProcessConnector flush. All use mock repositories, which is appropriate for integration-layer tests.

Sequence Diagram

sequenceDiagram
    participant WH as WorkerHandler
    participant RM as ResourceMonitor
    participant WR as WorkerRepository (SQLite)
    participant PC as ProcessConnector
    participant OR as OutputRepository (SQLite)

    WH->>RM: canSpawnWorker()
    RM->>WR: getGlobalCount()
    WR-->>RM: count (cross-process truth)
    RM-->>WH: ok(true)

    WH->>WH: spawn process (async)
    WH->>WR: register(WorkerRegistration)
    Note over WH,WR: UNIQUE(task_id) — fails if another process owns task

    WH->>PC: connect(process, taskId, onExit)
    PC->>PC: setInterval(flushIntervalMs)

    loop every flushIntervalMs
        PC->>OR: save(taskId, buffer snapshot)
    end

    alt Process exits normally
        PC->>PC: stopFlushing(taskId)
        PC->>OR: flushOutput(taskId) [final flush]
        PC->>PC: outputCapture.clear(taskId) [.finally]
        PC->>WH: onExit(code)
        WH->>WR: unregister(workerId)
    else kill() called
        WH->>PC: prepareForKill(taskId)
        PC->>PC: stopFlushing(taskId)
        PC->>OR: flushOutput(taskId)
        WH->>WH: SIGTERM → process
        WH->>WR: unregister(workerId) [via cleanupWorkerState]
    end

    Note over WH,WR: On next startup — RecoveryManager
    WH->>WR: findAll()
    loop each WorkerRegistration
        WH->>WH: isProcessAlive(ownerPid)?
        alt dead PID
            WH->>WR: unregister(workerId)
            WH->>WH: repository.update(taskId, FAILED)
        end
    end

Last reviewed commit: "style: fix Biome for..."

Comment on lines +141 to +148
if (tail && tail > 0) {
return ok({
taskId: output.taskId,
stdout: output.stdout.slice(-tail),
stderr: output.stderr.slice(-tail),
totalSize: output.totalSize,
});
}

P2 totalSize is not updated after tail-slicing the DB result

When the tail parameter is applied to the DB output, stdout and stderr are sliced to tail entries, but totalSize still reflects the full persisted output. Callers that use totalSize to understand the volume of the returned data (e.g. for display or deciding whether to request more) will see an inflated count relative to what was actually returned.

return ok({
  taskId: output.taskId,
  stdout: output.stdout.slice(-tail),
  stderr: output.stderr.slice(-tail),
  // totalSize should reflect the sliced content, not the full persisted output
  totalSize: output.stdout.slice(-tail).reduce((sum, l) => sum + l.length, 0)
           + output.stderr.slice(-tail).reduce((sum, l) => sum + l.length, 0),
});

(Or simply keep totalSize as a "full size" sentinel by documenting that behaviour explicitly — but the current code is silent about the discrepancy.)

- Remove unused err/ok imports from worker-repository.ts
- Add error checking for unregister/update in recovery Phase 0
- Extract cleanupWorkerState() to DRY kill/completion paths
- Extract pruneExpiredTimestamps() to DRY settling cleanup

/**
* Database row type for workers table
*/

Pattern Violation: Missing Zod Row Validation Schema

Every other SQLite repository uses Zod to validate database rows at the system boundary (SQLiteTaskRepository has TaskRowSchema, SQLiteDependencyRepository has DependencyRowSchema, etc.). This repository uses unchecked as casts instead.

Impact: Data corruption or unexpected DB values will produce silent type mismatches rather than early parse errors. This breaks the "parse, don't validate" convention.

Fix: Add a WorkerRowSchema at the top of the file:

import { z } from "zod";

const WorkerRowSchema = z.object({
  worker_id: z.string().min(1),
  task_id: z.string().min(1),
  pid: z.number(),
  owner_pid: z.number(),
  agent: z.string(),
  started_at: z.number(),
});

Then validate in rowToRegistration():

private rowToRegistration(row: WorkerRow): WorkerRegistration {
  const data = WorkerRowSchema.parse(row);
  return {
    workerId: WorkerId(data.worker_id),
    taskId: TaskId(data.task_id),
    // ...
  };
}

Confidence: 95% (Critical pattern deviation from all 4 other repositories)

* Uses plain INSERT (NOT INSERT OR REPLACE) — UNIQUE violation on task_id
* means another process already owns this task, which is a real coordination error.
*/
register(registration: WorkerRegistration): Result<void> {

Pattern Violation: Missing operationErrorHandler

All other repositories use operationErrorHandler() from core/errors.js as the centralized error mapping function. This repository manually constructs BackbeatError inline in every method, which is more verbose and inconsistent.

Fix: Import and use operationErrorHandler:

import { operationErrorHandler } from '../core/errors.js';

// Example for register():
register(registration: WorkerRegistration): Result<void> {
  return tryCatch(
    () => { this.registerStmt.run({...}); },
    operationErrorHandler('register worker', { workerId: registration.workerId }),
  );
}

Note: The register() method has special UNIQUE constraint detection logic which may justify keeping a custom error mapper for that one method. But the other 6 methods should use operationErrorHandler.

Confidence: 90% (Critical pattern deviation from all 4 other repositories)

}

// Start periodic output flushing to DB (every 500ms)
const interval = setInterval(() => {

Hardcoded 500ms Flush Interval Not Configurable

The periodic output flush interval is hardcoded as 500 milliseconds with no Configuration override. With N concurrent workers each running for M seconds, this generates N * (M / 0.5) DB write operations. For 5 workers running 10 minutes each, that is 6,000 save operations.

Impact: Heavy SQLite write amplification under load. Each save() does a full snapshot write, not an incremental append, which is wasteful when output hasn't changed.

Fix: Make the interval configurable via Configuration:

constructor(
  outputCapture: OutputCapture,
  logger: Logger,
  outputRepository: OutputRepository,
  configuration?: Configuration, // Add this
) {
  // ... existing code ...
  
  const flushIntervalMs = configuration?.get('output.flushInterval') ?? 5000;
  
  // Then use in the interval callback
  const interval = setInterval(() => {
    this.flushOutput(taskId).catch(/* ... */);
  }, flushIntervalMs);
}

Consider increasing default to 5-10 seconds (10x fewer writes) and adding a dirty flag to skip flushes when output hasn't changed.

Confidence: 95% (Blocking per performance and architecture reviews)
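The dirty-flag suggestion above can be sketched as a small gate consulted by the interval callback (a hypothetical helper, not code from this PR — the capture path marks a task dirty per chunk, and the flush tick skips the DB write when nothing changed):

```typescript
class FlushGate {
  private dirty = new Set<string>();

  // Called from the per-chunk output capture path.
  markDirty(taskId: string): void {
    this.dirty.add(taskId);
  }

  // Called from the interval callback: returns true at most once per
  // "output arrived" window, so unchanged buffers cost zero DB writes.
  shouldFlush(taskId: string): boolean {
    if (!this.dirty.has(taskId)) return false;
    this.dirty.delete(taskId);
    return true;
  }
}
```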

try {
  process.kill(pid, 0);
  return true;
} catch {
  return false;
}

recover() Method at 153 Lines with 4 Nesting Levels (3x Over Complexity Threshold)

This method handles three distinct recovery phases (dead worker cleanup, QUEUED task re-queue, RUNNING task PID check) in a single 153-line function with 4 levels of nesting. This exceeds the 50-line critical threshold by 3x and makes the method hard to test and understand.

Fix: Extract each phase into a named private method to preserve sequential ordering while making each phase independently readable and testable:

async recover(): Promise<Result<void>> {
  this.logger.info('Starting recovery process');
  this.cleanDeadWorkerRegistrations(); // Phase 0
  await this.cleanupOldCompletedTasks(); // Cleanup
  const queuedResult = await this.requeueQueuedTasks(); // Phase 1
  const failedResult = await this.failCrashedRunningTasks(); // Phase 2
  // ... summary log
  return ok(undefined);
}

private cleanDeadWorkerRegistrations(): void { /* Phase 0 logic */ }
private async cleanupOldCompletedTasks(): Promise<void> { /* cleanup */ }
private async requeueQueuedTasks(): Promise<{ count: number }> { /* Phase 1 */ }
private async failCrashedRunningTasks(): Promise<{ count: number }> { /* Phase 2 */ }

Confidence: 98% (Blocking per complexity review - metric violation)

this.processConnector = new ProcessConnector(outputCapture, logger, outputRepository);
}

async spawn(task: Task): Promise<Result<Worker>> {

spawn() Method at 99 Lines with 8 Error Paths (Complexity Over Threshold)

The spawn() method now has 99 lines and handles agent resolution, resource checking, process spawning, DB registration with rollback, timeout setup, and output connection all in one method. Each cross-cutting concern adds another error path and cleanup responsibility.

Fix: Extract the DB registration + rollback into a helper, or group the post-spawn setup into a finalizeWorkerSetup method:

private finalizeWorkerSetup(
  worker: WorkerState,
  task: Task,
  childProcess: ChildProcess,
): Result<void> {
  const regResult = this.registerWorkerInDb(worker, task);
  if (!regResult.ok) {
    childProcess.kill('SIGTERM');
    this.workers.delete(worker.id);
    this.taskToWorker.delete(task.id);
    return err(regResult.error);
  }
  this.setupTimeoutForWorker(worker);
  this.processConnector.connect(childProcess, task.id, (exitCode) => {
    this.handleWorkerCompletion(task.id, exitCode ?? 0);
  });
  return ok(undefined);
}

Confidence: 95% (Blocking per complexity review)

package.json Outdated
"test:repositories": "NODE_OPTIONS='--max-old-space-size=2048' vitest run tests/unit/implementations/dependency-repository.test.ts tests/unit/implementations/task-repository.test.ts tests/unit/implementations/database.test.ts tests/unit/implementations/checkpoint-repository.test.ts tests/unit/implementations/output-repository.test.ts --no-file-parallelism",
"test:repositories": "NODE_OPTIONS='--max-old-space-size=2048' vitest run tests/unit/implementations/dependency-repository.test.ts tests/unit/implementations/task-repository.test.ts tests/unit/implementations/database.test.ts tests/unit/implementations/checkpoint-repository.test.ts tests/unit/implementations/output-repository.test.ts tests/unit/implementations/worker-repository.test.ts --no-file-parallelism",
"test:adapters": "NODE_OPTIONS='--max-old-space-size=2048' vitest run tests/unit/adapters --no-file-parallelism",
"test:implementations": "NODE_OPTIONS='--max-old-space-size=2048' vitest run tests/unit/implementations --exclude='**/dependency-repository.test.ts' --exclude='**/task-repository.test.ts' --exclude='**/database.test.ts' --exclude='**/checkpoint-repository.test.ts' --exclude='**/output-repository.test.ts' --no-file-parallelism",

Missing Test Script Exclusion: worker-repository.test.ts

The new worker-repository.test.ts was added to the test:repositories script but NOT excluded from test:implementations. When running test:all, these tests execute twice, wasting time and memory.

Fix: Add the exclusion to the test:implementations script:

"test:implementations": "NODE_OPTIONS='--max-old-space-size=2048' vitest run tests/unit/implementations --exclude='**/dependency-repository.test.ts' --exclude='**/task-repository.test.ts' --exclude='**/database.test.ts' --exclude='**/checkpoint-repository.test.ts' --exclude='**/output-repository.test.ts' --exclude='**/worker-repository.test.ts' --no-file-parallelism",

Confidence: 95% (Test setup regression - impacts CI/memory)


dean0x commented Mar 17, 2026

Code Review Summary: PR #94 - SQLite Worker Coordination

High-Confidence Blocking Issues (≥90%)

Inline comments have been created for 6 blocking issues:

  1. Missing Zod row validation schema (SQLiteWorkerRepository) - Consistency violation
  2. Missing operationErrorHandler usage (SQLiteWorkerRepository) - Consistency violation
  3. 500ms hardcoded flush interval (ProcessConnector) - Performance + configurability
  4. recover() at 153 lines, 4 nesting levels (RecoveryManager) - Complexity 3x over threshold
  5. spawn() at 99 lines, 8 error paths (EventDrivenWorkerPool) - Complexity over threshold
  6. Missing test exclusion (package.json) - Test setup regression

Medium/Lower-Confidence Findings (60-79%)

These are flagged in the summary but not as blocking inline comments:

Consolidation Strategy (Duplicate Issues Across Reviewers):

  • Mock factory duplication (9 test files): Identical createMockWorkerRepository() and createMockOutputRepository() factories copy-pasted across 9 files. Should be centralized in tests/fixtures/test-doubles.ts. This was flagged by 4 reviewers (architecture, consistency, tests, typescript). Recommendation: Address in follow-up if this PR pattern is replicated further.

  • OutputRepository interface in implementation layer (3 reviewers: architecture, consistency, typescript): Services import from src/implementations/output-repository.ts instead of src/core/interfaces.ts. Inconsistent with WorkerRepository and other repos. This PR amplifies the issue by adding 3 new cross-layer imports. Recommendation: Move interface to core/interfaces.ts in a separate PR to avoid scope creep.

  • Type safety: Two issues identified:

    • Non-null assertion on workerRegistration! (recovery-manager.ts:153) - Can be eliminated with direct null check
    • Type-only import missing for OutputRepository in event-driven-worker-pool.ts:18
  • Document drift: Three CRITICAL issues in architecture docs (EVENT_FLOW.md):

    • Recovery flow description now contradicts code (describes 30-min heuristic that was replaced)
    • Stale task detection safeguard section is outdated
    • Future improvements reference removed infrastructure
      Recommendation: Update docs to reflect PID-based crash detection before merge.
  • CLAUDE.md missing entries: File Locations table doesn't reference WorkerRepository or updated recovery manager. Database section missing workers table.

  • Data consistency: Double Date.now() calls in spawn() create slightly different timestamps for in-memory worker vs. DB registration.

Analysis by Review Discipline

Discipline | HIGH | MEDIUM | LOW | Status
--- | --- | --- | --- | ---
Architecture | 3 | 2 | 0 | Changes needed
Complexity | 2 | 2 | 2 | Changes needed
Consistency | 2 | 2 | 0 | Changes needed
Database | 0 | 2 | 1 | Approved w/ conditions
Dependencies | 0 | 1 | 0 | Minor fix needed
Documentation | 3 | 2 | 3 | Changes needed
Performance | 2 | 2 | 1 | Changes needed
Regression | 0 | 2 | 0 | Changes needed
Security | 0 | 2 | 1 | Approved w/ conditions
Tests | 1 | 2 | 2 | Changes needed
TypeScript | 2 | 2 | 0 | Changes needed

Reviewer Consensus

All 11 reviewers flagged:

  • Blocking: Zod schema, operationErrorHandler, recover() complexity, spawn() complexity, 500ms flush interval
  • Should Fix: Mock factory consolidation, OutputRepository interface location
  • Document: EVENT_FLOW.md recovery sections need urgent rewrite

Recommended Merge Path

Do not merge without addressing:

  1. All 6 inline-comment issues (blocking consensus across reviews)
  2. EVENT_FLOW.md recovery documentation (CRITICAL - docs now contradict code)
  3. Mock factory consolidation (maintenance debt prevention)

Can defer to follow-up:

  • OutputRepository interface relocation (pre-existing pattern issue)
  • CLAUDE.md entries (documentation completeness)
  • Type safety improvements (non-critical improvements)

Inline comments: 6 created
Summary findings: 10+ medium-confidence items consolidated above
Deduplication: Many findings appeared in 2-4 reviews; consolidated here to avoid spam

Attribution: Claude Code review agent, compiled from architecture, complexity, consistency, database, dependencies, documentation, performance, regression, security, tests, and typescript reviews.

Dean Sharon and others added 5 commits March 17, 2026 23:39
Inline the hasLiveWorker boolean check into the if-condition so
TypeScript can narrow workerRegistration to non-null within the
block, eliminating the need for the ! assertion operator.

Co-Authored-By: Claude <noreply@anthropic.com>
…r consistency

Align SQLiteWorkerRepository with the pattern used by all 4 other
repositories (Task, Dependency, Schedule, Checkpoint):

- Add WorkerRowSchema (Zod) for system-boundary row validation in
  rowToRegistration(), replacing unchecked 'as' cast
- Replace 6 inline BackbeatError constructions with operationErrorHandler()
- Preserve register()'s custom error handler for UNIQUE constraint detection

Co-Authored-By: Claude <noreply@anthropic.com>
…te recovery docs

- Add --exclude='**/worker-repository.test.ts' to test:implementations
  script to prevent duplicate test execution in test:all
- Rewrite EVENT_FLOW.md Recovery Flow to document PID-based crash
  detection (replaces outdated 30-minute staleness heuristic)
- Update Stale Task Detection safeguard section to PID-Based Crash
  Detection with two-phase recovery description
- Remove stale detection from Future Improvements (already replaced)
- CLAUDE.md: add worker-repository.ts to File Locations table
- CLAUDE.md: document workers table (migration v9) in Database section

Co-Authored-By: Claude <noreply@anthropic.com>
…uard

- Add outputFlushIntervalMs to Configuration (default 5000ms, was hardcoded 500ms)
  Reduces DB write load 10x while maintaining cross-process output visibility.
  Configurable via env var OUTPUT_FLUSH_INTERVAL_MS or config file.

- Add backpressure guard to prevent overlapping flush writes. Uses a
  flushingInProgress Set to skip intervals when a previous flush is
  still in-flight. Cleaned up in stopFlushing() to prevent leaks.

- Wire config value through bootstrap -> EventDrivenWorkerPool -> ProcessConnector.

Co-Authored-By: Claude <noreply@anthropic.com>
Remove duplicated createMockWorkerRepository() and createMockOutputRepository()
factory functions from individual test files. All tests now import from the
shared tests/fixtures/mocks.ts, eliminating copy-paste duplication and
standardizing naming (createMockOutputRepo -> createMockOutputRepository).

Co-Authored-By: Claude <noreply@anthropic.com>
@dean0x dean0x mentioned this pull request Mar 17, 2026
16 tasks
Dean Sharon added 3 commits March 18, 2026 01:39
Extract stopFlushing + flushOutput sequence into prepareForKill() method
on ProcessConnector. Updates EventDrivenWorkerPool.kill() to use the new
single call instead of two separate calls with inline comments.

Closes tech debt item #7 from PR #94 review.
Extract 4 private methods from the 153-line recover() orchestrator:
- cleanDeadWorkerRegistrations(): Phase 0 PID-based crash detection
- cleanupOldCompletedTasks(): Phase 1 old task cleanup
- recoverQueuedTasks(): Phase 2 re-queue with duplicate check
- recoverRunningTasks(): Phase 3 PID-based RUNNING task recovery

Pure extract-method refactor — zero behavior change, all 24 recovery
tests pass unmodified.

Closes tech debt item #5 from PR #94 review.
Documents RUNNING tasks being marked FAILED on first startup after
migration 9 (empty workers table = no live worker = crash detection
triggers). Includes mitigation steps and required dependency changes.

Closes tech debt item #4 from PR #94 review.
Dean Sharon added 3 commits March 18, 2026 11:56
…l status overwrite

Two P1 data-correctness fixes in recovery:

1. EPERM handling: process.kill(pid, 0) throws EPERM when a process exists
   but the caller lacks signal permission. The catch-all previously returned
   false, treating live processes as dead. Now EPERM returns true.

2. Terminal state guard: Both Phase 0 (cleanDeadWorkerRegistrations) and
   Phase 3 (recoverRunningTasks) called repo.update(FAILED) without checking
   if the task had already reached a terminal state. A TOCTOU race could
   overwrite COMPLETED/CANCELLED with FAILED. Now both phases call findById
   + isTerminalState before updating.
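The terminal-state guard described above amounts to the following sketch (status names are taken from this thread; the helper shape is assumed, the real code re-reads via findById before updating):

```typescript
type TaskStatus = "QUEUED" | "RUNNING" | "COMPLETED" | "FAILED" | "CANCELLED";

const TERMINAL = new Set<TaskStatus>(["COMPLETED", "FAILED", "CANCELLED"]);

function isTerminalState(status: TaskStatus): boolean {
  return TERMINAL.has(status);
}

// Decide what (if anything) to persist after re-reading the task: skip the
// FAILED overwrite when the task already reached a terminal state, closing
// the TOCTOU window between the liveness check and the update.
function guardFailedUpdate(current: TaskStatus | undefined): TaskStatus | null {
  if (current === undefined || isTerminalState(current)) return null;
  return "FAILED";
}
```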
outputCapture.clear(taskId) was chained in .then() after flushOutput().
If flush rejected, .then was skipped and the buffer leaked. Moved clear()
into .finally() so memory is freed regardless of flush outcome.
@dean0x dean0x merged commit a40ceac into main Mar 18, 2026
2 checks passed
@dean0x dean0x deleted the feat/worker-coordination-89 branch March 18, 2026 11:49
dean0x pushed a commit that referenced this pull request Mar 20, 2026
- Bump version 0.5.0 → 0.6.0
- Update release notes with all 8 PRs (was missing #85, #86, #91, #94, #100, #106, #107)
- Mark v0.6.0 as released in ROADMAP.md
- Update FEATURES.md architecture section for hybrid event model
- Expand "What's New in v0.6.0" with architectural simplification, bug fixes, tech debt
- Fix README roadmap: v0.6.1 → v0.7.0 for loops
- Update bug report template example version to 0.6.0
@dean0x dean0x mentioned this pull request Mar 20, 2026
7 tasks
dean0x added a commit that referenced this pull request Mar 20, 2026
## Summary

- Bump version `0.5.0` → `0.6.0` (package.json + package-lock.json)
- Expand release notes with all 8 PRs (#78, #85, #86, #91, #94, #100,
#106, #107) — was only covering #78
- Mark v0.6.0 as released in ROADMAP.md, update status and version
timeline
- Update FEATURES.md architecture section for hybrid event model (was
describing old fully event-driven architecture with removed services)
- Expand "What's New in v0.6.0" in FEATURES.md with architectural
simplification, additional bug fixes, tech debt, breaking changes,
migration 9
- Fix README roadmap version: `v0.6.1` → `v0.7.0` for task/pipeline
loops
- Update bug report template example version `0.5.0` → `0.6.0`

### GitHub Issues
- Closed #82 (cancelTasks scope — PR #106)
- Closed #95 (totalSize tail-slicing — PR #106)
- Updated #105 release tracker checklist (all items checked)

## Test plan
- [x] `npm run build` — clean compilation
- [x] `npm run test:all` — full suite passes (822 tests, 0 failures)
- [x] `npx biome check src/ tests/` — no lint errors
- [x] `package.json` version is `0.6.0`
- [x] Release notes file exists and covers all PRs
- [ ] After merge: trigger Release workflow from GitHub Actions
- [ ] After release published: close #105

---------

Co-authored-by: Dean Sharon <deanshrn@gmain.com>
