
fix: v0.6.0 correctness bugs (#84, #82, #95)#106

Merged
dean0x merged 11 commits into main from fix/v060-correctness-bugs
Mar 19, 2026

Conversation


dean0x (Owner) commented Mar 19, 2026

Summary

Fixes three correctness bugs discovered in v0.6.0 scheduled pipelines:

Changes

RecoveryManager Dependency Check (#84)

  • Added a dependency check so RecoveryManager skips re-enqueuing QUEUED tasks that are still blocked
  • Emits TaskFailed after marking crashed tasks FAILED so DependencyHandler can resolve downstream dependencies

CancelSchedule Execution Scope (#82)

  • Fixed CancelSchedule to cancel tasks from ALL active schedule executions
  • Previously only cancelled tasks from the most recent execution
  • Ensures cleanup is complete across all execution instances

Output Tail-Slicing totalSize Fix (#95)

  • Recalculate totalSize metadata after tail-slicing output
  • Previously kept stale totalSize from before slicing, causing mismatch

Testing

  • All existing tests pass (597 total)
  • Core logic tested via integration tests
  • No breaking changes

Related Issues

Closes #84, #82, #95

Dean Sharon added 4 commits March 19, 2026 10:16
Two correctness bugs in crash recovery:

1. recoverQueuedTasks() blindly re-enqueued tasks without checking
   DependencyRepository.isBlocked(). Blocked pipeline tasks would
   execute out of order after restart. Now checks dependencies and
   skips blocked tasks (fail-safe: skips on DB error too).

2. recoverRunningTasks() and cleanDeadWorkerRegistrations() marked
   crashed tasks as FAILED but never emitted TaskFailed events.
   DependencyHandler never resolved dependencies, leaving downstream
   pipeline tasks blocked forever. Now emits TaskFailed after each
   successful status update.
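The Phase 2 check described in this commit can be sketched as follows. The types and names here (`Result`, `DependencyRepo`, the enqueue callback) are assumptions based on the commit message, not Backbeat's actual interfaces; note also that a later commit in this PR inverts the error case to enqueue rather than skip.

```typescript
// Sketch only: types and names are assumptions from the commit message,
// not the real Backbeat interfaces.
type Result<T> = { ok: true; value: T } | { ok: false; error: Error };

interface DependencyRepo {
  isBlocked(taskId: string): Promise<Result<boolean>>;
}

// Phase 2 per this commit: skip blocked tasks, and skip on dep-check error
// too (a later commit in this PR inverts the error case to enqueue).
async function recoverQueuedTask(
  deps: DependencyRepo,
  enqueue: (taskId: string) => void,
  taskId: string,
): Promise<'enqueued' | 'skipped'> {
  const blocked = await deps.isBlocked(taskId);
  if (!blocked.ok || blocked.value) return 'skipped';
  enqueue(taskId);
  return 'enqueued';
}
```

The key behavioral change is that a blocked verdict short-circuits before `enqueue`, so pipeline tasks cannot execute out of order after a restart.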

getExecutionHistory(scheduleId, 1) only returned the latest execution.
Older overlapping CRON runs with in-flight tasks were missed.

Now fetches all executions, filters to 'triggered' status (active runs),
and cancels tasks across all of them. Completed/failed/missed/skipped
executions are skipped to avoid unnecessary no-op cancellation events.
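The widened cancellation scope can be illustrated with a small sketch. `ScheduleExecution` and its `taskIds` field are assumed shapes for illustration, not the real repository types:

```typescript
// Illustrative types; the real ScheduleExecution shape may differ.
type ExecutionStatus = 'triggered' | 'completed' | 'failed' | 'missed' | 'skipped';

interface ScheduleExecution {
  id: string;
  status: ExecutionStatus;
  taskIds: readonly string[];
}

// Collect task ids from every still-active (triggered) execution, not just
// the latest, skipping terminal executions to avoid no-op cancel events.
function tasksToCancel(history: readonly ScheduleExecution[]): string[] {
  return history
    .filter((e) => e.status === 'triggered')
    .flatMap((e) => [...e.taskIds]);
}
```

With overlapping CRON runs, this returns in-flight tasks from every active execution in history order, which is the behavior the fix introduces.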

After stdout.slice(-tail), totalSize still reflected the full pre-slice
size. Affected two code paths:

1. TaskManager.getLogs() DB fallback — now recalculates from sliced arrays
2. BufferedOutputCapture.getOutput() — recalculates when tail was applied,
   preserves byte-based buffer.totalSize for non-tail case
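The recalculation can be sketched as below. `linesByteSize` matches the shared helper this PR later extracts (UTF-8 bytes, not characters); the surrounding function is illustrative rather than the actual `getOutput()` implementation:

```typescript
// Shared helper per this PR: sum UTF-8 byte lengths, not character counts.
function linesByteSize(lines: readonly string[]): number {
  return lines.reduce((sum, line) => sum + Buffer.byteLength(line, 'utf8'), 0);
}

// Illustrative tail-slice path: recompute totalSize from the sliced array
// instead of reusing the stale pre-slice value. (The real non-tail path
// preserves the byte counter accumulated at capture time.)
function tailOutput(stdout: readonly string[], tail?: number) {
  const sliced = tail !== undefined && tail > 0 ? stdout.slice(-tail) : [...stdout];
  return { lines: sliced, totalSize: linesByteSize(sliced) };
}
```

Because both paths now count bytes, multi-byte content (emoji, CJK) yields the same units whether or not a tail was applied.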

- Check TaskFailed emit result in RecoveryManager (2 sites) and log on
  failure instead of silently swallowing errors
- Extract linesSize helper to avoid inline reduce duplication
- Use top-level TaskId import instead of inline import() in schedule-manager
- Use flatMap for cleaner active execution task collection
- Format multi-line constructor in integration test

greptile-apps bot commented Mar 19, 2026

Confidence Score: 4/5

  • Safe to merge — all three bug fixes are correct and well-tested; one ordering edge-case in recovery is a known architectural trade-off, not a regression.
  • The fixes are logically sound and each is covered by both unit and integration tests. The linesByteSize extraction resolves the previously flagged duplication and units inconsistency. The only open concern is that QUEUED tasks blocked on RUNNING crashed tasks (no dead-worker registration) depend on the Phase 3 TaskFailedDependencyHandler event chain executing while handlers are live — a scenario not covered by the new tests and acknowledged as an architectural edge case.
  • src/services/recovery-manager.ts — Phase 2/3 ordering for tasks blocked on RUNNING crashed tasks without a worker registration

Important Files Changed

Filename Overview
src/services/recovery-manager.ts Adds DependencyRepository injection, dependency-blocked check before re-queuing QUEUED tasks, and TaskFailed event emission after marking crashed/dead-worker tasks FAILED. Logic is sound; conservative fallback on dep-check error is a good call.
src/services/schedule-manager.ts Widens CancelSchedule to iterate all active (status='triggered') executions instead of only the latest one, correctly cancelling tasks across all overlapping runs. Filtering happens in-process after a DEFAULT_LIMIT=100 DB query (noted in prior review thread).
src/implementations/output-capture.ts BufferedOutputCapture now recalculates totalSize via linesByteSize when a tail is applied; TestOutputCapture also switched to linesByteSize. Both paths now use consistent UTF-8 byte counting.
src/utils/output.ts New shared utility exporting linesByteSize, eliminating the duplicated helper that previously lived in both output-capture.ts and task-manager.ts.
src/services/task-manager.ts getLogs now recalculates totalSize from the sliced arrays when a tail is applied to DB output, matching the fix in BufferedOutputCapture.
src/bootstrap.ts Adds DependencyRepository argument to RecoveryManager constructor; DependencyRepository was already imported and registered in the container.
tests/integration/task-persistence.test.ts New integration test verifies that blocked tasks are skipped during recovery while unblocked tasks are enqueued. Covers the new dependency-check path end-to-end against a real SQLite database.
tests/unit/services/recovery-manager.test.ts Adds comprehensive unit tests for blocked-task skipping, conservative enqueue on dep-check failure, mixed blocked/unblocked scenarios, and TaskFailed event emission for both dead-worker and crashed-task paths.
tests/unit/services/schedule-manager.test.ts Two new tests verify that cancelSchedule targets tasks from all triggered executions and skips completed executions. Coverage is thorough.

Sequence Diagram

sequenceDiagram
    participant B as Bootstrap
    participant RM as RecoveryManager
    participant DR as DependencyRepo
    participant TR as TaskRepository
    participant Q as TaskQueue
    participant EB as EventBus
    participant DH as DependencyHandler

    B->>RM: recover()
    note over RM: Phase 0 — Dead Worker Cleanup
    RM->>TR: findById(reg.taskId)
    RM->>TR: update(taskId, FAILED)
    RM->>EB: emit('TaskFailed') [NEW]
    EB-->>DH: handleTaskFailed → resolveDependency

    note over RM: Phase 2 — Recover QUEUED tasks
    RM->>TR: findByStatus(QUEUED)
    loop each queued task
        RM->>Q: contains(task.id)?
        RM->>DR: isBlocked(task.id) [NEW]
        alt blocked
            RM-->>RM: skip (blockedCount++)
        else not blocked / error
            RM->>Q: enqueue(task)
            RM->>EB: emit('TaskQueued')
        end
    end

    note over RM: Phase 3 — Recover RUNNING tasks
    RM->>TR: findByStatus(RUNNING)
    loop each running task
        RM->>TR: findById (TOCTOU guard)
        RM->>TR: update(taskId, FAILED)
        RM->>EB: emit('TaskFailed') [NEW]
        EB-->>DH: handleTaskFailed → resolves deps for QUEUED tasks skipped in Phase 2
    end

Last reviewed commit: "fix: invert isBlocke..."

Comment on lines +183 to +185
const historyResult = await this.scheduleRepository.getExecutionHistory(scheduleId);
if (historyResult.ok) {
const activeExecutions = historyResult.value.filter((e) => e.status === 'triggered');

P2 Active executions silently capped by DEFAULT_LIMIT = 100

getExecutionHistory(scheduleId) is called here without an explicit limit, so the underlying SQLite query falls back to DEFAULT_LIMIT = 100 (ordered scheduled_for DESC). For a high-frequency schedule (e.g., every minute) that has accumulated more than 100 historical records, any "triggered" executions beyond the 100 most recent will never be returned — and therefore their tasks will never be cancelled.

The prior code fetched only the single latest execution, so this PR widens the window but leaves the tail open. A safer approach would be a purpose-built query (e.g., getActiveExecutions) that filters by status = 'triggered' at the database level with no limit, or at minimum pass a sufficiently large explicit limit:

const historyResult = await this.scheduleRepository.getExecutionHistory(scheduleId, 1000); // illustrative explicit limit

Consider filtering at the SQL layer instead:

// e.g. new repository method:
// getActiveExecutions(scheduleId): Promise<Result<readonly ScheduleExecution[]>>
// SELECT * FROM schedule_executions WHERE schedule_id = ? AND status = 'triggered'

This avoids loading and JS-filtering up to 100 rows just to find the subset that are still active.

import { err, ok, Result } from '../core/result.js';

/** Sum the character lengths of all lines in an array */
function linesSize(lines: readonly string[]): number {

BLOCKING: Duplicated linesSize utility function

The identical linesSize() helper is now defined in two separate files: here at line 13 AND in src/services/task-manager.ts:33. All 8 code reviews flagged this as a DRY violation and maintenance risk.

Impact: If the calculation logic ever needs to change (e.g., to account for newlines or switch to byte-length measurement), both sites must be updated in sync. It's easy to miss one and introduce divergence.

Fix: Extract to a shared utility file (src/utils/output.ts) and import from both:

// src/utils/output.ts
/** Sum the character lengths of all lines in an array */
export function linesSize(lines: readonly string[]): number {
  return lines.reduce((sum, line) => sum + line.length, 0);
}

Then remove both duplicate definitions and import in their place.


8 reviewers flagged this issue across architecture, complexity, consistency, performance, regression, security, and TypeScript reviews

const frozenStdout = Object.freeze([...stdout]);
const frozenStderr = Object.freeze([...stderr]);
const wasTailSliced = tail !== undefined && tail > 0;
const totalSize = wasTailSliced

BLOCKING: totalSize uses character length but capture() uses byte length

The recalculated totalSize here uses linesSize() which sums line.length (character count), but the capture() method at line 51 accumulates totalSize using Buffer.byteLength(data, 'utf8') (byte count).

The Problem: For ASCII-only output these are identical, but for multi-byte characters (emoji, CJK, accented chars), the values diverge. When tail-slicing is applied, totalSize reflects character length. When NOT applied (non-tail code path at line 120), it reflects byte length. This means the same field returns different units depending on code path — a latent correctness bug.

Impact (6 reviewers flagged this): Consumers relying on totalSize comparisons (e.g., process-connector.ts:143) may behave differently depending on which path was taken. For Claude Code output (mostly ASCII) this is unlikely to cause bugs today, but it's a semantic inconsistency introduced by this fix.

Fix: Choose ONE unit consistently. Either:

  1. Use Buffer.byteLength in linesSize to match capture(), OR
  2. Use string.length in capture() to match linesSize()

Given the maxBufferSize limit uses byte-length, byte-length is likely correct:

function linesSize(lines: readonly string[]): number {
  return lines.reduce((sum, line) => sum + Buffer.byteLength(line, 'utf8'), 0);
}

Architecture, consistency, regression, security, and TypeScript reviews flagged this

error: expect.objectContaining({ message: 'Worker process died (dead PID detected)' }),
exitCode: -1,
});
});

HIGH CONFIDENCE: Missing error path tests for TaskFailed emission failures

The tests above verify successful TaskFailed emission (happy paths), but the production code in recovery-manager.ts has error-handling branches (lines 129-132 and 271-274) that are NOT tested.

The Gap: When eventBus.emit('TaskFailed', ...) returns an error, the code logs it and continues recovery. But there are no tests verifying:

  1. The error message is logged correctly
  2. Recovery doesn't abort when event emission fails

Impact: If the emit-failure handler is accidentally removed or broken, no test would catch it. Given this is error-path logic (recovery during system failure), it deserves explicit coverage.

Missing Tests:

// For crashed task path
it('should log error but continue when TaskFailed emit fails for crashed task', async () => {
  // Setup: taskFailed emit returns error
  eventBus.emit.mockResolvedValueOnce(err(emitError));
  
  const result = await manager.recover();
  
  expect(result.ok).toBe(true); // Recovery continues
  expect(logger.error).toHaveBeenCalledWith('Failed to emit TaskFailed event', ...);
});

// For dead worker path
it('should log error but continue when TaskFailed emit fails for dead worker', async () => {
  // Similar pattern
});

Tests review identified this gap: 2 error branches, 0 tests covering them

});

// Emit TaskFailed so DependencyHandler resolves deps for downstream tasks
const failedEmitResult = await this.eventBus.emit('TaskFailed', {

Summary: 3 Inline Comments Posted + Low-Confidence Findings

Comments Created

Inline comments (≥80% reviewer confidence, blocking):

  1. Duplicated linesSize function (line 13, output-capture.ts) — 8 reviewers flagged
  2. totalSize byte-vs-character inconsistency (line 118, output-capture.ts) — 6 reviewers flagged
  3. Missing TaskFailed error path tests (line 617, recovery-manager.test.ts) — Tests reviewer flagged

Lower-Confidence Findings (60-79% confidence)

These were flagged by 1-2 reviewers only; consolidated into summary rather than individual inline comments:

Should Fix (Medium confidence)

  • TestOutputCapture uses different totalSize calculation — Should align with BufferedOutputCapture using shared linesSize helper
  • cancelSchedule unbounded execution fetch (schedule-manager.ts:183) — Now defaults to 100 rows instead of 1 (the bug fix), but could add SQL-level filter to avoid fetch-then-filter pattern
  • RecoveryManager constructor parameter count (6 params) — Approaching threshold; consider configuration object if another param is added
  • Complexity: cleanDeadWorkerRegistrations nesting depth — Extracting emitTaskFailed helper would reduce this
  • Performance: Sequential N dependency checks in recovery loop — Current scale acceptable (N typically single digits); can batch if scale grows

Pre-existing Issues (Not introduced by this PR)

  • Output capture doesn't fail explicitly on error (empty catch at line 84)
  • worker-handler.ts:445 emits TaskFailed with new Error() instead of BackbeatError
  • Tail tests lack explicit expect(output.ok).toBe(true) guards
  • TOCTOU window in dependency check (mitigated by queue deduplication)
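The queue-deduplication mitigation mentioned in the last bullet can be sketched like this; the class and method names are illustrative, with the sequence diagram's `contains(task.id)?` check being the real counterpart:

```typescript
// Hypothetical sketch of the dedup guard that mitigates the TOCTOU window:
// even if isBlocked() races with a concurrent enqueue, re-enqueuing the same
// task id is a no-op.
class DedupQueue<T extends { id: string }> {
  private ids = new Set<string>();
  private items: T[] = [];

  contains(id: string): boolean {
    return this.ids.has(id);
  }

  enqueue(item: T): boolean {
    if (this.ids.has(item.id)) return false; // duplicate: no-op
    this.ids.add(item.id);
    this.items.push(item);
    return true;
  }
}
```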

What the PR Does Well

✓ Fixes 4 real correctness bugs (dependency checking, multi-execution cancellation, totalSize recalc, output recovery)
✓ Proper error handling with Result types
✓ DependencyRepository injection via interface (DIP pattern)
✓ Good test coverage for fixed behaviors (unit + integration)
✓ All consumers of RecoveryManager constructor updated


Recommendation

CHANGES REQUESTED on the 3 blocking issues. Once addressed, this PR improves correctness significantly.


Generated by Claude Code review analysis — 8 comprehensive reviews spanning architecture, complexity, consistency, performance, regression, security, tests, and TypeScript

Dean Sharon and others added 5 commits March 19, 2026 19:13
Add 2 tests verifying that TaskFailed emit failures in both recovery
paths (dead worker cleanup and crashed running task) are logged but
do not prevent recovery from completing. Add ARCHITECTURE EXCEPTION
comments documenting the double-write pattern in both code paths.

Co-Authored-By: Claude <noreply@anthropic.com>
Validates that RecoveryManager skips re-queuing QUEUED tasks with
unresolved dependencies using real SQLiteDependencyRepository. The
existing unit test covers this with mocks; this integration test
exercises the full SQL path (addDependency -> isBlocked -> recovery).

Co-Authored-By: Claude <noreply@anthropic.com>
Extract linesSize from output-capture.ts and task-manager.ts into a
shared linesByteSize utility in src/utils/output.ts. The new function
uses Buffer.byteLength (UTF-8 bytes) instead of string.length
(characters), fixing a unit inconsistency where:

- capture() tracked totalSize in bytes via Buffer.byteLength
- getOutput() tail-slicing recalculated totalSize in characters
- TestOutputCapture always used characters

Multi-byte content (emoji, CJK) now returns consistent byte values
regardless of whether tail-slicing was applied.

Fixes: output-capture:51:correctness, output-capture:213:consistency
Resolves: output-capture:13:dry (deliberate duplication superseded by
correctness fix requiring shared implementation)

Co-Authored-By: Claude <noreply@anthropic.com>
Dean Sharon added 2 commits March 19, 2026 22:38
…tasks

When dependencyRepo.isBlocked() errors during recovery, fall through to
enqueue instead of skipping. The previous skip behavior could permanently
strand dependency-free tasks since TaskUnblocked would never fire for them.
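The inverted fallback reduces to a small predicate; the name `shouldEnqueue` and the `BlockedCheck` type are illustrative, mirroring the Result shape used elsewhere in this PR:

```typescript
type BlockedCheck = { ok: true; value: boolean } | { ok: false; error: Error };

// Only a definitive "blocked" verdict skips the task. A dep-check error now
// falls through to enqueue, so dependency-free tasks cannot be permanently
// stranded waiting on a TaskUnblocked event that will never fire.
function shouldEnqueue(check: BlockedCheck): boolean {
  return !(check.ok && check.value);
}
```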
@dean0x dean0x merged commit f195205 into main Mar 19, 2026
2 checks passed
@dean0x dean0x deleted the fix/v060-correctness-bugs branch March 19, 2026 21:29
Comment on lines +64 to 65
const { queuedCount, blockedCount } = await this.recoverQueuedTasks(queuedResult.value);
const failedCount = await this.recoverRunningTasks(runningResult.value);

P1 QUEUED tasks blocked on RUNNING crashed tasks may be stranded

recoverQueuedTasks (Phase 2) runs before recoverRunningTasks (Phase 3). Consider a QUEUED task B that depends on a RUNNING (crashed) task A that has no dead-worker registration:

  1. Phase 2 — isBlocked(B) returns true → B is skipped. ✓
  2. Phase 3 — Task A is marked FAILED and TaskFailed is emitted.
  3. If DependencyHandler is live it resolves the dependency and emits TaskUnblocked, re-queuing B via the normal event chain. ✓
  4. If DependencyHandler is not yet wired at recovery time, B is never re-queued.

Phase 0 (cleanDeadWorkerRegistrations) avoids this for tasks backed by a dead worker registration because it runs first, and the subsequent findByStatus(RUNNING) no longer includes those tasks. The gap is for RUNNING tasks whose worker row is simply missing (not detected in Phase 0). The architecture comment acknowledges handler-readiness uncertainty, but this specific interplay isn't called out.

A low-cost safety net would be a second pass over still-blocked tasks after Phase 3 completes, re-queuing any whose dependency is now resolved:

// After recoverRunningTasks — re-check tasks that were skipped as blocked
for (const task of blockedTasks) {
  const stillBlocked = await this.dependencyRepo.isBlocked(task.id);
  if (stillBlocked.ok && !stillBlocked.value) {
    this.queue.enqueue(task);
  }
}

Not blocking merge, but worth tracking.

dean0x pushed a commit that referenced this pull request Mar 20, 2026
- Bump version 0.5.0 → 0.6.0
- Update release notes with all 8 PRs (was missing #85, #86, #91, #94, #100, #106, #107)
- Mark v0.6.0 as released in ROADMAP.md
- Update FEATURES.md architecture section for hybrid event model
- Expand "What's New in v0.6.0" with architectural simplification, bug fixes, tech debt
- Fix README roadmap: v0.6.1 → v0.7.0 for loops
- Update bug report template example version to 0.6.0
@dean0x dean0x mentioned this pull request Mar 20, 2026
7 tasks
dean0x added a commit that referenced this pull request Mar 20, 2026
## Summary

- Bump version `0.5.0` → `0.6.0` (package.json + package-lock.json)
- Expand release notes with all 8 PRs (#78, #85, #86, #91, #94, #100,
#106, #107) — was only covering #78
- Mark v0.6.0 as released in ROADMAP.md, update status and version
timeline
- Update FEATURES.md architecture section for hybrid event model (was
describing old fully event-driven architecture with removed services)
- Expand "What's New in v0.6.0" in FEATURES.md with architectural
simplification, additional bug fixes, tech debt, breaking changes,
migration 9
- Fix README roadmap version: `v0.6.1` → `v0.7.0` for task/pipeline
loops
- Update bug report template example version `0.5.0` → `0.6.0`

### GitHub Issues
- Closed #82 (cancelTasks scope — PR #106)
- Closed #95 (totalSize tail-slicing — PR #106)
- Updated #105 release tracker checklist (all items checked)

## Test plan
- [x] `npm run build` — clean compilation
- [x] `npm run test:all` — full suite passes (822 tests, 0 failures)
- [x] `npx biome check src/ tests/` — no lint errors
- [x] `package.json` version is `0.6.0`
- [x] Release notes file exists and covers all PRs
- [ ] After merge: trigger Release workflow from GitHub Actions
- [ ] After release published: close #105

---------

Co-authored-by: Dean Sharon <deanshrn@gmain.com>


Development

Successfully merging this pull request may close these issues.

fix: RecoveryManager should check dependencies before re-enqueuing QUEUED tasks

1 participant