feat(heartbeat): automatic diagnostic retry for failed runs#1

Merged
Noesis-Boss merged 1 commit into master from diagnostic-retry on Apr 23, 2026

@Noesis-Boss (Owner)
Summary

When a run fails due to an adapter error (network timeout, rate limit, API error, etc.), Paperclip now automatically classifies the failure and can retry it up to 2 times using the heartbeat service as a diagnostic agent.

What changed

New column: diagnosticRetryCount

  • Added to heartbeat_runs via migration 0046_curious_deadpool.sql
  • Tracks how many diagnostic retries have been attempted for a run
  • Max 2 retries: count=0 on the original attempt, count=1 on the first retry, count=2 on the second retry, after which no further retries are enqueued

New function: maybeEnqueueDiagnosticRetry in heartbeat.ts

After every failed run in finalizeCompletedRun, this function:

  1. Checks if diagnosticRetryCount < 2
  2. Classifies the failure by parsing error, exitCode, and signal
  3. If canRetry = true (transient errors only), enqueues a diagnostic retry run
  4. Updates diagnosticRetryCount on the new run
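The four steps above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the `FailedRun` shape, the `classify` and `enqueueRun` callbacks, and the return value are all assumptions.

```typescript
// Hypothetical sketch of the diagnostic-retry gate run after a failed run.
// Field names and callback signatures are illustrative assumptions.
const MAX_DIAGNOSTIC_RETRIES = 2;

interface FailedRun {
  id: string;
  diagnosticRetryCount: number;
  error: string | null;
  exitCode: number | null;
  signal: string | null;
}

function maybeEnqueueDiagnosticRetry(
  run: FailedRun,
  classify: (
    error: string | null,
    exitCode: number | null,
    signal: string | null,
  ) => { canRetry: boolean },
  enqueueRun: (diagnosticRetryCount: number) => void,
): boolean {
  // Step 1: stop once two diagnostic retries have already been attempted.
  if (run.diagnosticRetryCount >= MAX_DIAGNOSTIC_RETRIES) return false;
  // Steps 2-3: classify the failure; only transient errors are retried.
  const { canRetry } = classify(run.error, run.exitCode, run.signal);
  if (!canRetry) return false;
  // Step 4: the new run carries an incremented diagnosticRetryCount.
  enqueueRun(run.diagnosticRetryCount + 1);
  return true;
}
```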

Failure classification (classifyFailure)

Each failure is categorized and marked retryable or not; only transient categories are retried:

| Category | Reason |
| --- | --- |
| process_killed | SIGKILL/SIGTERM |
| rate_limit | API rate limit |
| timeout | Network timeout |
| network_error | Connection reset, EOF errors |
| eof_error | Empty response |
| not_found | Missing file |
| permission_denied | Permissions issue |
| oom | Out of memory |
| unknown | Unclassified |
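A classifier over these categories could look like the sketch below. The string-matching heuristics and the per-category `canRetry` flags are assumptions inferred from the "transient errors only" rule, not the PR's exact implementation.

```typescript
// Hypothetical sketch of classifyFailure; matching rules and retryable
// flags are assumptions, not the PR's actual logic.
type FailureCategory =
  | "process_killed" | "rate_limit" | "timeout" | "network_error"
  | "eof_error" | "not_found" | "permission_denied" | "oom" | "unknown";

interface FailureClassification {
  category: FailureCategory;
  canRetry: boolean; // true only for transient failures
}

function classifyFailure(
  error: string | null,
  exitCode: number | null,
  signal: string | null,
): FailureClassification {
  const msg = (error ?? "").toLowerCase();
  if (signal === "SIGKILL" || signal === "SIGTERM")
    return { category: "process_killed", canRetry: true }; // assumed transient (external kill)
  if (exitCode === 137 || msg.includes("out of memory"))
    return { category: "oom", canRetry: false };
  if (msg.includes("rate limit") || msg.includes("429"))
    return { category: "rate_limit", canRetry: true };
  if (msg.includes("timeout") || msg.includes("etimedout"))
    return { category: "timeout", canRetry: true };
  if (msg.includes("econnreset") || msg.includes("socket hang up"))
    return { category: "network_error", canRetry: true };
  if (msg.includes("unexpected eof") || msg.includes("empty response"))
    return { category: "eof_error", canRetry: true };
  if (msg.includes("enoent") || msg.includes("not found"))
    return { category: "not_found", canRetry: false };
  if (msg.includes("eacces") || msg.includes("permission denied"))
    return { category: "permission_denied", canRetry: false };
  return { category: "unknown", canRetry: false };
}
```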

No changes for process_lost

The existing processLossRetryCount retry mechanism is unchanged.

Testing

  • Migration applied, server restarts successfully
  • TypeScript compiles (pre-existing upstream errors unrelated to this PR)

- Add diagnosticRetryCount column to heartbeat_runs table
- Add maybeEnqueueDiagnosticRetry function that runs after each failed run
- Diagnostic agent analyzes error (signal, exitCode, error message) and classifies failure type
- Retry is enqueued as a new run with context from the failed run (task, session, workspace)
- Max 2 diagnostic retries per failed run before giving up
- Includes migration for diagnostic_retry_count column
- Future upstream updates: rebase origin/master and re-apply migration if needed
Noesis-Boss merged commit 96a1bbf into master on Apr 23, 2026
1 of 3 checks passed
Noesis-Boss deleted the diagnostic-retry branch on April 23, 2026 at 10:06
