Stuck jobs: 383a376e and 95720726 (126+ min running) #27

@FrederikHandberg

Incident

Two jobs have been marked as running for 126+ minutes each, well over the 1-hour timeout:

  • Job 383a376e: created 2026-04-13T20:32:44Z (GAUNTLET v0 CLI build task)
  • Job 95720726: created 2026-04-13T20:34:36Z (pide-overnight chain step-0)
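As a rough illustration of the overrun, the sketch below computes how far past the 1-hour timeout each job was. The check time (126 minutes after the later job was created) and the helper names `parse_utc` and `overrun` are assumptions made for this example, not values or functions taken from the worker or the reaper.

```python
from datetime import datetime, timedelta

JOB_TIMEOUT = timedelta(hours=1)  # the 1-hour timeout cited in this issue


def parse_utc(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp with a trailing 'Z' into an aware datetime."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def overrun(created: str, now: datetime, timeout: timedelta = JOB_TIMEOUT) -> timedelta:
    """Return how far past the timeout a job is (zero if still within it)."""
    elapsed = now - parse_utc(created)
    return max(elapsed - timeout, timedelta(0))


# Assumed check time: 126 minutes after the later of the two jobs was created.
now = parse_utc("2026-04-13T20:34:36Z") + timedelta(minutes=126)
for job_id, created in [("383a376e", "2026-04-13T20:32:44Z"),
                        ("95720726", "2026-04-13T20:34:36Z")]:
    print(job_id, "over timeout by", overrun(created, now))
```

Any job with a non-zero overrun should already have been picked up by the reaper.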

Action taken

TIER 1 AUTO-FIX: Killed tmux sessions job-383a376e and job-95720726. Processes were confirmed alive but unresponsive.

Job 95720726 is particularly concerning: it belongs to the pide-overnight chain, and the chain YAML shows step-0 as completed at 2026-04-13T20:38:18Z, yet the job stayed marked as running for roughly 2 more hours.
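This kind of disagreement between chain state and job state could be caught by a periodic consistency check. The sketch below is only an illustration: the field names (`job_id`, `status`, `completed_at`), the in-memory data shapes, and the 5-minute grace period are assumptions, not the actual chain YAML or job YAML schema.

```python
from datetime import datetime, timedelta


def parse_utc(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp with a trailing 'Z' into an aware datetime."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def stale_running_jobs(chain_steps, job_status, now, grace=timedelta(minutes=5)):
    """Find jobs still 'running' although their chain step is 'completed'.

    chain_steps: list of dicts with hypothetical keys job_id/status/completed_at.
    job_status:  dict mapping job id -> status string (hypothetical shape).
    Returns a list of (job_id, lag) pairs where lag exceeds the grace period.
    """
    stale = []
    for step in chain_steps:
        if step.get("status") != "completed":
            continue
        job_id = step["job_id"]
        if job_status.get(job_id) != "running":
            continue
        lag = now - parse_utc(step["completed_at"])
        if lag > grace:
            stale.append((job_id, lag))
    return stale
```

A non-empty result would mean the chain advanced while the job record did not, which narrows the search to the completion path (candidates 1 and 2 under "Root cause candidates").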

Root cause candidates

  1. worker.py completion detection failed (the sentinel check missed the exit)
  2. Job YAML status not updated by worker on completion
  3. Tmux session became orphaned from parent worker process
  4. Reaper script didn't run or was ineffective
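For candidates 3 and 4, a reaper can address tmux sessions directly by name, so it does not depend on the parent worker still being alive. The following is a minimal sketch, not the existing reaper script: it assumes sessions follow the `job-<id>` naming seen in this incident, and `reap_session` is a hypothetical helper. It relies only on the standard `tmux has-session` and `tmux kill-session` commands.

```python
import subprocess


def session_name(job_id: str) -> str:
    """Session naming assumed from this incident (job-383a376e, job-95720726)."""
    return f"job-{job_id}"


def reap_session(job_id: str, run=subprocess.run) -> bool:
    """Kill the tmux session for a job if it exists.

    Returns True if a session was found and killed, False if none existed.
    The `run` parameter is injectable so the logic can be tested without tmux.
    """
    name = session_name(job_id)
    # `tmux has-session -t NAME` exits 0 only when the session exists.
    probe = run(["tmux", "has-session", "-t", name], capture_output=True)
    if probe.returncode != 0:
        return False
    # check=True raises CalledProcessError if the kill itself fails.
    run(["tmux", "kill-session", "-t", name], check=True)
    return True


# Example (requires tmux on PATH): reap_session("383a376e")
```

Logging the return value for every job the reaper considers would also show whether the reaper ran at all, which helps separate "didn't run" from "was ineffective".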

Next steps

  • Review worker.py completion detection in /apps/listen/worker.py
  • Check reaper script logs (com.jensen.reap-stuck-jobs launchd)
  • Verify chain-advance.py state updates
