fix(scheduler): recover orphaned mongo executions in ~30 min instead of ~20 h#1499
fix(scheduler): recover orphaned mongo executions in ~30 min instead of ~20 h#1499bingran-you wants to merge 1 commit intodevfrom
Conversation
…chdog The reconcile-stale-running-executions threshold was derived from the watchdog timeout (`run_task_timeout * 2 + 30s`, ~20 h with defaults), so executions orphaned by a worker restart or ACI crash stayed in `status=running` for up to 20 h before the periodic pass would reclaim them. In the meantime `execute_due_task` keeps deferring the task every 30 s, stranding user-facing replies indefinitely. Give reconciliation its own budget (default 30 min, override via `STALE_EXECUTION_RECONCILE_AFTER_SECS`). The watchdog still protects actively-running tasks; this change only affects recovery of abandoned rows. Refs #1459.
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
Still producing user-facing outages. Today's 15:15 UTC This is the 3rd recurrence today. Each sweep until this PR merges will keep needing manual The diff is 12+/1- lines; the test plan steps are clear. Would be happy to add an automated test that restarts the worker mid-execution and asserts reconciliation within 30 min if that unblocks review. |
|
Sixth consecutive day this regression has stranded live user emails in prod (see today's comment on #1459). 20 due tasks blocked, 6 different users without replies, worker restarted 118 times in 10 h. Bumping urgency — all checks are green ( |
|
Another recurrence today (2026-04-22 ~01:10 UTC) — prod had 30 orphaned Manually reconciled via This PR (shortening the threshold to ~30 min) would have auto-recovered within minutes instead of needing manual intervention. PR is Task IDs reconciled today (for correlation with logs): |
Summary
resolve_stale_execution_timeout()returnedrun_task_timeout * 2 + 30s(~20 h with defaults), so anytask_executionsrow orphaned by a worker restart or ACI crash stayedstatus=runningfor up to 20 h.execute_due_taskkeeps deferring the task every 30 s as long as that row exists, stranding user-facing replies.STALE_EXECUTION_RECONCILE_AFTER_SECS. The per-run watchdog is unchanged and continues to protect actively-running tasks.Why now
Today's scheduled scan caught two live stuck tasks (see #1459 comment):
86199c20-…(email →erics_toy@icloud.com): execution rowrunningfor ~2 h with no container alive; user waiting for a morning report.9c4498ae-…(wechat_mp): same pattern after a content_filter failure triggered a retry that got orphaned by a worker restart.Periodic reconciliation logged
failed=0 stale_after_secs=72030every 10 min because nothing was 20 h old — the budget was simply too generous.What changed
DEFAULT_STALE_RECONCILE_AFTER_SECS = 30 * 60.resolve_stale_execution_timeout()now readsSTALE_EXECUTION_RECONCILE_AFTER_SECSinstead of deriving fromrun_task_timeout.Test plan
cargo check -p scheduler_module— clean (only pre-existing warnings).periodic stale execution reconciliation completed owners_changed=N superseded=M failed=K— expectfailed >= 1over a 30–60 min window for the backlog of orphaned rows, then drop to 0.Rollback
Set
STALE_EXECUTION_RECONCILE_AFTER_SECS=72030in.envto restore prior behavior.Refs #1459.