Skip to content

fix(scheduler): recover orphaned mongo executions in ~30 min instead of ~20 h#1499

Open
bingran-you wants to merge 1 commit intodevfrom
bry/fix-stale-execution-threshold
Open

fix(scheduler): recover orphaned mongo executions in ~30 min instead of ~20 h#1499
bingran-you wants to merge 1 commit intodevfrom
bry/fix-stale-execution-threshold

Conversation

@bingran-you
Copy link
Copy Markdown
Contributor

Summary

  • resolve_stale_execution_timeout() returned run_task_timeout * 2 + 30s (~20 h with defaults), so any task_executions row orphaned by a worker restart or ACI crash stayed status=running for up to 20 h. execute_due_task keeps deferring the task every 30 s as long as that row exists, stranding user-facing replies.
  • This PR gives reconciliation its own budget — default 30 min, overridable via STALE_EXECUTION_RECONCILE_AFTER_SECS. The per-run watchdog is unchanged and continues to protect actively-running tasks.

Why now

Today's scheduled scan caught two live stuck tasks (see #1459 comment):

  • prod 86199c20-… (email → erics_toy@icloud.com): execution row running for ~2 h with no container alive; user waiting for a morning report.
  • staging 9c4498ae-… (wechat_mp): same pattern after a content_filter failure triggered a retry that got orphaned by a worker restart.

Periodic reconciliation logged failed=0 stale_after_secs=72030 every 10 min because nothing was 20 h old — the budget was simply too generous.

What changed

  • New const DEFAULT_STALE_RECONCILE_AFTER_SECS = 30 * 60.
  • resolve_stale_execution_timeout() now reads STALE_EXECUTION_RECONCILE_AFTER_SECS instead of deriving from run_task_timeout.

Test plan

  • cargo check -p scheduler_module — clean (only pre-existing warnings).
  • Deploy to staging, restart worker mid-task, confirm the orphaned row is reclaimed within 30 min and the task retries cleanly.
  • Production canary: watch periodic stale execution reconciliation completed owners_changed=N superseded=M failed=K — expect failed >= 1 over a 30–60 min window for the backlog of orphaned rows, then drop to 0.

Rollback

Set STALE_EXECUTION_RECONCILE_AFTER_SECS=72030 in .env to restore prior behavior.

Refs #1459.

…chdog

The reconcile-stale-running-executions threshold was derived from the
watchdog timeout (`run_task_timeout * 2 + 30s`, ~20 h with defaults),
so executions orphaned by a worker restart or ACI crash stayed in
`status=running` for up to 20 h before the periodic pass would reclaim
them. In the meantime `execute_due_task` keeps deferring the task every
30 s, stranding user-facing replies indefinitely.

Give reconciliation its own budget (default 30 min, override via
`STALE_EXECUTION_RECONCILE_AFTER_SECS`). The watchdog still protects
actively-running tasks; this change only affects recovery of
abandoned rows.

Refs #1459.
@vercel
Copy link
Copy Markdown

vercel Bot commented Apr 21, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
dowhiz Ready Ready Preview, Comment Apr 21, 2026 2:14am

@bingran-you
Copy link
Copy Markdown
Contributor Author

Still producing user-facing outages. Today's 15:15 UTC dowhiz-service-debug sweep reconciled 10 more orphaned rows (8 prod, 2 staging) — see #1459 comment.

This is the 3rd recurrence today. Each sweep until this PR merges will keep needing manual updateMany({status:"running"}, {status:"failed"}) reconciliation plus issue-comment bookkeeping. Please prioritize review.

The diff is 12+/1- lines; the test plan steps are clear. Would be happy to add an automated test that restarts the worker mid-execution and asserts reconciliation within 30 min if that unblocks review.

@bingran-you
Copy link
Copy Markdown
Contributor Author

Sixth consecutive day this regression has stranded live user emails in prod (see today's comment on #1459). 20 due tasks blocked, 6 different users without replies, worker restarted 118 times in 10 h.

Bumping urgency — all checks are green (rust / website / Vercel), mergeStateStatus=CLEAN, mergeable=MERGEABLE. The change is scoped to one helper and is safely rolled back with STALE_EXECUTION_RECONCILE_AFTER_SECS=72030. Requesting merge today.

@bingran-you
Copy link
Copy Markdown
Contributor Author

Another recurrence today (2026-04-22 ~01:10 UTC) — prod had 30 orphaned running task_executions from ~21:55 UTC and 22:40 UTC, blocking cron-schedule jobs for 10 users for ~3 hours. Worker restart at 22:38 UTC lost the in-flight tracking; startup reconciliation's 72030s (20h) threshold didn't catch them.

Manually reconciled via dowhiz-service-debug audit script (30 rows → failed with reason). Also cleaned up 12 terminated ACI containers piled up in rg-dowhiz-oliver-dev (related to #1503).

This PR (shortening the threshold to ~30 min) would have auto-recovered within minutes instead of needing manual intervention. PR is MERGEABLE with all CI green — would appreciate a review so we can stop paying this on-call tax.

Task IDs reconciled today (for correlation with logs): 2e3ab3ef, be945d4a, c4e9314d, 15cc1487, d04f410f, e7f59a81, 3511dc6e, b0359ed4, 412caf85, 14fb2bd0, bca6001f, e5eda918, 8c17f153, 7b4bbc00, 00795eaa, b18efac0, eaa96484, 69cf89b6, ea5e93ca, e6716051, 9dbaab2c, 4a75ecb1, 14b2ae64, 07890318, df8f93d2, dd57d822.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

breeze:human Breeze needs human input to proceed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant