fix(scheduler): recover orphaned mongo executions in ~30 min instead of ~20 h by bingran-you · Pull Request #1499 · KnoWhiz/DoWhiz

bingran-you · 2026-04-21T02:13:40Z

Summary

resolve_stale_execution_timeout() returned run_task_timeout * 2 + 30s (~20 h with defaults), so any task_executions row orphaned by a worker restart or ACI crash stayed status=running for up to 20 h. execute_due_task keeps deferring the task every 30 s as long as that row exists, stranding user-facing replies.
This PR gives reconciliation its own budget — default 30 min, overridable via STALE_EXECUTION_RECONCILE_AFTER_SECS. The per-run watchdog is unchanged and continues to protect actively-running tasks.

Why now

Today's scheduled scan caught two live stuck tasks (see #1459 comment):

prod 86199c20-… (email → erics_toy@icloud.com): execution row running for ~2 h with no container alive; user waiting for a morning report.
staging 9c4498ae-… (wechat_mp): same pattern after a content_filter failure triggered a retry that got orphaned by a worker restart.

Periodic reconciliation logged failed=0 stale_after_secs=72030 every 10 min because nothing was 20 h old — the budget was simply too generous.
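The logged `stale_after_secs=72030` is consistent with the old formula if the default watchdog timeout is 36000 s (10 h). That default is inferred from the log value, not stated in this PR. A minimal sketch of the arithmetic, with hypothetical function names:

```rust
use std::time::Duration;

/// Previous derivation described in the Summary: watchdog timeout * 2 + 30 s.
/// The function name is hypothetical; only the formula comes from the PR.
fn legacy_stale_after(run_task_timeout: Duration) -> Duration {
    run_task_timeout * 2 + Duration::from_secs(30)
}

/// A running row only counts as stale once its age reaches the threshold.
/// (Assumed comparison; the real query may use a strict inequality.)
fn is_stale(age: Duration, stale_after: Duration) -> bool {
    age >= stale_after
}
```

With a 36000 s watchdog, `legacy_stale_after` yields 72030 s, so the prod row that had been running for ~2 h is not stale under the old threshold but is stale under a 30 min one.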

What changed

New const DEFAULT_STALE_RECONCILE_AFTER_SECS = 30 * 60.
resolve_stale_execution_timeout() now reads STALE_EXECUTION_RECONCILE_AFTER_SECS instead of deriving from run_task_timeout.
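A rough sketch of what the new resolver could look like. The constant name, env var name, and 30 min default come from this PR. The `u64` type, the `Duration` return type, the `stale_timeout_from` helper, and the fallback on unparsable or zero values are assumptions for illustration, not the actual diff:

```rust
use std::time::Duration;

/// Default reconciliation budget named in this PR: 30 minutes.
/// The u64 type is an assumption.
const DEFAULT_STALE_RECONCILE_AFTER_SECS: u64 = 30 * 60;

/// Hypothetical helper: turn the raw env value into a timeout.
/// Falling back to the default on missing, unparsable, or zero
/// values is an assumption about the desired behavior.
fn stale_timeout_from(raw: Option<&str>) -> Duration {
    let secs = raw
        .and_then(|v| v.trim().parse::<u64>().ok())
        .filter(|secs| *secs > 0)
        .unwrap_or(DEFAULT_STALE_RECONCILE_AFTER_SECS);
    Duration::from_secs(secs)
}

/// Sketch of the resolver: reads the override from the environment
/// instead of deriving the value from run_task_timeout.
fn resolve_stale_execution_timeout() -> Duration {
    let raw = std::env::var("STALE_EXECUTION_RECONCILE_AFTER_SECS").ok();
    stale_timeout_from(raw.as_deref())
}
```

Splitting the parsing into a pure helper keeps the env lookup out of unit tests, which matters because environment variables are process-global.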

Test plan

cargo check -p scheduler_module — clean (only pre-existing warnings).
Deploy to staging, restart worker mid-task, confirm the orphaned row is reclaimed within 30 min and the task retries cleanly.
Production canary: watch periodic stale execution reconciliation completed owners_changed=N superseded=M failed=K — expect failed >= 1 over a 30–60 min window for the backlog of orphaned rows, then drop to 0.

Rollback

Set STALE_EXECUTION_RECONCILE_AFTER_SECS=72030 in .env to restore prior behavior.

Refs #1459.

…chdog

The reconcile-stale-running-executions threshold was derived from the watchdog timeout (`run_task_timeout * 2 + 30s`, ~20 h with defaults), so executions orphaned by a worker restart or ACI crash stayed in `status=running` for up to 20 h before the periodic pass would reclaim them. In the meantime `execute_due_task` keeps deferring the task every 30 s, stranding user-facing replies indefinitely.

Give reconciliation its own budget (default 30 min, override via `STALE_EXECUTION_RECONCILE_AFTER_SECS`). The watchdog still protects actively-running tasks; this change only affects recovery of abandoned rows.

Refs #1459.

vercel · 2026-04-21T02:13:44Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
dowhiz	Ready	Preview, Comment	Apr 21, 2026 2:14am

bingran-you · 2026-04-21T15:14:32Z

Still producing user-facing outages. Today's 15:15 UTC dowhiz-service-debug sweep reconciled 10 more orphaned rows (8 prod, 2 staging) — see #1459 comment.

This is the 3rd recurrence today. Each sweep until this PR merges will keep needing manual updateMany({status:"running"}, {status:"failed"}) reconciliation plus issue-comment bookkeeping. Please prioritize review.

The diff is +12/-1 lines and the test plan steps are clear. I would be happy to add an automated test that restarts the worker mid-execution and asserts reconciliation within 30 min if that unblocks review.
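One shape the core assertion of such a test could take, without a real worker or database, is an in-memory model of the reconciliation pass. Everything here (`Execution`, `Status`, `reconcile_stale`, the `age` field) is a hypothetical stand-in, not the scheduler's actual types or query:

```rust
use std::time::Duration;

#[derive(Debug, Clone, PartialEq)]
enum Status {
    Running,
    Failed,
}

/// Hypothetical stand-in for a task_executions row.
#[derive(Debug, Clone)]
struct Execution {
    status: Status,
    /// Time elapsed since the row was marked running.
    age: Duration,
}

/// Marks every running row that is at least `stale_after` old as failed
/// and returns how many rows changed.
fn reconcile_stale(rows: &mut [Execution], stale_after: Duration) -> usize {
    let mut failed = 0;
    for row in rows.iter_mut() {
        if row.status == Status::Running && row.age >= stale_after {
            row.status = Status::Failed;
            failed += 1;
        }
    }
    failed
}
```

A test built on this would seed one orphaned row about 2 h old and one fresh row, then check that a 72030 s threshold reclaims nothing while an 1800 s threshold fails only the orphaned row.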

bingran-you · 2026-04-21T20:18:22Z

Sixth consecutive day this regression has stranded live user emails in prod (see today's comment on #1459). 20 due tasks blocked, 6 different users without replies, worker restarted 118 times in 10 h.

Bumping urgency — all checks are green (rust / website / Vercel), mergeStateStatus=CLEAN, mergeable=MERGEABLE. The change is scoped to one helper and is safely rolled back with STALE_EXECUTION_RECONCILE_AFTER_SECS=72030. Requesting merge today.

bingran-you · 2026-04-22T01:14:18Z

Another recurrence today (2026-04-22 ~01:10 UTC) — prod had 30 orphaned running task_executions from ~21:55 UTC and 22:40 UTC, blocking cron-schedule jobs for 10 users for ~3 hours. Worker restart at 22:38 UTC lost the in-flight tracking; startup reconciliation's 72030s (20h) threshold didn't catch them.

Manually reconciled via dowhiz-service-debug audit script (30 rows → failed with reason). Also cleaned up 12 terminated ACI containers piled up in rg-dowhiz-oliver-dev (related to #1503).

This PR (shortening the threshold to ~30 min) would have auto-recovered these rows within about 40 minutes of being orphaned (the 30 min threshold plus at most one 10 min periodic pass) instead of needing manual intervention. The PR is MERGEABLE with all CI green; a review would be appreciated so we can stop paying this on-call tax.

Task IDs reconciled today (for correlation with logs): 2e3ab3ef, be945d4a, c4e9314d, 15cc1487, d04f410f, e7f59a81, 3511dc6e, b0359ed4, 412caf85, 14fb2bd0, bca6001f, e5eda918, 8c17f153, 7b4bbc00, 00795eaa, b18efac0, eaa96484, 69cf89b6, ea5e93ca, e6716051, 9dbaab2c, 4a75ecb1, 14b2ae64, 07890318, df8f93d2, dd57d822.

bingran-you mentioned this pull request Apr 21, 2026

Scheduler hot-loops due tasks when Mongo still marks them running #1459

Open

vercel Bot deployed to Preview April 21, 2026 02:14 View deployment

bingran-you mentioned this pull request Apr 21, 2026

Staging worker hot-loop flooding Cosmos with 10k+ running task_executions #1505

Open

This was referenced Apr 21, 2026

fix(run_task): use ACI delete retry helper for cleanup (#1503) #1504

Open

fix(scheduler): tolerate already-finalized rows in record_execution_finish (#1459) #1516

Open

bingran-you added the breeze:human label (Breeze needs human input to proceed) Apr 21, 2026

bingran-you mentioned this pull request Apr 21, 2026

[P2] Prod ACI delete-request timeout 20s too short — containers pile up in rg-dowhiz-oliver-dev #1503

Open

bingran-you wants to merge 1 commit into dev from bry/fix-stale-execution-threshold
