fix(scheduler): tolerate already-finalized rows in record_execution_finish (#1459) #1516
Open
bingran-you wants to merge 2 commits into dev
…lized rows

When the stale-execution reconciler (or the `dowhiz-service-debug` audit) finalizes a `task_executions` row ahead of the worker, the strict `status: "running"` filter in `record_execution_finish` yields `matched_count == 0` and the function returns `SchedulerError::Storage`. That error cascades into `after_execute_failed`, which counts as a retry and, for `OneShot` `RunTask` entries, eventually flips the task to `enabled=false`. The agent may have actually finished successfully; the user just never sees a reply.

Observed today (2026-04-21) on prod `dowhiz_production_little_bear`:

```
ERROR scheduler task 7161c400-... for user 8574d066-... failed: storage error: missing running execution row ...
```

Task `7161c400` ended up `enabled=false` for an inbound email to `erics_toy@icloud.com`. Two more occurrences (`d04f410f`, `eaa96484`) landed at 19:02 UTC in the same sweep.

This change treats `matched_count == 0` as best-effort: warn-log the condition and return `Ok(())`. The scheduler's intent (finalize the execution) is already satisfied by whichever path got there first, so surfacing a hard error here misrepresents a successful (or reconciled) run as a failure.

Refs #1459.
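To make the behavior change in the commit message concrete, here is a minimal sketch that contrasts the strict and the tolerant finish paths. It is not the code in `mongo.rs`: `ExecutionStore` is a hypothetical in-memory stand-in for the Mongo collection, `update_if_running` imitates an update filtered on `status: "running"`, and the `warnings` vector stands in for the warn log. The names `record_finish_strict` and `record_finish_tolerant` are illustrative only.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the status field of a `task_executions` row.
#[derive(Debug, Clone, PartialEq)]
pub enum ExecStatus {
    Running,
    Succeeded,
    Failed,
}

/// Simplified error type; only the storage variant matters for this sketch.
#[derive(Debug, PartialEq)]
pub enum SchedulerError {
    Storage(String),
}

/// In-memory stand-in for the Mongo collection (assumption, illustration only).
#[derive(Default)]
pub struct ExecutionStore {
    rows: HashMap<u64, ExecStatus>,
    /// Collected warn-log lines, so the tolerant path can be observed.
    pub warnings: Vec<String>,
}

impl ExecutionStore {
    pub fn insert(&mut self, id: u64, status: ExecStatus) {
        self.rows.insert(id, status);
    }

    pub fn status(&self, id: u64) -> Option<&ExecStatus> {
        self.rows.get(&id)
    }

    /// Imitates an update filtered on `status: "running"`; returns the matched count.
    fn update_if_running(&mut self, id: u64, new_status: ExecStatus) -> u64 {
        if let Some(status) = self.rows.get_mut(&id) {
            if *status == ExecStatus::Running {
                *status = new_status;
                return 1;
            }
        }
        0
    }

    /// Old behavior: zero matched rows is a hard storage error.
    pub fn record_finish_strict(
        &mut self,
        id: u64,
        new_status: ExecStatus,
    ) -> Result<(), SchedulerError> {
        if self.update_if_running(id, new_status) == 0 {
            return Err(SchedulerError::Storage(
                "missing running execution row".to_string(),
            ));
        }
        Ok(())
    }

    /// New behavior: zero matched rows is logged and tolerated.
    pub fn record_finish_tolerant(
        &mut self,
        id: u64,
        new_status: ExecStatus,
    ) -> Result<(), SchedulerError> {
        if self.update_if_running(id, new_status) == 0 {
            self.warnings
                .push(format!("execution {} already finalized; skipping", id));
        }
        Ok(())
    }
}
```

In this model, a row that the reconciler already marked `Failed` keeps that status under the tolerant path; the worker's late finish call only adds a warning.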
Follow-up landed in
Added Mongo-backed regressions for both cases and reran:
Both passed locally. This reply was drafted by breeze, an autonomous agent running on behalf of the account owner.
This was referenced Apr 22, 2026
Summary
`record_execution_finish` required `status: "running"` in its Mongo filter. When the stale-execution reconciler or the `dowhiz-service-debug` audit flips a row to `failed`/`superseded` before the worker's finish call lands, `matched_count == 0` and the function returns `SchedulerError::Storage`. That propagates into `after_execute_failed` and, for `OneShot` `RunTask` entries, eventually disables the task, even though the agent may have actually completed its work.

Evidence
Today's production sweep (`dowhiz_production_little_bear`, `dw_worker` 118 restarts) surfaced three instances of this cascade; see #1459 audit comment:

```
2026-04-21T17:29:37Z ERROR scheduler task 7161c400-d45d-54a6-26c1-530fb5432815
for user 8574d066-... failed: storage error: missing running execution row ...
2026-04-21T19:02:23Z ERROR scheduler task d04f410f-1e2c-1d35-3208-0f84890327ab
for user 2aedb09a-... failed: storage error: missing running execution row ...
2026-04-21T19:02:51Z ERROR scheduler task eaa96484-8ccc-692c-2060-6ce150904025
for user 8574d066-... failed: storage error: missing running execution row ...
```
Task `7161c400` is now `enabled=false` in prod Mongo (user inbound to `erics_toy@icloud.com`) — the direct user-visible outcome of this cascade.
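The cascade behind that outcome can be modeled in a few lines. This is a simplified sketch under stated assumptions, not the scheduler's code: `MAX_ATTEMPTS` is a hypothetical retry budget (the real limit is not stated in this PR), `Task` and `Schedule` are illustrative types, and `tick` stands in for one scheduler run where the agent work itself succeeded and only the bookkeeping result differs.

```rust
/// Hypothetical retry budget; the real limit is not stated in this PR.
pub const MAX_ATTEMPTS: u32 = 3;

/// Illustrative schedule kinds; only `OneShot` is disabled on exhaustion here.
#[derive(Debug, PartialEq)]
pub enum Schedule {
    OneShot,
    Recurring,
}

/// Simplified task record with only the fields the cascade touches.
#[derive(Debug)]
pub struct Task {
    pub schedule: Schedule,
    pub enabled: bool,
    pub failed_attempts: u32,
}

impl Task {
    pub fn new(schedule: Schedule) -> Self {
        Task {
            schedule,
            enabled: true,
            failed_attempts: 0,
        }
    }

    /// Simplified model of `after_execute_failed`: count the failure and
    /// disable a one-shot task once the assumed budget is spent.
    pub fn after_execute_failed(&mut self) {
        self.failed_attempts += 1;
        if self.schedule == Schedule::OneShot && self.failed_attempts >= MAX_ATTEMPTS {
            self.enabled = false;
        }
    }
}

/// One scheduler run. The agent work is assumed to have succeeded; the
/// result of the finish bookkeeping alone decides whether the run counts
/// as a failure.
pub fn tick(task: &mut Task, finish_result: Result<(), String>) {
    if finish_result.is_err() {
        task.after_execute_failed();
    }
}
```

Feeding `tick` the storage error `MAX_ATTEMPTS` times disables a one-shot task in this model, while the same runs with `Ok(())` leave it enabled with no failures recorded.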
Change
- `DoWhiz_service/scheduler_module/src/scheduler/store/mongo.rs`: when `record_execution_finish`'s update matches no row, warn-log and return `Ok(())` instead of erroring. The scheduler's intent is already satisfied.
- Left unchanged: `reconcile_stale_running_executions_for_task`'s `matched_count` assertion; that path has different invariants.

Complementary to PR #1499
#1499 reduces how long a row can stay orphaned (20h → 30min). This PR prevents a successful agent run from being mis-reported as a scheduler failure when that reconciler wins the race.
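One way to see why the two PRs interact is to compare the thresholds directly. The sketch below assumes a staleness rule that is not confirmed by this PR: a `running` row is reconciled once it has been running longer than the threshold. Only the two durations (20h and 30min) come from the text above; `is_stale` and `reconciler_wins` are hypothetical helpers.

```rust
use std::time::Duration;

/// Thresholds quoted in the PR text: #1499 moves from 20 hours to 30 minutes.
pub const OLD_THRESHOLD: Duration = Duration::from_secs(20 * 60 * 60);
pub const NEW_THRESHOLD: Duration = Duration::from_secs(30 * 60);

/// Assumed staleness rule: a `running` row older than the threshold is
/// eligible for reconciliation. The real rule may differ.
pub fn is_stale(running_for: Duration, threshold: Duration) -> bool {
    running_for > threshold
}

/// True when, under the assumed rule, the reconciler could finalize the row
/// before a worker that needs `task_runtime` gets to call finish.
pub fn reconciler_wins(task_runtime: Duration, threshold: Duration) -> bool {
    is_stale(task_runtime, threshold)
}
```

Under that assumption, a task that runs for 45 minutes is never reconciled early with the 20h threshold but can be with the 30min one, which is the case where the worker's finish call finds no `running` row.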
Test plan
- `cargo check -p scheduler_module --lib`: clean (only pre-existing warnings).
- Mark a `running` row `failed` mid-task, then confirm the worker logs the warn line and `tasks.enabled` is not flipped to `false`.
- Expect the `missing running execution row` error to disappear from `dw-worker-error.log`.

Refs #1459.