
fix(scheduler): tolerate already-finalized rows in record_execution_finish (#1459) #1516

Open
bingran-you wants to merge 2 commits into dev from bry/fix-mongo-execution-finish-tolerant

Conversation

@bingran-you
Contributor

Summary

`record_execution_finish` required `status: "running"` in its Mongo filter. When the stale-execution reconciler or the `dowhiz-service-debug` audit flips a row to failed/superseded before the worker's finish call lands, `matched_count == 0` and the function returns `SchedulerError::Storage`. That error propagates into `after_execute_failed` and, for `OneShot` `RunTask` entries, eventually disables the task, even though the agent may have actually completed its work.
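For context, a minimal sketch of the pre-fix shape of the call, assuming the mongodb 2.x driver API; the field names (`execution_id`, `status`) and the error type are illustrative stand-ins for the real code in `mongo.rs`:

```rust
use mongodb::bson::{doc, Document};
use mongodb::Collection;

// Stand-in for the crate's real error type.
#[derive(Debug)]
pub enum SchedulerError {
    Storage(String),
}

pub async fn record_execution_finish(
    executions: &Collection<Document>,
    execution_id: &str,
    final_status: &str,
) -> Result<(), SchedulerError> {
    let result = executions
        .update_one(
            // Strict filter: only a row still marked "running" matches.
            doc! { "execution_id": execution_id, "status": "running" },
            doc! { "$set": { "status": final_status } },
            None,
        )
        .await
        .map_err(|e| SchedulerError::Storage(e.to_string()))?;

    if result.matched_count == 0 {
        // If the reconciler (or a manual audit) already finalized the row,
        // this branch fires even though the agent finished its work.
        return Err(SchedulerError::Storage(
            "missing running execution row".into(),
        ));
    }
    Ok(())
}
```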

Evidence

Today's production sweep (dowhiz_production_little_bear, dw_worker 118 restarts) surfaced three instances of this cascade — see #1459 audit comment:

```
2026-04-21T17:29:37Z ERROR scheduler task 7161c400-d45d-54a6-26c1-530fb5432815
for user 8574d066-... failed: storage error: missing running execution row ...
2026-04-21T19:02:23Z ERROR scheduler task d04f410f-1e2c-1d35-3208-0f84890327ab
for user 2aedb09a-... failed: storage error: missing running execution row ...
2026-04-21T19:02:51Z ERROR scheduler task eaa96484-8ccc-692c-2060-6ce150904025
for user 8574d066-... failed: storage error: missing running execution row ...
```

Task `7161c400` is now `enabled=false` in prod Mongo (user inbound to `erics_toy@icloud.com`) — the direct user-visible outcome of this cascade.

Change

  • DoWhiz_service/scheduler_module/src/scheduler/store/mongo.rs: when `record_execution_finish`'s update matches no row, warn-log and return `Ok(())` instead of erroring (see the sketch after this list); the scheduler's intent is already satisfied.
  • No change to reconcile_stale_running_executions_for_task's matched_count assertion — that path has different invariants.
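A sketch of the tolerant tail, replacing the error branch in the hypothetical function above; the `tracing::warn!` fields are illustrative:

```rust
// Post-fix: a zero match is logged, not fatal.
if result.matched_count == 0 {
    // Another path (stale-execution reconciler or manual audit) already
    // finalized this row; the scheduler's intent is satisfied.
    tracing::warn!(
        execution_id,
        "record_execution_finish matched no running row; treating as already finalized"
    );
}
Ok(())
```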

Complementary to PR #1499

#1499 reduces how long a row can stay orphaned (20h → 30min). This PR prevents a successful agent run from being mis-reported as a scheduler failure when that reconciler wins the race.

Test plan

  • cargo check -p scheduler_module --lib — clean (only pre-existing warnings).
  • Deploy to staging; reproduce by marking a running row failed mid-task, then confirm the worker logs the warn line and `tasks.enabled` is not flipped to false.
  • Production canary: watch for the missing running execution row error to disappear from dw-worker-error.log.

Refs #1459.

fix(scheduler): tolerate already-finalized rows

When the stale-execution reconciler (or the `dowhiz-service-debug` audit) finalizes
a `task_executions` row ahead of the worker, the strict `status: "running"` filter
in `record_execution_finish` yields `matched_count == 0` and the function returns
`SchedulerError::Storage`. That error cascades into `after_execute_failed`, which
counts as a retry and — for `OneShot` `RunTask` entries — eventually flips the
task to `enabled=false`. The agent may have actually finished successfully; the
user just never sees a reply.

Observed today (2026-04-21) on prod `dowhiz_production_little_bear`:

```
ERROR scheduler task 7161c400-... for user 8574d066-... failed:
  storage error: missing running execution row ...
```

Task `7161c400` ended up `enabled=false` for an inbound email to `erics_toy@icloud.com`.
Two more occurrences (`d04f410f`, `eaa96484`) landed at 19:02 UTC in the same sweep.

This change treats `matched_count == 0` as best-effort: warn-log the condition
and return `Ok(())`. The scheduler's intent (finalize the execution) is already
satisfied by whichever path got there first, so surfacing a hard error here
misrepresents a successful (or reconciled) run as a failure.

Refs #1459.

bingran-you added the breeze:done label (Breeze finished handling this item) and removed the breeze:wip label (Breeze is actively working on this item) on Apr 21, 2026
@bingran-you
Contributor Author

Follow-up landed in bf2471a5 to narrow the race handling in `record_execution_finish`:

  • if the execution row still exists but is already terminal, we warn and treat the finish as best-effort (sketched below)
  • if the execution row is actually missing, we still return `SchedulerError::Storage` instead of swallowing the mismatch
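A sketch of that narrowed check, continuing the hypothetical function above; the real code in `mongo.rs` may query or branch differently:

```rust
if result.matched_count == 0 {
    // Distinguish "row already terminal" from "row missing entirely".
    let existing = executions
        .find_one(doc! { "execution_id": execution_id }, None)
        .await
        .map_err(|e| SchedulerError::Storage(e.to_string()))?;

    return match existing {
        // The row exists but another path already finalized it:
        // warn and treat the finish as best-effort.
        Some(_) => {
            tracing::warn!(execution_id, "execution row already terminal");
            Ok(())
        }
        // The row is genuinely missing: a real storage inconsistency,
        // so keep surfacing the hard error.
        None => Err(SchedulerError::Storage(
            "missing execution row".into(),
        )),
    };
}
```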

Added Mongo-backed regressions for both cases and reran:

  • MONGODB_URI=mongodb://localhost:27017 MONGODB_DATABASE=codex_pr1516_regression cargo test -p scheduler_module --lib record_execution_finish_ -- --nocapture
  • MONGODB_URI=mongodb://localhost:27017 MONGODB_DATABASE=codex_pr1516_scheduler_basic cargo test -p scheduler_module --test scheduler_basic

Both passed locally.
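For illustration, a hedged sketch of what such regressions could look like; `test_mongo_store`, `record_execution_start`, and `force_execution_status` are hypothetical helpers, not the crate's actual test API:

```rust
#[tokio::test]
async fn record_execution_finish_tolerates_terminal_row() {
    let store = test_mongo_store().await; // assumed test fixture
    let exec_id = store.record_execution_start("task-1").await.unwrap();

    // Simulate the reconciler winning the race: the row is flipped to a
    // terminal status before the worker's finish call lands.
    store.force_execution_status(&exec_id, "failed").await.unwrap();

    // The finish call should warn and succeed rather than error.
    store
        .record_execution_finish(&exec_id, "succeeded")
        .await
        .expect("terminal row should be tolerated");
}

#[tokio::test]
async fn record_execution_finish_errors_on_missing_row() {
    let store = test_mongo_store().await;

    // No row exists for this id, so the mismatch is a genuine storage
    // inconsistency and must still surface as an error.
    let result = store
        .record_execution_finish("no-such-execution", "succeeded")
        .await;
    assert!(result.is_err());
}
```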

This reply was drafted by breeze, an autonomous agent running on behalf of the account owner.

