
fix(scheduler): tolerate already-finalized rows in record_execution_finish (#1459) #1516

Open
bingran-you wants to merge 2 commits into dev from bry/fix-mongo-execution-finish-tolerant

Conversation

@bingran-you
Contributor

Summary

`record_execution_finish` required `status: "running"` in its Mongo filter. When the stale-execution reconciler or the `dowhiz-service-debug` audit flips a row to failed/superseded before the worker's finish call lands, `matched_count == 0` and the function returns `SchedulerError::Storage`. That error propagates into `after_execute_failed` and, for `OneShot` `RunTask` entries, eventually disables the task, even though the agent may have actually completed its work.
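For context, a minimal sketch of the pre-fix shape of the call, assuming the mongodb 2.x driver API; the field names (`execution_id`, `status`) and the error type are illustrative stand-ins for the real code in `mongo.rs`:

```rust
use mongodb::bson::{doc, Document};
use mongodb::Collection;

// Stand-in for the crate's real error type.
#[derive(Debug)]
pub enum SchedulerError {
    Storage(String),
}

pub async fn record_execution_finish(
    executions: &Collection<Document>,
    execution_id: &str,
    final_status: &str,
) -> Result<(), SchedulerError> {
    let result = executions
        .update_one(
            // Strict filter: only a row still marked "running" matches.
            doc! { "execution_id": execution_id, "status": "running" },
            doc! { "$set": { "status": final_status } },
            None,
        )
        .await
        .map_err(|e| SchedulerError::Storage(e.to_string()))?;

    if result.matched_count == 0 {
        // If the reconciler (or a manual audit) already finalized the row,
        // this branch fires even though the agent finished its work.
        return Err(SchedulerError::Storage(
            "missing running execution row".into(),
        ));
    }
    Ok(())
}
```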

Evidence

Today's production sweep (dowhiz_production_little_bear, dw_worker 118 restarts) surfaced three instances of this cascade — see #1459 audit comment:

```
2026-04-21T17:29:37Z ERROR scheduler task 7161c400-d45d-54a6-26c1-530fb5432815
for user 8574d066-... failed: storage error: missing running execution row ...
2026-04-21T19:02:23Z ERROR scheduler task d04f410f-1e2c-1d35-3208-0f84890327ab
for user 2aedb09a-... failed: storage error: missing running execution row ...
2026-04-21T19:02:51Z ERROR scheduler task eaa96484-8ccc-692c-2060-6ce150904025
for user 8574d066-... failed: storage error: missing running execution row ...
```

Task `7161c400` is now `enabled=false` in prod Mongo (user inbound to `erics_toy@icloud.com`) — the direct user-visible outcome of this cascade.

Change

  • DoWhiz_service/scheduler_module/src/scheduler/store/mongo.rs: when `record_execution_finish`'s update matches no row, warn-log and return `Ok(())` instead of erroring (see the sketch after this list); the scheduler's intent is already satisfied.
  • No change to reconcile_stale_running_executions_for_task's matched_count assertion — that path has different invariants.
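A sketch of the tolerant tail, replacing the error branch in the hypothetical function above; the `tracing::warn!` fields are illustrative:

```rust
// Post-fix: a zero match is logged, not fatal.
if result.matched_count == 0 {
    // Another path (stale-execution reconciler or manual audit) already
    // finalized this row; the scheduler's intent is satisfied.
    tracing::warn!(
        execution_id,
        "record_execution_finish matched no running row; treating as already finalized"
    );
}
Ok(())
```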

Complementary to PR #1499

#1499 reduces how long a row can stay orphaned (20h → 30min). This PR prevents a successful agent run from being mis-reported as a scheduler failure when that reconciler wins the race.

Test plan

  • cargo check -p scheduler_module --lib — clean (only pre-existing warnings).
  • Deploy to staging; reproduce by marking a running row failed mid-task, then confirm the worker logs the warn line and `tasks.enabled` is not flipped to false.
  • Production canary: watch for the missing running execution row error to disappear from dw-worker-error.log.

Refs #1459.

fix(scheduler): tolerate already-finalized rows

When the stale-execution reconciler (or the `dowhiz-service-debug` audit) finalizes
a `task_executions` row ahead of the worker, the strict `status: "running"` filter
in `record_execution_finish` yields `matched_count == 0` and the function returns
`SchedulerError::Storage`. That error cascades into `after_execute_failed`, which
counts as a retry and — for `OneShot` `RunTask` entries — eventually flips the
task to `enabled=false`. The agent may have actually finished successfully; the
user just never sees a reply.

Observed today (2026-04-21) on prod `dowhiz_production_little_bear`:

```
ERROR scheduler task 7161c400-... for user 8574d066-... failed:
  storage error: missing running execution row ...
```

Task `7161c400` ended up `enabled=false` for an inbound email to `erics_toy@icloud.com`.
Two more occurrences (`d04f410f`, `eaa96484`) landed at 19:02 UTC in the same sweep.

This change treats `matched_count == 0` as best-effort: warn-log the condition
and return `Ok(())`. The scheduler's intent (finalize the execution) is already
satisfied by whichever path got there first, so surfacing a hard error here
misrepresents a successful (or reconciled) run as a failure.

Refs #1459.

bingran-you added the breeze:done label (Breeze finished handling this item) and removed the breeze:wip label (Breeze is actively working on this item) on Apr 21, 2026
@bingran-you
Contributor Author

Follow-up landed in bf2471a5 to narrow the race handling in `record_execution_finish`:

  • if the execution row still exists but is already terminal, we warn and treat the finish as best-effort (sketched below)
  • if the execution row is actually missing, we still return `SchedulerError::Storage` instead of swallowing the mismatch
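A sketch of that narrowed check, continuing the hypothetical function above; the real code in `mongo.rs` may query or branch differently:

```rust
if result.matched_count == 0 {
    // Distinguish "row already terminal" from "row missing entirely".
    let existing = executions
        .find_one(doc! { "execution_id": execution_id }, None)
        .await
        .map_err(|e| SchedulerError::Storage(e.to_string()))?;

    return match existing {
        // The row exists but another path already finalized it:
        // warn and treat the finish as best-effort.
        Some(_) => {
            tracing::warn!(execution_id, "execution row already terminal");
            Ok(())
        }
        // The row is genuinely missing: a real storage inconsistency,
        // so keep surfacing the hard error.
        None => Err(SchedulerError::Storage(
            "missing execution row".into(),
        )),
    };
}
```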

Added Mongo-backed regressions for both cases and reran:

  • MONGODB_URI=mongodb://localhost:27017 MONGODB_DATABASE=codex_pr1516_regression cargo test -p scheduler_module --lib record_execution_finish_ -- --nocapture
  • MONGODB_URI=mongodb://localhost:27017 MONGODB_DATABASE=codex_pr1516_scheduler_basic cargo test -p scheduler_module --test scheduler_basic

Both passed locally.
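For illustration, a hedged sketch of what such regressions could look like; `test_mongo_store`, `record_execution_start`, and `force_execution_status` are hypothetical helpers, not the crate's actual test API:

```rust
#[tokio::test]
async fn record_execution_finish_tolerates_terminal_row() {
    let store = test_mongo_store().await; // assumed test fixture
    let exec_id = store.record_execution_start("task-1").await.unwrap();

    // Simulate the reconciler winning the race: the row is flipped to a
    // terminal status before the worker's finish call lands.
    store.force_execution_status(&exec_id, "failed").await.unwrap();

    // The finish call should warn and succeed rather than error.
    store
        .record_execution_finish(&exec_id, "succeeded")
        .await
        .expect("terminal row should be tolerated");
}

#[tokio::test]
async fn record_execution_finish_errors_on_missing_row() {
    let store = test_mongo_store().await;

    // No row exists for this id, so the mismatch is a genuine storage
    // inconsistency and must still surface as an error.
    let result = store
        .record_execution_finish("no-such-execution", "succeeded")
        .await;
    assert!(result.is_err());
}
```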

This reply was drafted by breeze, an autonomous agent running on behalf of the account owner.

