
scheduler: enrich TASK_FAILURE_ALERT with channel + requester (#1541) #1542

Open
bingran-you wants to merge 2 commits into dev from bry/notify-task-failure-enrich


@bingran-you
Contributor

Summary

Partial fix for #1541. When the watchdog disables a task after MAX_TASK_RETRIES=3, the existing log line and local .txt notification only include task_id / user_id / retry_count — not enough for ops to know which channel or which human was ghosted.

This PR adds a lookup of the disabled task from the scheduler's task list and includes channel, requester, reply_to, and task kind in both:

  • the local _notifications/task_failure_*.txt file, and
  • the TASK_FAILURE_ALERT error log line (so log-based alerting can fan out per channel and the on-call can copy the user address straight from the log).
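For readers without the diff open, the lookup-and-enrich step could look roughly like the sketch below. The `Task` and `FailureContext` types, the function names, the `"unknown"` fallback, and the exact `key=value` layout are all assumptions made for illustration. Only the `ctx.requester.clone().unwrap_or_else(...)` / `ctx.reply_to.clone().unwrap_or_else(...)` shape is taken from this PR's discussion.

```rust
// Hypothetical sketch: type names, fields, and log layout are assumptions,
// not the actual code in scheduler.rs.

/// Minimal stand-in for an entry in the scheduler's task list.
#[derive(Debug, Clone)]
pub struct Task {
    pub id: String,
    pub channel: String,
    pub kind: String,
    pub requester: Option<String>,
    pub reply_to: Option<String>,
}

/// The extra fields the alert needs beyond task_id / user_id / retry_count.
#[derive(Debug, Clone, PartialEq)]
pub struct FailureContext {
    pub channel: String,
    pub kind: String,
    pub requester: Option<String>,
    pub reply_to: Option<String>,
}

/// Look up the disabled task by id; returns None if it is no longer listed.
pub fn lookup_failure_context(tasks: &[Task], task_id: &str) -> Option<FailureContext> {
    tasks.iter().find(|t| t.id == task_id).map(|t| FailureContext {
        channel: t.channel.clone(),
        kind: t.kind.clone(),
        requester: t.requester.clone(),
        reply_to: t.reply_to.clone(),
    })
}

/// Build the enriched alert line, falling back to "unknown" for missing data.
pub fn format_failure_alert(
    task_id: &str,
    user_id: &str,
    retry_count: u32,
    ctx: Option<&FailureContext>,
) -> String {
    fn unknown() -> String {
        "unknown".to_string()
    }
    let (channel, kind, requester, reply_to) = match ctx {
        Some(ctx) => (
            ctx.channel.clone(),
            ctx.kind.clone(),
            ctx.requester.clone().unwrap_or_else(unknown),
            ctx.reply_to.clone().unwrap_or_else(unknown),
        ),
        None => (unknown(), unknown(), unknown(), unknown()),
    };
    format!(
        "TASK_FAILURE_ALERT task_id={task_id} user_id={user_id} retry_count={retry_count} \
         channel={channel} kind={kind} requester={requester} reply_to={reply_to}"
    )
}
```

Falling back to a placeholder rather than skipping the alert keeps the log line emitted even when the task can no longer be found, which matters more for ops than a perfectly populated record.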

Does not yet enqueue an outbound apology SendReply — that's the larger follow-up in #1541 because it needs per-channel loop guards (don't trigger another failure that triggers another notification, etc.). The enriched log is the minimum viable step that unblocks manual ops follow-up today.
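To make the loop-guard concern concrete, here is one possible shape for such a guard. This is not part of this PR and is not the design agreed in #1541. `ApologyGuard`, `should_send`, and the `is_apology_task` flag are hypothetical names used only to illustrate the idea of "at most one apology per channel and task, and never an apology for a failed apology".

```rust
use std::collections::HashSet;

/// Hypothetical guard for the follow-up work; not implemented in this PR.
#[derive(Debug, Default)]
pub struct ApologyGuard {
    /// (channel, task_id) pairs that have already been sent an apology.
    sent: HashSet<(String, String)>,
}

impl ApologyGuard {
    /// Returns true at most once per (channel, task_id), and never for a
    /// task that is itself an apology, so a failing apology cannot trigger
    /// another notification.
    pub fn should_send(&mut self, channel: &str, task_id: &str, is_apology_task: bool) -> bool {
        if is_apology_task {
            return false;
        }
        // HashSet::insert returns false when the pair was already present.
        self.sent.insert((channel.to_string(), task_id.to_string()))
    }
}
```

An in-memory set like this would not survive a scheduler restart, so a real guard would presumably need persistent state. That is part of why the apology path is deferred.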

Test plan

  • cargo build -p scheduler_module clean
  • cargo test -p scheduler_module --lib service::scheduler — 11 passed, 0 failed
  • Smoke: force a stale-task disable on staging and verify TASK_FAILURE_ALERT log line includes channel=... requester=... reply_to=...
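The smoke check could be scripted with a small helper along these lines. It assumes the alert is logged as whitespace-separated `key=value` tokens, which is not confirmed here, and `missing_alert_fields` is a hypothetical name rather than something in the scheduler crate.

```rust
/// Hypothetical helper: lists the required keys that are absent from a
/// TASK_FAILURE_ALERT log line. Assumes whitespace-separated key=value tokens.
pub fn missing_alert_fields(line: &str) -> Vec<&'static str> {
    const REQUIRED: [&str; 3] = ["channel=", "requester=", "reply_to="];
    // A line that is not an alert at all is treated as missing everything.
    if !line.contains("TASK_FAILURE_ALERT") {
        return REQUIRED.to_vec();
    }
    REQUIRED
        .iter()
        .copied()
        .filter(|key| !line.split_whitespace().any(|tok| tok.starts_with(*key)))
        .collect()
}
```

An empty result means the staging log line carries all three enriched fields.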

Context

Discovered during the 2026-04-22 scheduled scan of prod + staging. Prod little_bear has 5 tasks with retry_count > 0 and enabled=false: real users (listed in #1541) who never got a reply. With the old log line, there was no way to tell from a log grep alone that u3597436@connect.hku.hk was one of them.
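The scan criterion above amounts to a simple filter. The sketch below uses a hypothetical `TaskRow` type and does not reflect how the scan was actually run.

```rust
/// Hypothetical row shape; only the two fields named in the scan are modeled.
#[derive(Debug, Clone)]
pub struct TaskRow {
    pub id: String,
    pub retry_count: u32,
    pub enabled: bool,
}

/// Tasks that failed at least once and were then disabled by the watchdog.
pub fn stranded_tasks(rows: &[TaskRow]) -> Vec<&TaskRow> {
    rows.iter()
        .filter(|r| r.retry_count > 0 && !r.enabled)
        .collect()
}
```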

When the watchdog disables a task after MAX_TASK_RETRIES=3, the only
"notification" was a local .txt file + a log line with just task_id /
user_id / retry count. Ops had no way to page out by channel or to
identify which requester still needs a manual follow-up.

This change loads the disabled task from the scheduler's task list and
adds channel, requester, reply_to, and task kind to both the local
notification file and the TASK_FAILURE_ALERT error log, so:

- log-based alerts can fan out per channel,
- on-call can copy the requester address directly from logs,
- post-incident manual remediation has an audit trail.

Does not yet enqueue an outbound apology SendReply — that's the larger
follow-up in #1541 (needs per-channel loop guards).

Fixes cargo fmt --check CI failure from #1542.
@bingran-you
Contributor Author

CI was red on cargo fmt --all --check — rustfmt wanted the ctx.requester.clone().unwrap_or_else(...) / ctx.reply_to.clone().unwrap_or_else(...) tuple entries in scheduler.rs:719 broken across multiple lines. Pushed a68f288 which applies cargo fmt -p scheduler_module — whitespace only, no behavior change. Should turn the rust check green on re-run. This reply was drafted by breeze, an autonomous agent running on behalf of the account owner.

@bingran-you added the breeze:done label and removed the breeze:wip label on Apr 22, 2026