scheduler: enrich TASK_FAILURE_ALERT with channel + requester (#1541) #1542
Open
bingran-you wants to merge 2 commits into dev
When the watchdog disables a task after MAX_TASK_RETRIES=3, the only "notification" was a local .txt file plus a log line with just task_id / user_id / retry count. Ops had no way to page out by channel or to identify which requester still needs a manual follow-up.

This change loads the disabled task from the scheduler's task list and adds channel, requester, reply_to, and task kind to both the local notification file and the TASK_FAILURE_ALERT error log, so:
- log-based alerts can fan out per channel,
- on-call can copy the requester address directly from logs,
- post-incident manual remediation has an audit trail.

Does not yet enqueue an outbound apology SendReply; that's the larger follow-up in #1541 (needs per-channel loop guards).
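As a rough illustration of the enrichment described above, the sketch below looks up a disabled task in a task list and renders a single key=value alert line. It is not the code from this PR: the `ScheduledTask` struct, the `failure_alert_line` function, the exact field layout, and the `unknown` fallback are all assumptions made for the example.

```rust
/// Hypothetical task shape. Field names follow the PR description,
/// not the real scheduler_module types.
#[derive(Debug, Clone)]
pub struct ScheduledTask {
    pub task_id: String,
    pub channel: String,
    pub requester: String,
    pub reply_to: Option<String>,
    pub kind: String,
}

/// Looks up the disabled task by id and builds an enriched alert line.
/// Falls back to "unknown" for every enriched field when the task is
/// no longer present in the list (an assumed behaviour).
pub fn failure_alert_line(
    tasks: &[ScheduledTask],
    task_id: &str,
    user_id: &str,
    retry_count: u32,
) -> String {
    let task = tasks.iter().find(|t| t.task_id == task_id);
    let channel = task.map_or("unknown", |t| t.channel.as_str());
    let requester = task.map_or("unknown", |t| t.requester.as_str());
    let reply_to = task
        .and_then(|t| t.reply_to.as_deref())
        .unwrap_or("unknown");
    let kind = task.map_or("unknown", |t| t.kind.as_str());

    format!(
        "TASK_FAILURE_ALERT task_id={task_id} user_id={user_id} \
         retry_count={retry_count} channel={channel} \
         requester={requester} reply_to={reply_to} kind={kind}"
    )
}
```

Keeping every field as a plain key=value pair on one line is what would let a log-based alert rule match on `channel=` and let on-call copy the `requester=` value directly.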
Fixes cargo fmt --check CI failure from #1542.
CI was red on |
Summary
Partial fix for #1541. When the watchdog disables a task after MAX_TASK_RETRIES=3, the existing log line and local .txt notification only include task_id / user_id / retry_count, which is not enough for ops to know which channel or which human was ghosted.

This PR adds a lookup of the disabled task from the scheduler's task list and includes channel, requester, reply_to, and task kind in both:
- the local _notifications/task_failure_*.txt file, and
- the TASK_FAILURE_ALERT error log line (so log-based alerting can fan out per channel and the on-call can copy the user address straight from the log).

Does not yet enqueue an outbound apology SendReply; that's the larger follow-up in #1541 because it needs per-channel loop guards (don't trigger another failure that triggers another notification, etc.). The enriched log is the minimum viable step that unblocks manual ops follow-up today.

Test plan
- cargo build -p scheduler_module clean
- cargo test -p scheduler_module --lib service::scheduler: 11 passed, 0 failed
- TASK_FAILURE_ALERT log line includes channel=... requester=... reply_to=...

Context
Discovered during 2026-04-22 scheduled scan of prod + staging. Prod little_bear has 5 tasks with retry_count > 0 and enabled=false: real users (listed in #1541) who never got a reply. With the old log line there was no way to know from log grep alone that u3597436@connect.hku.hk was one of them.
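The condition used in the scan (retry_count > 0 and enabled=false) can be expressed as a small filter. This is only a sketch under assumptions: `TaskRecord` and `stranded_requesters` are hypothetical names, and the real scheduler's storage format is not shown in this PR.

```rust
/// Hypothetical, minimal view of a stored task for the purpose of the scan.
#[derive(Debug, Clone, PartialEq)]
pub struct TaskRecord {
    pub task_id: String,
    pub requester: String,
    pub retry_count: u32,
    pub enabled: bool,
}

/// Returns the requester of every task that failed at least once
/// and has since been disabled, i.e. users who may still be waiting
/// for a reply.
pub fn stranded_requesters(tasks: &[TaskRecord]) -> Vec<&str> {
    tasks
        .iter()
        .filter(|t| t.retry_count > 0 && !t.enabled)
        .map(|t| t.requester.as_str())
        .collect()
}
```

Tasks that have retried but are still enabled are deliberately excluded, since the watchdog has not given up on them yet.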