Skip to content

fix(debug_archive): surface actual upload error, collapse per-target spam#1492

Open
bingran-you wants to merge 1 commit intodevfrom
bry/fix-task-debug-archive-log-1488
Open

fix(debug_archive): surface actual upload error, collapse per-target spam#1492
bingran-you wants to merge 1 commit intodevfrom
bry/fix-task-debug-archive-log-1488

Conversation

@bingran-you
Copy link
Copy Markdown
Contributor

Summary

  • Fixes the log spam on dw_worker where every task_debug_archive upload emitted a generic [task_debug_archive] upload attempt failed ... line for each attempted target (observed 6818+ entries in ~5h on production), while the root cause SchedulerError was discarded by Err(_).
  • Collect each target's error into a single deferred vector and emit one consolidated line only when all targets fail, including the real error message per attempt. When any target succeeds, the log stays quiet.

Closes #1488

Behaviour before

[task_debug_archive] upload attempt failed backend=azure_blob account=dowhizingest2602190038 container=task-debug-archives path=...
[task_debug_archive] upload attempt failed backend=azure_blob account=dowhizingest2602190038 container=task-debug-archives path=...
(repeated for each archive, even though fallback target uploaded successfully)

Behaviour after

  • Success on any target → silent (as before, but the earlier per-target noise disappears).
  • All targets fail → one line:
    [task_debug_archive] all upload targets failed for path=<blob> attempts=[backend=... error=... | backend=... error=...]
    

Test plan

  • cargo test -p scheduler_module --lib scheduler::debug_archive — 4/4 pass
  • cargo check -p scheduler_module
  • After merge/deploy: confirm on dowhizprod1 that task_debug_archive log noise drops to ~0 while Mongo task_debug_archives.status="uploaded" counts stay steady.

…line

The primary task_debug_archive upload target on production fails on every
task (observed 6818+ entries over ~5 hours), but a fallback target
succeeds so the archive still lands. The per-target Err(_) arm dropped
the underlying SchedulerError and printed a generic "attempt failed"
line for each try, hiding the root cause (likely SAS token permissions)
and spamming stderr.

Collect attempt errors and only print a single line after all targets
fail, including the actual SchedulerError message for each attempt.

Refs #1488
@vercel
Copy link
Copy Markdown

vercel Bot commented Apr 21, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
dowhiz Ready Ready Preview, Comment Apr 21, 2026 1:16am

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

breeze:done Breeze finished handling this item

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Prod worker: task_debug_archive spam (6818+ failed uploads / non-fatal but swallows root cause)

1 participant