fix(run_task): use ACI delete retry helper for cleanup (#1503) #1504

Open
bingran-you wants to merge 1 commit into dev from bry/fix-aci-delete-timeout-1503

Conversation

@bingran-you
Contributor

Summary

  • Switch the post-execution ACI cleanup path (codex.rs:1374) from a single-shot 20s az container delete to the existing delete_aci_container_with_retry helper (120s × 3 attempts, treats not-found as success).
  • Unifies the two cleanup paths (post-execution + warm-pool) so they share the same retry/timeout behavior.
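For reviewers who have not opened the helper, here is a minimal sketch of the retry behavior described above (120s per attempt, 3 attempts, not-found treated as success). It is not the code in codex.rs: `DeleteAttempt`, `RetryPolicy`, and `delete_with_retry` are hypothetical names, and the actual `az container delete` invocation is abstracted behind a closure so the control flow can be exercised without Azure.

```rust
use std::time::Duration;

/// Outcome of a single delete attempt (hypothetical type for this sketch).
#[derive(Debug, Clone, PartialEq)]
pub enum DeleteAttempt {
    Deleted,
    NotFound,
    TimedOut,
    Failed(String),
}

/// Retry policy; the defaults mirror the numbers quoted in this PR.
pub struct RetryPolicy {
    pub per_attempt_timeout: Duration,
    pub max_attempts: u32,
}

impl Default for RetryPolicy {
    fn default() -> Self {
        RetryPolicy {
            per_attempt_timeout: Duration::from_secs(120),
            max_attempts: 3,
        }
    }
}

/// Calls `attempt` up to `max_attempts` times, passing the per-attempt timeout.
/// Returns the 1-based number of the attempt that succeeded, or the last error.
/// `NotFound` counts as success: the container is already gone.
pub fn delete_with_retry<F>(policy: &RetryPolicy, mut attempt: F) -> Result<u32, String>
where
    F: FnMut(Duration) -> DeleteAttempt,
{
    let mut last_error = String::from("no attempts made");
    for n in 1..=policy.max_attempts {
        match attempt(policy.per_attempt_timeout) {
            DeleteAttempt::Deleted | DeleteAttempt::NotFound => return Ok(n),
            DeleteAttempt::TimedOut => {
                last_error = format!(
                    "attempt {} timed out after {:?}",
                    n, policy.per_attempt_timeout
                );
            }
            DeleteAttempt::Failed(msg) => {
                last_error = format!("attempt {} failed: {}", n, msg);
            }
        }
    }
    Err(last_error)
}
```

In a real call site the closure would run the delete command under the given timeout and map its result onto `DeleteAttempt`. With these defaults the worst case is 3 × 120s = 360s of waiting before cleanup is reported as failed, compared with the single 20s window used previously.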

Closes #1503.

Why

Production was logging ~16 `delete-request failed ... Command timed out (az container delete after 20s)` errors per day, leaving 6 ghost `Succeeded` containers in `rg-dowhiz-oliver-dev`. A 20s timeout is too tight for Azure ACI delete, which typically takes 30–90s under normal load.

Test plan

  • cargo check -p run_task_module (passes locally).
  • Deploy to staging via the standard CICD-staging workflow on dev after merge.
  • After 24h: pm2 logs dw_worker | grep 'delete-request failed' should be ~0.
  • az container list --resource-group dowhiz-staging-rg-260226124234 -o table | grep dwz-codex should not grow monotonically.

…#1503)

The post-execution cleanup at codex.rs:1374 was using a single-shot 20s
timeout against `az container delete`. Azure ACI delete regularly takes
30–90s under normal load, so the cleanup silently failed and containers
piled up in the resource group as `Succeeded` ghosts (16 failures/day on
prod, 6 leaked containers in `rg-dowhiz-oliver-dev` at time of fix).

The same module already exposes `delete_aci_container_with_retry`
(120s × 3 attempts, treats not-found as success), used by the warm-pool
cleanup path. Switch this call site to the retry helper so both cleanup
paths share the same battle-tested timeout/retry behavior.
@vercel

vercel Bot commented Apr 21, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
dowhiz Ready Ready Preview, Comment Apr 21, 2026 0:29am

@bingran-you
Contributor Author

This keeps compounding. Today's 15:15 UTC dowhiz-service-debug sweep again observed azure_aci delete-request failed ... Command timed out (az container delete after 20s) repeatedly on prod worker (uptime 5h, 118 restarts). Stuck/orphan ACI containers are likely also contributing to the scheduler churn addressed by #1499.

Low-risk, small diff — please prioritize review.

@bingran-you added the breeze:wip and breeze:done labels and removed the breeze:wip label on Apr 21, 2026

Labels

breeze:done Breeze finished handling this item

Development

Successfully merging this pull request may close these issues.

[P2] Prod ACI delete-request timeout 20s too short — containers pile up in rg-dowhiz-oliver-dev