Raise ACI delete timeout from 20s to 120s (fixes #1543) #1544

Open
bingran-you wants to merge 2 commits into dev from bry/aci-delete-timeout-fix

@bingran-you
Contributor

Summary

  • Align the inline ACI cleanup timeout in run_codex_task_azure_aci with the retry helper (delete_aci_container uses 120s). The 20s value was tripping on nearly every task completion on prod, producing noisy delete-request failed lines and leaving orphan containers for the periodic reconciler.

Evidence

Prod dowhizprod1 worker stderr over ~7h uptime contained 11 occurrences of:

[run_task] azure_aci delete-request failed container=dwz-codex-... error=Command timed out (az container delete after 20s).

Fixes #1543. Related: #1537, #1539.

Test plan

  • cargo build -p run_task_module locally
  • Deploy to staging, verify no delete-request failed lines for at least 1h of task activity
  • Promote to prod once staging is clean

Azure CLI's `az container delete` routinely takes 25-60s in westus2,
especially when the CLI process is cold. The 20s hot-path timeout was
tripping on ~every completed run_task execution on prod, producing
misleading "delete-request failed" log entries and leaving containers
for the periodic reconciler / pool_manager to clean up.

The retry helper `delete_aci_container` already uses 120s for the same
operation; align the hot path with that value so cleanup actually
succeeds inline.
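
One way to keep the hot path and the retry helper from drifting apart again is to have both read the same constant. The sketch below is illustrative only and is not the code in this PR: `ACI_DELETE_TIMEOUT`, `WaitOutcome`, `wait_with_timeout`, and `delete_command` are hypothetical names, and the exact `az` invocation used by `run_task_module` is not shown in this description, so the arguments here are an assumption.

```rust
use std::io;
use std::process::{Child, Command, ExitStatus};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical shared constant: a single timeout for both the inline
/// cleanup and the retry helper, so the two values cannot diverge.
pub const ACI_DELETE_TIMEOUT: Duration = Duration::from_secs(120);

/// Outcome of waiting on a child process with a deadline.
#[derive(Debug)]
pub enum WaitOutcome {
    Exited(ExitStatus),
    TimedOut,
}

/// Poll `child` until it exits or `timeout` elapses.
/// On timeout the child is killed and reaped so it does not linger.
pub fn wait_with_timeout(child: &mut Child, timeout: Duration) -> io::Result<WaitOutcome> {
    let deadline = Instant::now() + timeout;
    loop {
        if let Some(status) = child.try_wait()? {
            return Ok(WaitOutcome::Exited(status));
        }
        if Instant::now() >= deadline {
            child.kill()?;
            child.wait()?;
            return Ok(WaitOutcome::TimedOut);
        }
        thread::sleep(Duration::from_millis(50));
    }
}

/// Assumed shape of the delete call; the real arguments may differ.
pub fn delete_command(resource_group: &str, container: &str) -> Command {
    let mut cmd = Command::new("az");
    cmd.args([
        "container",
        "delete",
        "--resource-group",
        resource_group,
        "--name",
        container,
        "--yes",
    ]);
    cmd
}
```

Under these assumptions, the inline cleanup would spawn `delete_command(...)` and pass the child to `wait_with_timeout(&mut child, ACI_DELETE_TIMEOUT)`, logging a failure only on `WaitOutcome::TimedOut` or a non-zero exit status.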

Evidence: `pm2 logs dw_worker --err --nostream | grep -c 'delete-request failed.*20s'`
on prod returned 11 hits over ~7h uptime, all from task executions that
otherwise succeeded.
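
For the staging check in the test plan, the same count could be computed in Rust rather than with grep. This is a minimal sketch with a hypothetical helper name (`count_delete_timeouts`); it mirrors the pattern `delete-request failed.*20s` with plain substring matching. Like the grep pattern, it would also match a line that ends in `120s`, so a non-zero count after this change would need a closer look at the reported duration.

```rust
/// Count log lines where "delete-request failed" is followed, later on the
/// same line, by "20s" (equivalent to grep -c 'delete-request failed.*20s').
pub fn count_delete_timeouts(log: &str) -> usize {
    const MARKER: &str = "delete-request failed";
    log.lines()
        .filter(|line| {
            line.find(MARKER)
                .map_or(false, |i| line[i + MARKER.len()..].contains("20s"))
        })
        .count()
}
```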
@vercel

vercel Bot commented Apr 22, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| dowhiz | Ready | Preview, Comment | Apr 22, 2026 2:16pm |

@bingran-you
Contributor Author

Blocker: merge conflict — prod still wedged on 214cfd9 without this PR

Per the devops-scan follow-up comment on #1559, the prod scheduler has been deferring ~35 due tasks/second since 01:08 UTC with reason=thread busy, and the only fix for the orphan-claim hot-loop (c1792f8 on this branch) is gated behind this PR.

gh pr view 1544 --json mergeStateStatus reports DIRTY / CONFLICTING. CI checks are all green (rust, website, Vercel). The blocker is the merge conflict against dev (likely from PR #1548 landing earlier).

Requesting: rebase this branch on latest dev, then merge. If the ACI-delete-timeout change is separable from the orphan-claim reclaim fix, I can cut a narrower PR with just c1792f8 so the reliability win doesn't wait on the timeout review — let me know if that would help.

Labels

breeze:done Breeze finished handling this item

Development

Successfully merging this pull request may close these issues.

ACI delete-request timeout too short (20s) causing cleanup failures in hot path