ops: rotate pm2 + prune azcopy logs so root disk doesn't fill (#1537) #1538

Open
bingran-you wants to merge 1 commit into dev from bry/vm-log-retention-cleanup
@bingran-you
Contributor

Summary

Fixes #1537. The production VM's root partition hit 98% earlier today. Every az invocation then failed with OSError: [Errno 28] (no space left on device) while writing ~/.azure/az.sess, which cascaded into ACI codex execution failures (generate-sas failed, storage share delete failed, execution failed for container=...). The primary space consumer was ~/.azcopy/ at 9.6 GB (1,320 scan logs plus steV20 plan files) with no retention; the secondary was a 2.3 GB orphaned pm2 log from a renamed service.

Changes

  • DoWhiz_service/scripts/cleanup_vm_logs.sh — new idempotent script that prunes ~/.azcopy/*.log, ~/.azcopy/plans/*.steV20, and legacy ~/.pm2/logs/dowhiz-*.log files older than 48 h (configurable via OLDER_THAN_MIN). Prints per-category size reclaimed and ending df -h /.
  • .github/workflows/CICD-production.yml + CICD-staging.yml — after pm2 service restart, install pm2-logrotate (max_size 200M, retain 5, compress, daily rotate) and register a daily 3 am cron calling the new script. Both blocks are idempotent so re-running the deploy is safe.
  • DoWhiz_service/OPERATIONS.md — new section 4.3 VM log retention documenting the bootstrap for fresh VMs.
  • DoWhiz_service/scripts/status.sh — show root disk usage and ~/.azcopy size with thresholds (⚠️ at 80%, ❌ at 90%) so operators notice before the next incident.
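
As a rough illustration of the age-based prune described in the first bullet, here is a minimal sketch. It is not the contents of cleanup_vm_logs.sh: the function name prune_category, its output format, and the 2880-minute default (48 h expressed in minutes) are assumptions made for this example. Only OLDER_THAN_MIN, the three path patterns, and the "nothing to prune" message come from the description above.

```shell
# Hypothetical sketch, not the contents of cleanup_vm_logs.sh.
# prune_category LABEL DIR GLOB
# Deletes regular files directly under DIR that match GLOB and were last
# modified more than OLDER_THAN_MIN minutes ago (assumed default: 2880 = 48 h).
# The body runs in a subshell, so its variables stay local to the call.
prune_category() (
  label=$1 dir=$2 glob=$3
  min=${OLDER_THAN_MIN:-2880}
  count=0
  if [ -d "$dir" ]; then
    count=$(find "$dir" -maxdepth 1 -type f -name "$glob" -mmin "+$min" | wc -l | tr -d ' ')
  fi
  if [ "$count" -eq 0 ]; then
    # Nothing old enough: the call is a no-op, so re-running is safe.
    echo "$label: nothing to prune"
  else
    before=$(du -sk "$dir" | cut -f1)
    find "$dir" -maxdepth 1 -type f -name "$glob" -mmin "+$min" -delete
    after=$(du -sk "$dir" | cut -f1)
    echo "$label: pruned $count file(s), reclaimed $((before - after)) KB"
  fi
)

# Example calls, using the paths named in the PR description:
#   prune_category azcopy-logs  "$HOME/.azcopy"       '*.log'
#   prune_category azcopy-plans "$HOME/.azcopy/plans" '*.steV20'
#   prune_category pm2-legacy   "$HOME/.pm2/logs"     'dowhiz-*.log'
#   df -h /
```

Because files are selected purely by modification age, a second run right after the first finds nothing left to delete, which is what makes this kind of prune idempotent.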

Mitigation already applied on VMs (out-of-band)

While filing the issue I also applied the following by hand on both prod and staging, so both VMs are recovered regardless of merge time:

  1. pm2 flush + removed orphaned ~/.pm2/logs/dowhiz-* files.
  2. find ~/.azcopy -maxdepth 1 -name '*.log' -mmin +60 -delete, plus the same for plans/*.steV20 (a one-off 60-minute cutoff, tighter than the script's 48 h default) → freed ~9.6 GB on prod.
  3. Installed pm2-logrotate with the settings above.
  4. Added the daily cleanup cron via crontab -.
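
Step 4 can be made safe to repeat by filtering out any earlier copy of the entry before appending it. The sketch below shows one way to do that. The helper name merge_cron_entry and the /path/to/... script location are placeholders invented for this example, and the workflow in this PR may register the cron differently.

```shell
# Hypothetical sketch of an idempotent cron registration.
# merge_cron_entry MARKER NEW_LINE
# Reads an existing crontab on stdin and writes the merged crontab on stdout:
# every line containing MARKER is dropped, then NEW_LINE is appended once.
merge_cron_entry() {
  # grep -v exits non-zero when it prints nothing (empty crontab, or every
  # line matched MARKER); "|| true" keeps that from looking like a failure.
  grep -vF -- "$1" || true
  printf '%s\n' "$2"
}

# Daily at 03:00. The script path below is a placeholder, not the real one.
CRON_LINE='0 3 * * * bash /path/to/scripts/cleanup_vm_logs.sh'

# Intended use (commented out so the sketch does not touch a real crontab):
#   crontab -l 2>/dev/null \
#     | merge_cron_entry cleanup_vm_logs.sh "$CRON_LINE" \
#     | crontab -
```

Running the pipeline twice leaves exactly one cleanup_vm_logs.sh entry and keeps unrelated crontab lines untouched.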

Prod is now at 83% root usage (11 GB free) with az working and the scheduler executing tasks normally.

Test plan

  • CI green (YAML already validated locally: python3 -c "import yaml; yaml.safe_load(...)").
  • Manual no-op run on staging after merge: OLDER_THAN_MIN=99999999 bash scripts/cleanup_vm_logs.sh (prints "nothing to prune" for all categories, confirming it is safe to run when nothing is old enough to delete).
  • After next CI/CD deploy to staging, verify crontab -l | grep cleanup_vm_logs.sh and pm2 list | grep pm2-logrotate show the new entries.
  • 24 h later, confirm ~/.azcopy on staging has not grown past a few hundred MB.

Related

Prod VM root partition hit 98% today, blocking every `az` invocation
(OSError: [Errno 28] writing ~/.azure/az.sess) and cascading into failed
ACI codex executions. Root cause was ~/.azcopy growing to 9.6 GB of
per-job scan logs and steV20 plan files with no retention, plus orphan
pm2 logs from renamed services. See issue #1537 for the full trace.

- Add scripts/cleanup_vm_logs.sh: idempotent prune of ~/.azcopy/*.log,
  ~/.azcopy/plans/*.steV20, and legacy dowhiz-* pm2 log files older than
  48h.
- CICD (prod + staging): install pm2-logrotate on every deploy
  (max_size=200M, retain=5, compress, daily rotate) and register a daily
  3am cron calling cleanup_vm_logs.sh. Both steps are idempotent.
- OPERATIONS.md: document the bootstrap in section 4.3 so fresh VMs get
  the same retention.
- status.sh: surface root-disk usage and ~/.azcopy size so operators
  notice before the next 98% event.
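
The disk check added to status.sh could look roughly like the sketch below, using the 80% and 90% thresholds named in the Changes list. The function names and the OK/WARN/CRIT labels are assumptions for illustration (the PR describes ⚠️ and ❌ markers), as is treating "at 80%" as "greater than or equal to 80".

```shell
# Hypothetical sketch, not the actual status.sh change.
# disk_status PERCENT
# Prints CRIT at or above 90, WARN at or above 80, otherwise OK.
# (The PR describes these levels as the ❌ and ⚠️ markers.)
disk_status() {
  if [ "$1" -ge 90 ]; then
    echo "CRIT"
  elif [ "$1" -ge 80 ]; then
    echo "WARN"
  else
    echo "OK"
  fi
}

# root_usage_percent
# df -P forces the POSIX single-line format; field 5 is the use percentage.
root_usage_percent() {
  df -P / | awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}

pct=$(root_usage_percent)
if [ -n "$pct" ]; then
  echo "root disk: ${pct}% $(disk_status "$pct")"
fi
```

Under these assumptions, the 83% usage reported after mitigation would still show as WARN, and the 98% incident level as CRIT.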
@vercel

vercel Bot commented Apr 22, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project | Deployment | Actions | Updated (UTC)
dowhiz | Ready | Preview, Comment | Apr 22, 2026 11:04am

Labels

breeze:done Breeze finished handling this item

Development

Successfully merging this pull request may close these issues.

[P1] prod: root disk filled by ~/.azcopy logs (9.6GB) → Azure CLI fails → ACI task execution errors