ops: rotate pm2 + prune azcopy logs so root disk doesn't fill (#1537) #1538
Open
bingran-you wants to merge 1 commit into dev
Prod VM root partition hit 98% today, blocking every `az` invocation (OSError: [Errno 28] writing ~/.azure/az.sess) and cascading into failed ACI codex executions. Root cause was ~/.azcopy growing to 9.6 GB of per-job scan logs and steV20 plan files with no retention, plus orphan pm2 logs from renamed services. See issue #1537 for the full trace.

- Add scripts/cleanup_vm_logs.sh: idempotent prune of ~/.azcopy/*.log, ~/.azcopy/plans/*.steV20, and legacy dowhiz-* pm2 log files older than 48h.
- CICD (prod + staging): install pm2-logrotate on every deploy (max_size=200M, retain=5, compress, daily rotate) and register a daily 3am cron calling cleanup_vm_logs.sh. Both steps are idempotent.
- OPERATIONS.md: document the bootstrap in section 4.3 so fresh VMs get the same retention.
- status.sh: surface root-disk usage and ~/.azcopy size so operators notice before the next 98% event.
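For reviewers who want the shape of the prune without opening the diff, here is a minimal sketch of an age-based, idempotent cleanup. It is not the `cleanup_vm_logs.sh` from this PR: the helper name `prune_category` and the output wording are assumptions, it reports file counts rather than bytes reclaimed, and the 2880-minute default is just 48 h expressed in minutes.

```shell
#!/usr/bin/env bash
# Sketch only: NOT the cleanup_vm_logs.sh added by this PR.
set -eu

# Age threshold in minutes; 2880 min = 48 h. Override via the environment.
OLDER_THAN_MIN="${OLDER_THAN_MIN:-2880}"

# prune_category LABEL DIR PATTERN   (hypothetical helper)
# Deletes regular files directly inside DIR whose names match PATTERN and
# whose mtime is more than OLDER_THAN_MIN minutes in the past.
prune_category() {
  local label="$1" dir="$2" pattern="$3" count
  if [ ! -d "$dir" ]; then
    echo "$label: nothing to prune"
    return 0
  fi
  # Count candidates first so a no-op run can say so explicitly.
  count="$(find "$dir" -maxdepth 1 -type f -name "$pattern" \
             -mmin "+$OLDER_THAN_MIN" -print | wc -l | tr -d ' ')"
  if [ "$count" -eq 0 ]; then
    echo "$label: nothing to prune"
    return 0
  fi
  find "$dir" -maxdepth 1 -type f -name "$pattern" \
       -mmin "+$OLDER_THAN_MIN" -delete
  echo "$label: pruned $count file(s)"
}

# The three categories named in the commit message.
prune_category "azcopy logs"     "$HOME/.azcopy"       '*.log'
prune_category "azcopy plans"    "$HOME/.azcopy/plans" '*.steV20'
prune_category "legacy pm2 logs" "$HOME/.pm2/logs"     'dowhiz-*.log'

# Show where the root filesystem ended up.
df -h /
```

Running it twice in a row is safe: the second run finds nothing older than the threshold and only prints the "nothing to prune" lines.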
Summary
Fixes #1537. Production VM root partition hit 98% earlier today, causing every `az` invocation to fail with `OSError: [Errno 28] writing ~/.azure/az.sess` and cascading into ACI codex execution failures (`generate-sas failed`, `storage share delete failed`, `execution failed for container=...`). The primary space consumer was `~/.azcopy/` at 9.6 GB (1,320 scan logs + steV20 plan files) with no retention; the secondary was a 2.3 GB orphan pm2 log from a renamed service.

Changes
- `DoWhiz_service/scripts/cleanup_vm_logs.sh`: new idempotent script that prunes `~/.azcopy/*.log`, `~/.azcopy/plans/*.steV20`, and legacy `~/.pm2/logs/dowhiz-*.log` files older than 48 h (configurable via `OLDER_THAN_MIN`). Prints per-category size reclaimed and ending `df -h /`.
- `.github/workflows/CICD-production.yml` + `CICD-staging.yml`: after pm2 service restart, install `pm2-logrotate` (max_size 200M, retain 5, compress, daily rotate) and register a daily 3 am cron calling the new script. Both blocks are idempotent so re-running the deploy is safe.
- `DoWhiz_service/OPERATIONS.md`: new section `4.3 VM log retention` documenting the bootstrap for fresh VMs.
- `DoWhiz_service/scripts/status.sh`: show root disk usage and `~/.azcopy` size with thresholds (…).

Mitigation already applied on VMs (out-of-band)
While filing the issue I also ran this by hand on both prod and staging, so the fleet is recovered regardless of merge time:
- `pm2 flush` + removed orphaned `~/.pm2/logs/dowhiz-*` files.
- `find ~/.azcopy -maxdepth 1 -name '*.log' -mmin +60 -delete` + same for `plans/*.steV20` → freed ~9.6 GB on prod.
- `pm2-logrotate` with the settings above.
- `crontab -`.

Prod is now at 83% root usage (11 GB free) with `az` working and the scheduler executing tasks normally.

Test plan
- (`python3 -c "import yaml; yaml.safe_load(...)"`).
- `OLDER_THAN_MIN=99999999 bash scripts/cleanup_vm_logs.sh` (prints "nothing to prune" for all categories, confirming it is safe to run when nothing is old).
- `crontab -l | grep cleanup_vm_logs.sh` and `pm2 list | grep pm2-logrotate` show the new entries.
- `~/.azcopy` on staging has not grown past a few hundred MB.

Related
pm2-logrotate. Issue [P2] dw_worker-out.log grew to 2.15 GB on production — no log rotation #1465 can be closed once this merges and the next deploy runs.
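
For context on what "idempotent" means for the deploy steps, here is a rough sketch of how a cron entry can be registered without duplicating it on every deploy, alongside the `pm2-logrotate` settings listed above. The helper names `ensure_cron_line` and `configure_pm2_logrotate`, the `/path/to/...` script location, and the midnight rotation schedule are assumptions for illustration; the actual workflow steps may be written differently.

```shell
# ensure_cron_line ENTRY   (hypothetical helper)
# Reads an existing crontab on stdin and writes the result to stdout,
# appending ENTRY only when no identical line is already present.
ensure_cron_line() {
  local entry="$1" current
  current="$(cat)"
  if printf '%s\n' "$current" | grep -Fxq -- "$entry"; then
    printf '%s\n' "$current"            # already registered: unchanged
  elif [ -z "$current" ]; then
    printf '%s\n' "$entry"              # empty crontab: entry only
  else
    printf '%s\n%s\n' "$current" "$entry"
  fi
}

# configure_pm2_logrotate   (hypothetical helper, defined but not called here)
# Applies the retention settings from the PR description. `pm2 set` overwrites
# the stored value, so repeating it on every deploy does not accumulate state.
configure_pm2_logrotate() {
  pm2 install pm2-logrotate
  pm2 set pm2-logrotate:max_size 200M
  pm2 set pm2-logrotate:retain 5
  pm2 set pm2-logrotate:compress true
  pm2 set pm2-logrotate:rotateInterval '0 0 * * *'   # assumed: daily at midnight
}

# Daily 3 am run of the cleanup script; the path is a placeholder.
CRON_ENTRY='0 3 * * * bash /path/to/DoWhiz_service/scripts/cleanup_vm_logs.sh'

# Usage on a VM (left commented out so the sketch has no side effects):
#   crontab -l 2>/dev/null | ensure_cron_line "$CRON_ENTRY" | crontab -
#   configure_pm2_logrotate
```

Keeping the merge step as a stdin-to-stdout filter means it can be checked without touching a real crontab, and piping the result into `crontab -` replaces the installed table in one step.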