ops: backup failure alerting — healthchecks.io dead-man's switch #148

Merged: GitAddRemote merged 1 commit into main from feature/ISSUE-133 on Apr 28, 2026

@GitAddRemote
Owner

Summary

  • Adds infra/docs/backups.md — the operational runbook for the backup system covering all 7 sections from the issue spec: what is backed up, where it lives, how to verify, alert response checklist, retention policy, silencing false alarms, and how to restore
  • Updates infra/README.md to reference issue #133 (notify when the nightly backup cron fails) and link to the new doc

Context

The backup script (backup-db.sh), healthcheck ping logic, BACKUP_HEALTHCHECK_URL secret injection in release.yml, and cron job setup in bootstrap-vps.sh were all already in place from earlier issues (#125, #128). The only remaining code-deliverable for #133 was the runbook.
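
For readers who have not seen the earlier issues, the ping pattern can be sketched roughly as follows. This is an illustration only, not the contents of backup-db.sh: the helper names `hc_ping_url` and `run_with_ping` are hypothetical, and the explicit `/fail` ping is an optional extra that the real script may not use. It relies on healthchecks.io treating a request to the bare ping URL as success and a request to the same URL with `/fail` appended as a failure signal.

```shell
#!/usr/bin/env bash
# Sketch only: hc_ping_url and run_with_ping are hypothetical helper names,
# not functions taken from backup-db.sh.

# Build the URL to ping for a given exit status.
# Bare URL = success; <url>/fail = explicit failure signal.
hc_ping_url() {
  local base="$1" status="$2"
  if [ "$status" -eq 0 ]; then
    printf '%s\n' "$base"
  else
    printf '%s/fail\n' "$base"
  fi
}

# Run a command, then report its outcome. If BACKUP_HEALTHCHECK_URL is
# unset or empty, skip the ping so local runs still work.
run_with_ping() {
  local status=0
  "$@" || status=$?
  if [ -n "${BACKUP_HEALTHCHECK_URL:-}" ]; then
    # -f: treat HTTP errors as failures, -sS: quiet but show errors,
    # -m 10: 10 second timeout, --retry 3: retry transient failures.
    # A failed ping must never mask the command's own exit status.
    curl -fsS -m 10 --retry 3 -o /dev/null \
      "$(hc_ping_url "$BACKUP_HEALTHCHECK_URL" "$status")" || true
  fi
  return "$status"
}
```

A cron entry could then call something like `run_with_ping /path/to/backup-db.sh` (path is a placeholder). Even without the `/fail` ping, the dead-man's switch still works, because the alert fires when success pings stop arriving.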

The remaining DoD items are manual operational steps:

  • Create the healthchecks.io check (24h period, 1h grace) and configure alert channels
  • Add BACKUP_HEALTHCHECK_URL to GitHub production environment secrets
  • End-to-end verification: temporarily remove the ping, confirm alert fires, restore it
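
To make the 24h period and 1h grace concrete, the sketch below classifies a check by the time elapsed since its last successful ping. The function name `check_state` and the state labels are illustrative assumptions, and the exact boundary handling on healthchecks.io may differ. The takeaway is that an alert should be expected about 25 hours after the last ping, which is how long the end-to-end verification may need to wait.

```shell
#!/usr/bin/env bash
# Sketch only: check_state is a hypothetical helper, and the labels
# "up", "late", and "down" are illustrative.

PERIOD=$((24 * 3600))   # 24h period, from the check settings above
GRACE=$((1 * 3600))     # 1h grace

# Classify a check given the seconds elapsed since its last ping.
check_state() {
  local age="$1"
  if [ "$age" -le "$PERIOD" ]; then
    echo up      # ping arrived within the expected period
  elif [ "$age" -le $((PERIOD + GRACE)) ]; then
    echo late    # period elapsed, still inside the grace window
  else
    echo down    # period + grace elapsed: alert channels are notified
  fi
}
```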

Test plan

  • Review infra/docs/backups.md for accuracy against the live scripts (backup-db.sh, restore-db.sh, bootstrap-vps.sh)
  • Confirm restore command example in the doc matches restore-db.sh usage ($0 <b2-path>)
  • Confirm rclone ls command uses correct config flag
  • Confirm infra/README.md conflict resolved cleanly (ISSUE-133 blurb appears before the Redis Persistence section)
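
As a reference point while reviewing, a script with a `$0 <b2-path>` usage line typically validates its argument along the lines below, and an rclone listing with an explicit config file uses the global `--config` flag. Both helpers are hypothetical sketches, not code from restore-db.sh, and the paths passed to them are placeholders to be checked against the live scripts.

```shell
#!/usr/bin/env bash
# Sketch only: require_b2_path and rclone_ls_cmd are hypothetical helpers.

# Reject a missing or empty argument, mirroring a "$0 <b2-path>" usage line.
require_b2_path() {
  if [ "$#" -ne 1 ] || [ -z "$1" ]; then
    echo "Usage: $0 <b2-path>" >&2
    return 1
  fi
}

# Print (rather than run) an rclone listing command that points at an
# explicit config file. Both arguments are placeholders supplied by the caller.
rclone_ls_cmd() {
  local config="$1" remote_path="$2"
  printf 'rclone ls --config %s %s\n' "$config" "$remote_path"
}
```

Printing the command instead of running it keeps the sketch safe to try on a machine without rclone or bucket credentials.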

Closes #133

Adds infra/docs/backups.md covering all 7 sections required by #133:

- What is backed up (PostgreSQL, nightly at 3 AM UTC + pre-deploy)
- Where backups live (B2 bucket path structure)
- How to verify backups are running (log tail + healthchecks.io)
- What to do when a backup alert fires (5-step checklist)
- Retention policy (180-day B2 lifecycle)
- How to silence a false alarm (healthchecks.io pause/mute)
- How to restore using restore-db.sh

Also updates infra/README.md to reference #133 and the new doc.
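
When answering "is this backup still restorable?", it can help to estimate when a given night's backup reaches the 180-day limit. The sketch below assumes the 180 days are counted from the upload date and that GNU `date` is available. The helper name `expiry_date` is hypothetical, and B2 may remove the file somewhat later than the computed day.

```shell
#!/usr/bin/env bash
# Sketch only: expiry_date is a hypothetical helper. Requires GNU date.

RETENTION_DAYS=180   # from the retention policy section of the runbook

# Print the date (YYYY-MM-DD, UTC) on which a backup uploaded on $1
# reaches the retention limit.
expiry_date() {
  date -u -d "$1 +${RETENTION_DAYS} days" +%F
}
```

For example, `expiry_date 2026-04-28` prints `2026-10-25`.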

Closes #133
@GitAddRemote GitAddRemote merged commit 291fae4 into main Apr 28, 2026
1 check passed
@GitAddRemote GitAddRemote deleted the feature/ISSUE-133 branch April 28, 2026 20:52