Operations Runbook
Karthik edited this page Mar 1, 2026 · 1 revision
- Service health and error-rate monitoring
- Queue/backlog growth review
- Critical alert triage and acknowledgement
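The error-rate and backlog checks above can be sketched as simple threshold tests. This is a minimal sketch, assuming request counts, error counts, and queue-depth samples have already been pulled from whatever monitoring source is in use. The function names, the 2% error-rate limit, and the three-interval growth window are illustrative assumptions, not values defined by this runbook.

```python
from typing import Sequence


def error_rate(errors: int, requests: int) -> float:
    """Return errors as a fraction of requests (0.0 when there was no traffic)."""
    if requests <= 0:
        return 0.0
    return errors / requests


def backlog_is_growing(depth_samples: Sequence[int], min_rises: int = 3) -> bool:
    """Return True if queue depth rose in each of the last `min_rises` intervals."""
    if len(depth_samples) < min_rises + 1:
        return False  # not enough history to call it a trend
    recent = depth_samples[-(min_rises + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


def needs_triage(
    errors: int,
    requests: int,
    depth_samples: Sequence[int],
    max_error_rate: float = 0.02,  # assumed threshold, tune per service
) -> bool:
    """Flag a service for triage when either check trips."""
    too_many_errors = error_rate(errors, requests) > max_error_rate
    return too_many_errors or backlog_is_growing(depth_samples)
```

Requiring several consecutive rises, rather than a single one, keeps a brief spike in queue depth from paging anyone.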
- Identify impacted domain (auth, ingestion, API, alerts, orders).
- Confirm blast radius (users/regions/roles).
- Apply mitigation (rollback, config switch, dependency failover).
- Validate restoration and monitor stability.
- Restart/redeploy specific Lambda functions
- Reconcile failed order/shipment events
- Reprocess telemetry batches when safe
- Validate Cognito group and permissions mapping
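The Cognito validation step above could be scripted along these lines. This is a sketch under stated assumptions: `client` is expected to behave like a boto3 `cognito-idp` client (only `admin_list_groups_for_user` is called), while the group-to-permission mapping and the permission names are hypothetical, since this runbook does not define them.

```python
from typing import Any, Dict, Set


def groups_for_user(client: Any, user_pool_id: str, username: str) -> Set[str]:
    """Collect every Cognito group name for a user, following pagination."""
    names: Set[str] = set()
    kwargs: Dict[str, Any] = {"UserPoolId": user_pool_id, "Username": username}
    while True:
        page = client.admin_list_groups_for_user(**kwargs)
        names.update(group["GroupName"] for group in page.get("Groups", []))
        token = page.get("NextToken")
        if not token:
            return names
        kwargs["NextToken"] = token


def missing_permissions(
    groups: Set[str],
    group_permissions: Dict[str, Set[str]],  # hypothetical mapping, not from the runbook
    required: Set[str],
) -> Set[str]:
    """Return the required permissions that none of the user's groups grant."""
    granted: Set[str] = set()
    for group in groups:
        granted |= group_permissions.get(group, set())
    return required - granted
```

In practice `client` would typically be created with `boto3.client("cognito-idp")`. Passing it in as a parameter keeps the check easy to exercise against a stub before pointing it at a real user pool. An empty result from `missing_permissions` means the mapping covers everything the role needs.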
- Capture root cause and timeline
- Add guardrail tests for recurrence prevention
- Update this runbook with concrete lessons learned
Last updated: 2026-03-01