Operations Runbook

Karthik edited this page Mar 1, 2026 · 1 revision

Daily Checks

  • Service health and error-rate monitoring
  • Queue/backlog growth review
  • Critical alert triage and acknowledgement
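The backlog and error-rate checks above can be partly automated. The following is a minimal sketch, assuming queue depth is sampled at regular intervals and passed in as a list of integers. The function names, the `min_samples` default, and the `tolerance` parameter are illustrative assumptions, not part of any existing tooling.

```python
from typing import Sequence


def error_rate(errors: int, requests: int) -> float:
    """Fraction of failed requests; 0.0 when there was no traffic."""
    return errors / requests if requests else 0.0


def backlog_is_growing(
    depths: Sequence[int],
    min_samples: int = 4,   # assumed: fewer samples is too noisy to judge
    tolerance: int = 0,     # assumed: ignore increases of this size or smaller
) -> bool:
    """Return True when queue depth rose between every consecutive pair of samples."""
    if len(depths) < min_samples:
        return False
    return all(
        later - earlier > tolerance
        for earlier, later in zip(depths, depths[1:])
    )
```

A single dip in the series resets the signal, so this flags only sustained growth. A noisier queue might need a moving average instead.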

On-Call Triage Flow

  1. Identify impacted domain (auth, ingestion, API, alerts, orders).
  2. Confirm blast radius (users/regions/roles).
  3. Apply mitigation (rollback, config switch, dependency failover).
  4. Validate restoration and monitor stability.
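One way to keep the four steps above in order during an incident is to record them in a small structure that rejects out-of-order entries. This is a hedged sketch: the `Incident` and `TriageStep` names are hypothetical, and the domain set is taken from step 1.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Dict, Optional


class TriageStep(IntEnum):
    IDENTIFY_DOMAIN = 1
    CONFIRM_BLAST_RADIUS = 2
    APPLY_MITIGATION = 3
    VALIDATE_RESTORATION = 4


# Domains listed in step 1 of the triage flow.
DOMAINS = {"auth", "ingestion", "API", "alerts", "orders"}


@dataclass
class Incident:
    domain: str
    notes: Dict[TriageStep, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if self.domain not in DOMAINS:
            raise ValueError(f"unknown domain: {self.domain}")
        # Creating the incident completes step 1.
        self.notes[TriageStep.IDENTIFY_DOMAIN] = f"domain={self.domain}"

    def next_step(self) -> Optional[TriageStep]:
        """First step without a note, or None when all steps are recorded."""
        for step in TriageStep:
            if step not in self.notes:
                return step
        return None

    def record(self, step: TriageStep, note: str) -> None:
        """Record a step; refuse to skip ahead or repeat a finished flow."""
        expected = self.next_step()
        if step != expected:
            raise ValueError(f"expected {expected}, got {step}")
        self.notes[step] = note
```

For example, `Incident("orders").record(TriageStep.APPLY_MITIGATION, "rollback")` raises `ValueError`, because the blast radius has not been confirmed yet.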

Common Operational Tasks

  • Restart/redeploy specific Lambda functions
  • Reconcile failed order/shipment events
  • Reprocess telemetry batches when safe
  • Validate Cognito group and permissions mapping
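For the Lambda restart task, note that Lambda has no dedicated restart call. A common workaround is to change the function configuration, which should cause later invocations to start in fresh execution environments. The sketch below assumes a boto3 Lambda client is passed in; `force_cold_start` and the `RESTART_MARKER` variable are hypothetical names. It merges the existing environment variables first, because `update_function_configuration` replaces the whole variable map.

```python
import time
from typing import Any, Dict, Optional


def force_cold_start(
    lambda_client: Any,                    # e.g. boto3.client("lambda")
    function_name: str,
    marker_key: str = "RESTART_MARKER",    # hypothetical variable name
    now: Optional[int] = None,             # injectable timestamp for testing
) -> Dict[str, str]:
    """Bump a marker environment variable so new invocations cold-start."""
    current = lambda_client.get_function_configuration(FunctionName=function_name)
    # Preserve existing variables; the update call overwrites the full map.
    variables = dict(current.get("Environment", {}).get("Variables", {}))
    variables[marker_key] = str(now if now is not None else int(time.time()))
    lambda_client.update_function_configuration(
        FunctionName=function_name,
        Environment={"Variables": variables},
    )
    return variables
```

Passing the client in keeps the function testable with a stub. If the function is deployed through infrastructure-as-code, a redeploy through that pipeline is likely the safer path, since a manual marker variable may show up as configuration drift.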

Post-Incident

  • Capture root cause and timeline
  • Add guardrail tests for recurrence prevention
  • Update this runbook with concrete lessons learned
