Runbook

Susan Valente edited this page Apr 17, 2026 · 31 revisions

This document describes common procedures for operating Data.gov.

Related procedures

Services

Alerts

We designate two types of alerts:

  • Critical -- drop what you are doing; an outage is happening or action is required to prevent one. Critical alerts go to #datagov-alerts.
  • Warning -- indicates a problem but can wait until the next business day. Warnings go to #datagov-alerts as email notifications.

New Relic: Host unavailable

Triggered when a host has not reported to New Relic for 5 minutes.

Resolution

  • Check New Relic for obvious issues (high memory or CPU load)
  • If the app is down, check cloud.gov for application status and recent deploy activity
  • Restart the application via cloud.gov if needed
  • If the issue cannot be resolved, open a ticket with BSP or FCS requests
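
The cloud.gov steps above can be sketched with the Cloud Foundry CLI (`cf`), which cloud.gov uses for application management. This is a minimal illustration, not the team's documented procedure: the app name `example-app` and the helper names are hypothetical, and the script assumes you are already logged in and targeting the right org and space. By default it only prints the commands so they can be reviewed before anything runs.

```python
import subprocess


def triage_commands(app_name):
    """Return read-only cf CLI commands for a first look at an app."""
    return [
        ["cf", "app", app_name],     # current state and per-instance health
        ["cf", "events", app_name],  # recent events, including deploys and crashes
    ]


def restart_command(app_name):
    """Return the cf CLI command that restarts the app."""
    return ["cf", "restart", app_name]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print("$", " ".join(cmd))
        if not dry_run:
            # check=False: keep going so every command's output is visible
            subprocess.run(cmd, check=False)


# "example-app" is a placeholder, not a real Data.gov application name.
run(triage_commands("example-app"))
```

Run the restart separately, and only after the status and events output points to the application itself rather than a platform issue: `run([restart_command("example-app")], dry_run=False)`.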

New Relic: High error rate

Triggered when 4xx or 5xx error rates exceed thresholds.

Resolution

  • Check New Relic for which service is affected
  • Review cloud.gov logs for the affected application
  • Check recent deploys for a likely cause
  • If the issue cannot be resolved, escalate to the contractor team via #datagov-dev
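
When reviewing logs for the affected application, a quick tally of response codes can confirm whether the alert reflects client errors (4xx) or server errors (5xx). The sketch below assumes the log text was saved from `cf logs APP_NAME --recent` and that router lines carry an access-log style request string followed by the status code; the sample lines and the function name are made up for illustration, and the pattern may need adjusting to the actual log format.

```python
import re
from collections import Counter

# Assumed shape: "METHOD /path HTTP/1.1" 503 ...
STATUS_RE = re.compile(r'"[A-Z]+ \S+ HTTP/[\d.]+" (\d{3})\b')


def count_error_classes(log_lines):
    """Count responses by class (ok, 4xx, 5xx) across access-log style lines."""
    counts = Counter()
    for line in log_lines:
        match = STATUS_RE.search(line)
        if match is None:
            continue  # not a request line (app output, staging messages, ...)
        status = int(match.group(1))
        if 400 <= status < 500:
            counts["4xx"] += 1
        elif status >= 500:
            counts["5xx"] += 1
        else:
            counts["ok"] += 1
    return counts


# Illustrative lines only; real router output carries more fields.
sample = [
    'example.app.cloud.gov - [ts] "GET /dataset HTTP/1.1" 200 0 5120',
    'example.app.cloud.gov - [ts] "GET /missing HTTP/1.1" 404 0 312',
    'example.app.cloud.gov - [ts] "POST /api/3/action HTTP/1.1" 502 87 0',
    'Starting application process',
]
print(dict(count_error_classes(sample)))
```

A spike in 5xx responses that begins right after a deploy points toward the recent-deploy check, while a rise in 4xx alone more often reflects client traffic such as crawlers or broken links.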

Troubleshooting Nessus scans

SecOps performs regular scans on our hosts. If the ISSO contacts us regarding IPs that could not be authenticated, see Nessus Agent Upgrade for current procedures.


Note: This runbook is a stub and needs contractor input to document current alert resolution procedures for the cloud.gov-based stack.
