
feat(vet): add pup vet observability health check command #142

Draft
platinummonkey wants to merge 5 commits into main from feat/vet-command

Conversation

platinummonkey (Collaborator) commented Mar 1, 2026

Summary

Implementation of pup vet from discussion #132 — a health check command that surfaces universally broken Datadog configurations without assuming org-specific structure. A single monitors-list API call powers all checks except pager-burden, which adds a secondary Events API call for 30-day alert history.

Checks

| Check | Severity | What it catches |
| --- | --- | --- |
| silent-monitors | CRITICAL | Monitors with no @-mention — alerts fire into the void |
| stale-monitors | WARNING | Monitors in "No Data" state — abandoned or misconfigured |
| muted-forgotten | WARNING | Monitors muted indefinitely or with a silence expiry >30 days out |
| untagged-monitors | WARNING | Monitors with no tags — can't be filtered, routed, or grouped |
| no-recovery-threshold | INFO | Critical threshold set but no critical_recovery — flapping risk |
| fast-renotify-interval | INFO | Config audit: renotify_interval ≤60 min — will spam on-call if fired |
| pager-burden | WARNING | Top paging monitors by 30d alert history, ranked by frequency; DD On-Call, PagerDuty, OpsGenie, VictorOps |
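To illustrate the shape of the check logic, here is a minimal sketch of the muted-forgotten predicate, assuming a Unix-seconds silence expiry where None means muted indefinitely — names and types are illustrative, not the PR's actual code:

```rust
const THIRTY_DAYS_SECS: i64 = 30 * 24 * 60 * 60;

/// True when a monitor is muted indefinitely (no expiry) or its silence
/// expiry is more than 30 days in the future.
fn muted_forgotten(is_muted: bool, silence_expiry: Option<i64>, now: i64) -> bool {
    is_muted
        && match silence_expiry {
            None => true,                              // muted indefinitely
            Some(exp) => exp - now > THIRTY_DAYS_SECS, // expiry >30 days out
        }
}
```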

Usage

pup vet                              # run all checks
pup vet --tags=team:platform         # scope by monitor tags
pup vet --check=pager-burden         # single check
pup vet --severity=warning           # filter output by minimum severity
pup vet list                         # list all available checks

Human output example:

CRITICAL: silent-monitors (2 found)
  - #12345 "High CPU on web-api" (no notification channel (@-mention) in message)
  - #67890 "Disk usage > 90%" (no notification channel (@-mention) in message)
  -> Add @mention or notification channel so alerts reach on-call responders

WARNING: pager-burden (3 found)
  - #11111 "Payment Latency" (23 pages (30d), currently alerting via PagerDuty (@pagerduty-payments), re-notifying every 15 min [team:platform])
  - #22222 "Auth Errors" (8 pages (30d) via Datadog On-Call (@oncall-platform) [team:platform])
  - #33333 "Cache Miss Rate" (currently alerting via PagerDuty (@pagerduty-api) [team:backend])
  -> Investigate top contributors — Datadog On-Call and PagerDuty pages wake up on-call responders

PASSED: stale-monitors, muted-forgotten, untagged-monitors, no-recovery-threshold, fast-renotify-interval

Summary: 1 critical, 1 warning, 5 passed

Agent mode returns a structured JSON envelope with next_action guidance when criticals are present.

Architecture

src/ops/
├── mod.rs
└── vet.rs       # check engine: Severity, Resource, Finding, VetResult
                 # handle parsing + pager-tool classification
                 # async event history fetch (pager-burden only)
src/commands/
└── vet.rs       # clap CLI wrapper, human + agent output

pager-burden detail:

  • Parses @-handles from monitor messages, classifies by tool (DD On-Call @oncall-*, PagerDuty @pagerduty*, OpsGenie @opsgenie-*, VictorOps @victorops-*)
  • Fetches 30d of monitor alert events from /api/v1/events (raw JSON — typed Event model omits monitor_id)
  • Sort key is page count × 2 plus a +1 bonus for currently-alerting monitors, so active alerts rank first within the same count bucket
  • Degrades silently if the Events API is unavailable (missing permissions, etc.)
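A minimal sketch of the handle extraction, classification, and sort key described above — the function names mirror ones mentioned in the PR, but the bodies are illustrative assumptions, not the actual implementation:

```rust
/// Extract @-handles from a monitor message (simplified: whitespace-delimited,
/// trailing punctuation stripped).
fn extract_handles(message: &str) -> Vec<String> {
    message
        .split_whitespace()
        .filter(|w| w.starts_with('@'))
        .map(|w| {
            w.trim_end_matches(|c: char| !(c.is_alphanumeric() || c == '-' || c == '_'))
                .to_string()
        })
        .collect()
}

/// Classify a handle by pager tool using the prefixes listed above
/// (matched case-insensitively, since handle casing varies).
fn classify_handle(handle: &str) -> Option<&'static str> {
    let h = handle.to_ascii_lowercase();
    if h.starts_with("@oncall-") {
        Some("Datadog On-Call")
    } else if h.starts_with("@pagerduty") {
        Some("PagerDuty")
    } else if h.starts_with("@opsgenie-") {
        Some("OpsGenie")
    } else if h.starts_with("@victorops-") {
        Some("VictorOps")
    } else {
        None
    }
}

/// Sort key: page count weighted ×2 plus a +1 alerting bonus, so a currently
/// alerting monitor outranks a quiet one with the same count, but never one
/// with a higher count.
fn sort_key(page_count: u64, currently_alerting: bool) -> u64 {
    page_count * 2 + u64::from(currently_alerting)
}
```

The ×2 weighting keeps the "float within the same count bucket" property: 2n < 2n+1 < 2(n+1) for any count n.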

SDK dependency

This PR patches datadog-api-client via [patch.crates-io], pointing at a local branch carrying the fix proposed upstream in DataDog/datadog-api-client-rust#1292.

The MonitorThresholds deserializer panics with invalid type: string "", expected f64 when the Datadog API returns "" instead of null for unset threshold values (service-check and composite monitors). Without the patch, pup vet — and all other monitor commands — fail for any org that has one of these monitors.

The Cargo.toml patch entry has a TODO: remove comment so it's easy to clean up once the upstream fix is released.
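For reference, the patch entry would look something like the following — the repository URL and branch name here are placeholders, since the PR doesn't state the actual fix branch:

```toml
# TODO: remove once DataDog/datadog-api-client-rust#1292 ships in a release.
[patch.crates-io]
# Branch name is illustrative — substitute the real fix branch.
datadog-api-client = { git = "https://github.com/DataDog/datadog-api-client-rust", branch = "fix-branch" }
```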

Open questions / next steps

  • Pagination: monitors fetch is capped at 1000; events fetch returns the most recent ~1000 events. Cursor-based paging for both would improve accuracy in large orgs
  • stale-monitors: currently catches the "No Data" snapshot — consider --stale-days flag using monitor.modified as a proxy for duration
  • pager-burden team grouping: resources are a flat sorted list; human output could group by team: tag for readability
  • On-Call schedule cross-reference: @team-<handle> routes through DD On-Call only if On-Call is configured for that team — currently undetectable from monitor message alone
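The pagination item above could be addressed with a generic cursor loop; a minimal sketch, where fetch_page is a hypothetical callback standing in for the real client call and u64 stands in for the item and cursor types:

```rust
/// Drain all pages by following cursors until the API reports no next page.
/// `fetch_page` takes the current cursor (None for the first page) and returns
/// a batch of items plus the next cursor, if any.
fn fetch_all<F>(mut fetch_page: F) -> Vec<u64>
where
    F: FnMut(Option<u64>) -> (Vec<u64>, Option<u64>),
{
    let mut all = Vec::new();
    let mut cursor = None;
    loop {
        let (batch, next_cursor) = fetch_page(cursor);
        all.extend(batch);
        match next_cursor {
            Some(c) => cursor = Some(c),
            None => break,
        }
    }
    all
}
```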

Closes #132


🤖 Generated with Claude Code

platinummonkey and others added 5 commits February 28, 2026 19:12
Implements `pup vet` per the design in discussion #132. Surfaces
universally broken Datadog configurations with a single monitors
API call.

Checks implemented:
- silent-monitors (CRITICAL): monitors with no @-mention/notification
  channel in their message — alerts fire into the void
- stale-monitors (WARNING): monitors currently in "No Data" state —
  abandoned or misconfigured data source
- muted-forgotten (WARNING): monitors muted indefinitely or with a
  silence expiry >30 days out

New files:
- src/ops/mod.rs + src/ops/vet.rs — check engine with shared types
  (Severity, Resource, Finding, VetResult)
- src/commands/vet.rs — CLI presentation layer (human + agent mode)

Usage:
  pup vet                          # run all checks
  pup vet --tags=team:platform     # scope by monitor tags
  pup vet --check=silent-monitors  # single check
  pup vet --severity=critical      # filter output
  pup vet list                     # list available checks

Also adds a [patch.crates-io] for datadog-api-client pointing at a
local fix branch for a serde deserialisation crash when the Datadog
API returns "" for f64 threshold fields (MonitorThresholds). Upstream
fix proposed at DataDog/datadog-api-client-rust#1292.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…th checks

Three new checks, all running against the existing single monitors list call:

- untagged-monitors (WARNING): monitors with no tags — can't be filtered,
  routed, or grouped by team/service/env

- no-recovery-threshold (INFO): monitors with a critical threshold but no
  critical_recovery threshold — without hysteresis the monitor can flap
  at the alert boundary

- on-call-health (WARNING): monitors currently in ALERT state with a
  renotify_interval between 1–60 minutes — actively re-paging on-call
  responders and contributing to notification fatigue

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…tify-interval

pager-burden (WARNING):
  Surfaces monitors currently in ALERT state that are actively routing
  through high-impact pager tools — Datadog On-Call (@oncall-*),
  PagerDuty (@PagerDuty*), OpsGenie (@opsgenie-*), VictorOps
  (@victorops-*). Pager-tool entries sort before Slack/webhook
  re-notifiers. Detail shows the specific tool and handle, plus
  renotify_interval if set.

  Also catches non-pager monitors with renotify_interval ≤60 min
  that are currently alerting (actively notifying frequently).

fast-renotify-interval (INFO, renamed from on-call-health):
  Now a pure configuration audit — any monitor with renotify_interval
  configured between 1–60 minutes, regardless of current alert state.
  Complements pager-burden: "this monitor will spam on-call if it fires"
  vs "this monitor is paging on-call right now".

Also adds unit tests for extract_handles and classify_handle.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…ributors

Previously pager-burden only looked at current alert state. Now it:

1. Fetches the last 30 days of monitor alert events via /api/v1/events,
   counting ERROR and WARNING events per monitor_id.

2. Surfaces monitors with pager-tool handles (@oncall-*, @PagerDuty*,
   @opsgenie-*, @victorops-*) that have fired recently OR are currently
   alerting. Quiet monitors with pager handles are not flagged.

3. Sorts by page count descending; currently-alerting monitors float to
   the top within the same count bucket, since they're actively paging
   right now.

4. Detail includes page count with lookback window, pager tool + handle,
   renotify_interval if set, and team tag if present:
     "23 pages (30d), currently alerting via PagerDuty (@pagerduty-payments),
      re-notifying every 15 min [team:platform]"

The Events API call uses the raw JSON path (crate::api::get) since the
typed Event model doesn't expose monitor_id — it lands in
additional_properties. Degrades silently on API failure so the check
doesn't block the rest of pup vet.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
