
feat(vet): add pup vet observability health check command #142

Draft
platinummonkey wants to merge 5 commits into main from feat/vet-command

Conversation

platinummonkey (Collaborator) commented Mar 1, 2026

Summary

Implementation of pup vet from discussion #132 — a health check command that surfaces universally broken Datadog configurations without assuming org-specific structure. A single monitors-list API call powers all checks except pager-burden, which adds a secondary Events API call for 30-day alert history.

Checks

| Check | Severity | What it catches |
| --- | --- | --- |
| silent-monitors | CRITICAL | Monitors with no @-mention — alerts fire into the void |
| stale-monitors | WARNING | Monitors in "No Data" state — abandoned or misconfigured |
| muted-forgotten | WARNING | Monitors muted indefinitely or with a silence expiry >30 days out |
| untagged-monitors | WARNING | Monitors with no tags — can't be filtered, routed, or grouped |
| no-recovery-threshold | INFO | Critical threshold set but no critical_recovery — flapping risk |
| fast-renotify-interval | INFO | Config audit: renotify_interval ≤60 min — will spam on-call if fired |
| pager-burden | WARNING | Top paging monitors by 30d alert history, ranked by frequency; DD On-Call, PagerDuty, OpsGenie, VictorOps |
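To illustrate the shape of the check logic, here is a minimal sketch of the muted-forgotten predicate, assuming a Unix-seconds silence expiry where None means muted indefinitely — names and types are illustrative, not the PR's actual code:

```rust
const THIRTY_DAYS_SECS: i64 = 30 * 24 * 60 * 60;

/// True when a monitor is muted indefinitely (no expiry) or its silence
/// expiry is more than 30 days in the future.
fn muted_forgotten(is_muted: bool, silence_expiry: Option<i64>, now: i64) -> bool {
    is_muted
        && match silence_expiry {
            None => true,                              // muted indefinitely
            Some(exp) => exp - now > THIRTY_DAYS_SECS, // expiry >30 days out
        }
}
```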

Usage

pup vet                              # run all checks
pup vet --tags=team:platform         # scope by monitor tags
pup vet --check=pager-burden         # single check
pup vet --severity=warning           # filter output by minimum severity
pup vet list                         # list all available checks

Human output example:

CRITICAL: silent-monitors (2 found)
  - #12345 "High CPU on web-api" (no notification channel (@-mention) in message)
  - #67890 "Disk usage > 90%" (no notification channel (@-mention) in message)
  -> Add @mention or notification channel so alerts reach on-call responders

WARNING: pager-burden (3 found)
  - #11111 "Payment Latency" (23 pages (30d), currently alerting via PagerDuty (@pagerduty-payments), re-notifying every 15 min [team:platform])
  - #22222 "Auth Errors" (8 pages (30d) via Datadog On-Call (@oncall-platform) [team:platform])
  - #33333 "Cache Miss Rate" (currently alerting via PagerDuty (@pagerduty-api) [team:backend])
  -> Investigate top contributors — Datadog On-Call and PagerDuty pages wake up on-call responders

PASSED: stale-monitors, muted-forgotten, untagged-monitors, no-recovery-threshold, fast-renotify-interval

Summary: 1 critical, 1 warning, 5 passed

Agent mode returns a structured JSON envelope with next_action guidance when criticals are present.

Architecture

src/ops/
├── mod.rs
└── vet.rs       # check engine: Severity, Resource, Finding, VetResult
                 # handle parsing + pager-tool classification
                 # async event history fetch (pager-burden only)
src/commands/
└── vet.rs       # clap CLI wrapper, human + agent output

pager-burden detail:

  • Parses @-handles from monitor messages, classifies by tool (DD On-Call @oncall-*, PagerDuty @pagerduty*, OpsGenie @opsgenie-*, VictorOps @victorops-*)
  • Fetches 30d of monitor alert events from /api/v1/events (raw JSON — typed Event model omits monitor_id)
  • Sort key is page count × 2 plus a +1 bonus for currently-alerting monitors, so active alerts rank first within the same count bucket
  • Degrades silently if the Events API is unavailable (missing permissions, etc.)
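A minimal sketch of the handle extraction, classification, and sort key described above — the function names mirror ones mentioned in the PR, but the bodies are illustrative assumptions, not the actual implementation:

```rust
/// Extract @-handles from a monitor message (simplified: whitespace-delimited,
/// trailing punctuation stripped).
fn extract_handles(message: &str) -> Vec<String> {
    message
        .split_whitespace()
        .filter(|w| w.starts_with('@'))
        .map(|w| {
            w.trim_end_matches(|c: char| !(c.is_alphanumeric() || c == '-' || c == '_'))
                .to_string()
        })
        .collect()
}

/// Classify a handle by pager tool using the prefixes listed above
/// (matched case-insensitively, since handle casing varies).
fn classify_handle(handle: &str) -> Option<&'static str> {
    let h = handle.to_ascii_lowercase();
    if h.starts_with("@oncall-") {
        Some("Datadog On-Call")
    } else if h.starts_with("@pagerduty") {
        Some("PagerDuty")
    } else if h.starts_with("@opsgenie-") {
        Some("OpsGenie")
    } else if h.starts_with("@victorops-") {
        Some("VictorOps")
    } else {
        None
    }
}

/// Sort key: page count weighted ×2 plus a +1 alerting bonus, so a currently
/// alerting monitor outranks a quiet one with the same count, but never one
/// with a higher count.
fn sort_key(page_count: u64, currently_alerting: bool) -> u64 {
    page_count * 2 + u64::from(currently_alerting)
}
```

The ×2 weighting keeps the "float within the same count bucket" property: 2n < 2n+1 < 2(n+1) for any count n.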

SDK dependency

This PR patches datadog-api-client via [patch.crates-io], pointing at a local branch carrying the fix proposed upstream in DataDog/datadog-api-client-rust#1292.

The MonitorThresholds deserializer panics with invalid type: string "", expected f64 when the Datadog API returns "" instead of null for unset threshold values (service-check and composite monitors). Without the patch, pup vet — and all other monitor commands — fail for any org that has one of these monitors.

The Cargo.toml patch entry has a TODO: remove comment so it's easy to clean up once the upstream fix is released.
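For reference, the patch entry would look something like the following — the repository URL and branch name here are placeholders, since the PR doesn't state the actual fix branch:

```toml
# TODO: remove once DataDog/datadog-api-client-rust#1292 ships in a release.
[patch.crates-io]
# Branch name is illustrative — substitute the real fix branch.
datadog-api-client = { git = "https://github.com/DataDog/datadog-api-client-rust", branch = "fix-branch" }
```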

Open questions / next steps

  • Pagination: monitors fetch is capped at 1000; events fetch returns the most recent ~1000 events. Cursor-based paging for both would improve accuracy in large orgs
  • stale-monitors: currently catches the "No Data" snapshot — consider --stale-days flag using monitor.modified as a proxy for duration
  • pager-burden team grouping: resources are a flat sorted list; human output could group by team: tag for readability
  • On-Call schedule cross-reference: @team-<handle> routes through DD On-Call only if On-Call is configured for that team — currently undetectable from monitor message alone
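The pagination item above could be addressed with a generic cursor loop; a minimal sketch, where fetch_page is a hypothetical callback standing in for the real client call and u64 stands in for the item and cursor types:

```rust
/// Drain all pages by following cursors until the API reports no next page.
/// `fetch_page` takes the current cursor (None for the first page) and returns
/// a batch of items plus the next cursor, if any.
fn fetch_all<F>(mut fetch_page: F) -> Vec<u64>
where
    F: FnMut(Option<u64>) -> (Vec<u64>, Option<u64>),
{
    let mut all = Vec::new();
    let mut cursor = None;
    loop {
        let (batch, next_cursor) = fetch_page(cursor);
        all.extend(batch);
        match next_cursor {
            Some(c) => cursor = Some(c),
            None => break,
        }
    }
    all
}
```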

Closes #132


🤖 Generated with Claude Code

platinummonkey and others added 5 commits February 28, 2026 19:12
Implements `pup vet` per the design in discussion #132. Surfaces
universally broken Datadog configurations with a single monitors
API call.

Checks implemented:
- silent-monitors (CRITICAL): monitors with no @-mention/notification
  channel in their message — alerts fire into the void
- stale-monitors (WARNING): monitors currently in "No Data" state —
  abandoned or misconfigured data source
- muted-forgotten (WARNING): monitors muted indefinitely or with a
  silence expiry >30 days out

New files:
- src/ops/mod.rs + src/ops/vet.rs — check engine with shared types
  (Severity, Resource, Finding, VetResult)
- src/commands/vet.rs — CLI presentation layer (human + agent mode)

Usage:
  pup vet                          # run all checks
  pup vet --tags=team:platform     # scope by monitor tags
  pup vet --check=silent-monitors  # single check
  pup vet --severity=critical      # filter output
  pup vet list                     # list available checks

Also adds a [patch.crates-io] for datadog-api-client pointing at a
local fix branch for a serde deserialisation crash when the Datadog
API returns "" for f64 threshold fields (MonitorThresholds). Upstream
fix proposed at DataDog/datadog-api-client-rust#1292.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…th checks

Three new checks, all running against the existing single monitors list call:

- untagged-monitors (WARNING): monitors with no tags — can't be filtered,
  routed, or grouped by team/service/env

- no-recovery-threshold (INFO): monitors with a critical threshold but no
  critical_recovery threshold — without hysteresis the monitor can flap
  at the alert boundary

- on-call-health (WARNING): monitors currently in ALERT state with a
  renotify_interval between 1–60 minutes — actively re-paging on-call
  responders and contributing to notification fatigue

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…tify-interval

pager-burden (WARNING):
  Surfaces monitors currently in ALERT state that are actively routing
  through high-impact pager tools — Datadog On-Call (@oncall-*),
  PagerDuty (@PagerDuty*), OpsGenie (@opsgenie-*), VictorOps
  (@victorops-*). Pager-tool entries sort before Slack/webhook
  re-notifiers. Detail shows the specific tool and handle, plus
  renotify_interval if set.

  Also catches non-pager monitors with renotify_interval ≤60 min
  that are currently alerting (actively notifying frequently).

fast-renotify-interval (INFO, renamed from on-call-health):
  Now a pure configuration audit — any monitor with renotify_interval
  configured between 1–60 minutes, regardless of current alert state.
  Complements pager-burden: "this monitor will spam on-call if it fires"
  vs "this monitor is paging on-call right now".

Also adds unit tests for extract_handles and classify_handle.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…ributors

Previously pager-burden only looked at current alert state. Now it:

1. Fetches the last 30 days of monitor alert events via /api/v1/events,
   counting ERROR and WARNING events per monitor_id.

2. Surfaces monitors with pager-tool handles (@oncall-*, @PagerDuty*,
   @opsgenie-*, @victorops-*) that have fired recently OR are currently
   alerting. Quiet monitors with pager handles are not flagged.

3. Sorts by page count descending; currently-alerting monitors float to
   the top within the same count bucket, since they're actively paging
   right now.

4. Detail includes page count with lookback window, pager tool + handle,
   renotify_interval if set, and team tag if present:
     "23 pages (30d), currently alerting via PagerDuty (@pagerduty-payments),
      re-notifying every 15 min [team:platform]"

The Events API call uses the raw JSON path (crate::api::get) since the
typed Event model doesn't expose monitor_id — it lands in
additional_properties. Degrades silently on API failure so the check
doesn't block the rest of pup vet.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
