feat(vet): add pup vet observability health check command #142
Draft
platinummonkey wants to merge 5 commits into main
Implements `pup vet` per the design in discussion #132. Surfaces universally broken Datadog configurations with a single monitors API call.

Checks implemented:
- silent-monitors (CRITICAL): monitors with no @-mention/notification channel in their message, so alerts fire into the void
- stale-monitors (WARNING): monitors currently in "No Data" state, indicating an abandoned or misconfigured data source
- muted-forgotten (WARNING): monitors muted indefinitely or with a silence expiry >30 days out

New files:
- src/ops/mod.rs + src/ops/vet.rs: check engine with shared types (Severity, Resource, Finding, VetResult)
- src/commands/vet.rs: CLI presentation layer (human + agent mode)

Usage:
  pup vet                          # run all checks
  pup vet --tags=team:platform     # scope by monitor tags
  pup vet --check=silent-monitors  # single check
  pup vet --severity=critical      # filter output
  pup vet list                     # list available checks

Also adds a [patch.crates-io] for datadog-api-client pointing at a local fix branch for a serde deserialisation crash when the Datadog API returns "" for f64 threshold fields (MonitorThresholds). Upstream fix proposed at DataDog/datadog-api-client-rust#1292.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
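For readers who want a concrete picture of the check engine this commit describes, here is a minimal sketch of two of its checks. Only the type names `Severity`, `Resource`, and `Finding` come from the commit message; their fields, the `Monitor` and `Mute` types, and the function signatures are assumptions made for illustration and do not reflect the actual contents of `src/ops/vet.rs`.

```rust
/// Severity levels named in the PR; declaration order makes Critical the greatest.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Severity {
    Info,
    Warning,
    Critical,
}

/// Hypothetical mute state; the real SDK models silencing differently.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mute {
    NotMuted,
    Indefinite,
    /// Silence expires at this Unix timestamp (seconds).
    Until(i64),
}

/// Hypothetical trimmed-down monitor, standing in for the SDK's monitor type.
#[derive(Debug, Clone)]
pub struct Monitor {
    pub id: i64,
    pub name: String,
    pub message: String,
    pub mute: Mute,
}

/// A flagged resource (fields are assumptions).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Resource {
    pub id: i64,
    pub name: String,
}

/// The outcome of one check (fields are assumptions).
#[derive(Debug, Clone)]
pub struct Finding {
    pub check: &'static str,
    pub severity: Severity,
    pub resources: Vec<Resource>,
}

const THIRTY_DAYS_SECS: i64 = 30 * 24 * 60 * 60;

/// True if any whitespace-separated token looks like an @-mention.
fn has_mention(message: &str) -> bool {
    message
        .split_whitespace()
        .any(|tok| tok.len() > 1 && tok.starts_with('@'))
}

/// Collects the monitors matching `keep` as flagged resources.
fn collect(monitors: &[Monitor], keep: impl Fn(&Monitor) -> bool) -> Vec<Resource> {
    monitors
        .iter()
        .filter(|&m| keep(m))
        .map(|m| Resource { id: m.id, name: m.name.clone() })
        .collect()
}

/// silent-monitors: no @-mention anywhere in the message.
pub fn silent_monitors(monitors: &[Monitor]) -> Finding {
    Finding {
        check: "silent-monitors",
        severity: Severity::Critical,
        resources: collect(monitors, |m| !has_mention(&m.message)),
    }
}

/// muted-forgotten: muted indefinitely, or silence expiring more than 30 days after `now`.
pub fn muted_forgotten(monitors: &[Monitor], now: i64) -> Finding {
    Finding {
        check: "muted-forgotten",
        severity: Severity::Warning,
        resources: collect(monitors, |m| match m.mute {
            Mute::NotMuted => false,
            Mute::Indefinite => true,
            Mute::Until(expiry) => expiry - now > THIRTY_DAYS_SECS,
        }),
    }
}
```

Both functions take the same slice of monitors, which mirrors the commit's point that every check here runs off a single monitors list response.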
…th checks

Three new checks, all running against the existing single monitors list call:
- untagged-monitors (WARNING): monitors with no tags, which can't be filtered, routed, or grouped by team/service/env
- no-recovery-threshold (INFO): monitors with a critical threshold but no critical_recovery threshold; without hysteresis the monitor can flap at the alert boundary
- on-call-health (WARNING): monitors currently in ALERT state with a renotify_interval between 1–60 minutes, actively re-paging on-call responders and contributing to notification fatigue

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
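The three checks in this commit are simple predicates over fields that are already present in the monitors list response. A minimal sketch follows, assuming a hypothetical `MonitorConfig` struct; the field names echo the Datadog option names quoted in the commit, but the struct and the helper functions are illustrative, not the PR's code.

```rust
/// Hypothetical subset of the monitor fields these checks need.
#[derive(Debug, Clone, Default)]
pub struct MonitorConfig {
    pub tags: Vec<String>,
    pub critical: Option<f64>,
    pub critical_recovery: Option<f64>,
    /// Minutes between re-notifications; None or 0 is treated as disabled (assumption).
    pub renotify_interval: Option<u32>,
    /// Whether the monitor is currently in ALERT state.
    pub alerting: bool,
}

/// untagged-monitors: nothing to filter, route, or group by.
pub fn is_untagged(m: &MonitorConfig) -> bool {
    m.tags.is_empty()
}

/// no-recovery-threshold: a critical threshold with no critical_recovery,
/// so there is no hysteresis band around the alert boundary.
pub fn lacks_recovery_threshold(m: &MonitorConfig) -> bool {
    m.critical.is_some() && m.critical_recovery.is_none()
}

/// on-call-health: currently alerting and re-notifying every 1 to 60 minutes.
pub fn is_repaging(m: &MonitorConfig) -> bool {
    m.alerting && matches!(m.renotify_interval, Some(1..=60))
}

/// Names of the checks this monitor fails, in a fixed order.
pub fn failing_checks(m: &MonitorConfig) -> Vec<&'static str> {
    let mut failed = Vec::new();
    if is_untagged(m) {
        failed.push("untagged-monitors");
    }
    if lacks_recovery_threshold(m) {
        failed.push("no-recovery-threshold");
    }
    if is_repaging(m) {
        failed.push("on-call-health");
    }
    failed
}
```

Keeping each check as a pure function of one monitor makes it cheap to add more checks without further API calls.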
…tify-interval

pager-burden (WARNING): Surfaces monitors currently in ALERT state that are actively routing through high-impact pager tools: Datadog On-Call (@oncall-*), PagerDuty (@PagerDuty*), OpsGenie (@opsgenie-*), VictorOps (@victorops-*). Pager-tool entries sort before Slack/webhook re-notifiers. Detail shows the specific tool and handle, plus renotify_interval if set. Also catches non-pager monitors with renotify_interval ≤60 min that are currently alerting (actively notifying frequently).

fast-renotify-interval (INFO, renamed from on-call-health): Now a pure configuration audit that flags any monitor with renotify_interval configured between 1–60 minutes, regardless of current alert state. Complements pager-burden: "this monitor will spam on-call if it fires" vs "this monitor is paging on-call right now".

Also adds unit tests for extract_handles and classify_handle.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
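The handle extraction and classification described above could look roughly like the sketch below. The function names `extract_handles` and `classify_handle` are taken from the commit message, but their signatures, the `PagerTool` enum, and the tokenizing rules are assumptions; the PR's implementation may differ, for example in how it treats handles wrapped in punctuation.

```rust
/// Pager tools named in the PR (the enum itself is hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PagerTool {
    DatadogOnCall,
    PagerDuty,
    OpsGenie,
    VictorOps,
}

/// Pulls `@handle` tokens out of a monitor message.
/// Assumption: handles are whitespace-delimited and start the token, so a
/// handle wrapped in leading punctuation such as `(@oncall-x)` is missed.
pub fn extract_handles(message: &str) -> Vec<String> {
    message
        .split_whitespace()
        .filter_map(|tok| tok.strip_prefix('@'))
        // Drop trailing punctuation such as "," or "." after the handle.
        .map(|rest| rest.trim_end_matches(|c: char| !(c.is_alphanumeric() || c == '-' || c == '_')))
        .filter(|handle| !handle.is_empty())
        .map(|handle| format!("@{}", handle))
        .collect()
}

/// Maps a handle to a pager tool by prefix, ignoring ASCII case.
/// Returns None for non-pager handles such as Slack channels.
pub fn classify_handle(handle: &str) -> Option<PagerTool> {
    let h = handle.to_ascii_lowercase();
    if h.starts_with("@oncall-") {
        Some(PagerTool::DatadogOnCall)
    } else if h.starts_with("@pagerduty") {
        Some(PagerTool::PagerDuty)
    } else if h.starts_with("@opsgenie-") {
        Some(PagerTool::OpsGenie)
    } else if h.starts_with("@victorops-") {
        Some(PagerTool::VictorOps)
    } else {
        None
    }
}
```

Matching case-insensitively is a choice made here because the PR text spells the PagerDuty prefix both as `@PagerDuty*` and `@pagerduty*`.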
…ributors

Previously pager-burden only looked at current alert state. Now it:
1. Fetches the last 30 days of monitor alert events via /api/v1/events, counting ERROR and WARNING events per monitor_id.
2. Surfaces monitors with pager-tool handles (@oncall-*, @PagerDuty*, @opsgenie-*, @victorops-*) that have fired recently OR are currently alerting. Quiet monitors with pager handles are not flagged.
3. Sorts by page count descending; currently-alerting monitors float to the top within the same count bucket, since they're actively paging right now.
4. Detail includes page count with lookback window, pager tool + handle, renotify_interval if set, and team tag if present: "23 pages (30d), currently alerting via PagerDuty (@pagerduty-payments), re-notifying every 15 min [team:platform]"

The Events API call uses the raw JSON path (crate::api::get) since the typed Event model doesn't expose monitor_id; it lands in additional_properties. Degrades silently on API failure so the check doesn't block the rest of pup vet.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
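Steps 1 to 3 above amount to a count-then-rank pass, sketched below under stated assumptions. `AlertEvent`, `PagerEntry`, `count_pages`, and `rank` are hypothetical names; the sketch takes already-parsed events as input and does not show the raw JSON handling or the `crate::api::get` call.

```rust
use std::cmp::Reverse;
use std::collections::HashMap;

/// Hypothetical parsed event: just the two fields the count needs.
#[derive(Debug, Clone)]
pub struct AlertEvent {
    pub monitor_id: i64,
    pub alert_type: String,
}

/// Hypothetical per-monitor row for monitors that carry a pager handle.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PagerEntry {
    pub monitor_id: i64,
    pub pages: u32,
    pub alerting: bool,
}

/// Counts error and warning events per monitor, ignoring other alert types.
pub fn count_pages(events: &[AlertEvent]) -> HashMap<i64, u32> {
    let mut counts = HashMap::new();
    for ev in events {
        let is_page = ev.alert_type.eq_ignore_ascii_case("error")
            || ev.alert_type.eq_ignore_ascii_case("warning");
        if is_page {
            *counts.entry(ev.monitor_id).or_insert(0) += 1;
        }
    }
    counts
}

/// Drops quiet monitors, then sorts by page count descending with
/// currently-alerting monitors first inside each count bucket.
pub fn rank(mut entries: Vec<PagerEntry>) -> Vec<PagerEntry> {
    entries.retain(|e| e.pages > 0 || e.alerting);
    // Reverse(bool) puts `true` first because true > false.
    entries.sort_by_key(|e| (Reverse(e.pages), Reverse(e.alerting)));
    entries
}
```

If the Events API call fails, passing an empty event slice leaves every count at zero, so only currently-alerting monitors survive `rank`. That is one way to get the silent degradation the commit describes.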
Summary
Implementation of `pup vet` from discussion #132: a health check command that surfaces universally broken Datadog configurations without assuming org-specific structure. A single `monitors list` API call powers all checks except `pager-burden`, which adds a secondary Events API call for 30-day alert history.

Checks
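- `silent-monitors` (CRITICAL): no `@-mention` in the message, so alerts fire into the void
- `stale-monitors` (WARNING): currently in "No Data" state
- `muted-forgotten` (WARNING): muted indefinitely or with a silence expiry >30 days out
- `untagged-monitors` (WARNING): no tags
- `no-recovery-threshold` (INFO): critical threshold without `critical_recovery`, a flapping risk
- `fast-renotify-interval` (INFO): `renotify_interval` ≤60 min, will spam on-call if fired
- `pager-burden` (WARNING): monitors with pager-tool handles that fired in the last 30 days or are currently alerting

Usage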
Human output example:
Agent mode returns a structured JSON envelope with `next_action` guidance when criticals are present.

Architecture
`pager-burden` detail:
- Extracts `@`-handles from monitor messages and classifies them by tool (DD On-Call `@oncall-*`, PagerDuty `@pagerduty*`, OpsGenie `@opsgenie-*`, VictorOps `@victorops-*`)
- Fetches 30-day alert history from `/api/v1/events` (raw JSON, since the typed `Event` model omits `monitor_id`)

SDK dependency
This PR patches `datadog-api-client` via `[patch.crates-io]` pointing at a local fix branch: DataDog/datadog-api-client-rust#1292.
The `MonitorThresholds` deserializer panics with `invalid type: string "", expected f64` when the Datadog API returns `""` instead of `null` for unset threshold values (service-check and composite monitors). Without the patch, `pup vet` and all other monitor commands fail for any org that has one of these monitors.
The `Cargo.toml` patch entry has a `TODO: remove` comment so it's easy to clean up once the upstream fix is released.

Open questions / next steps
- `stale-monitors`: currently catches the "No Data" snapshot; consider a `--stale-days` flag using `monitor.modified` as a proxy for duration
- `pager-burden` team grouping: resources are a flat sorted list; human output could group by `team:` tag for readability
- `@team-<handle>` routes through DD On-Call only if On-Call is configured for that team, which is currently undetectable from the monitor message alone

Closes #132
🤖 Generated with Claude Code