feat(metrics): MTTR Phase 1 — pairing failure→success in eng_deployments (FDD-DSH-050) #11

Open
nascimentolimaandre-cloud wants to merge 1 commit into main from feat/fdd-dsh-050-mttr-phase-1

Conversation

@nascimentolimaandre-cloud
Owner

Summary

  • Resolves INC-005 (MTTR always null). DORA's 4th metric now has a source: it is derived purely from data we already collect, with no new ingestion process.
  • Pairs each prod FAILURE deploy with the next SUCCESS on the same (repo, environment) within a 7d open-incident window. Failure row owns recovery_time_hours (anchor model, idempotent).
  • Adds DORA-canonical filters to calculate_mttr: 5-min flaky filter (Jenkins re-trigger guard) + minimum sample n ≥ 5.

What ships

| Layer | Change |
| --- | --- |
| Schema | Migration 013_mttr_incident_pairing: 3 columns on eng_deployments (recovered_by_deploy_id, superseded_by_deploy_id, incident_status) + CHECK + 2 partial indexes |
| Pairing | services/backfill_mttr.py: single CTE walks back to chain anchor, LATERAL joins next success in window, classifies resolved/open/superseded |
| Calc | domain/dora.calculate_mttr: flaky filter (≥ 5 min) + sample guard (n ≥ 5); DoraMetrics exposes mttr_incident_count + mttr_open_incident_count |
| Forward-hook | _sync_deployments calls pair_recent_incidents after upsert (non-fatal) |
| Admin | POST /data/v1/admin/deployments/refresh-mttr (X-Admin-Token, scope all/stale/last-90d, dry_run) |
| Tests | 16 unit tests across 5 classes (median, sample guard, flaky filter, open incidents, counts integration, anti-surveillance source-grep) |
| Docs | docs/fdd/FDD-DSH-050-mttr-design.md consolidates data-scientist + data-engineer specs; INC-005 and dashboard backlog updated |

Anti-surveillance

Structural test (test_calculate_mttr_only_reads_aggregable_fields) source-greps calculate_mttr body and fails the build if the function ever references author, assignee, reporter, user, or committer. MTTR operates only on (repo, environment, timestamps, is_failure) tuples.
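A minimal sketch of what such a source-grep guard could look like, assuming a plain case-insensitive substring match over the function's source text. The helper name `forbidden_refs` is hypothetical and is not the test shipped in this PR; in a real test the source string would presumably come from something like `inspect.getsource(calculate_mttr)`.

```python
# Hypothetical helper: scans source text for person-identifying references.
FORBIDDEN = ("author", "assignee", "reporter", "user", "committer")


def forbidden_refs(source: str) -> set:
    """Return the forbidden tokens that appear anywhere in `source`.

    Substring matching is deliberately strict: `author_id` and
    `authorize` both trip the guard.
    """
    lowered = source.lower()
    return {token for token in FORBIDDEN if token in lowered}


clean = "def mttr(rows):\n    return [r.recovery_time_hours for r in rows]\n"
leaky = "def mttr(rows):\n    return [(r.author_id, r.recovery_time_hours) for r in rows]\n"

assert forbidden_refs(clean) == set()
assert forbidden_refs(leaky) == {"author"}
```

Substring matching errs toward false positives, which seems the safer failure mode for a privacy guard: a rename is cheap, a leaked per-person metric is not.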

Live smoke (Webmotors, 2026-04-29)

255 prod failures classified in 1.14s
   84 → resolved
  148 → superseded (back-to-back chains)
   23 → open

After flaky filter (recovery_time_hours ≥ 5 min):
   73 real incidents
   P50 = 0.50h   ← Elite (DORA 2023)
   P90 = 16.58h

Phase 2 — deferred backlog

Phase 1 lands at the DevLake-equivalent tier with extra safeguards. Phase 2 (per user decision) goes to backlog:

  1. Jira Bug / Incident overlay (depends on INC-026/INC-027)
  2. GitHub label enrichment (hotfix, revert, P0, P1)
  3. PagerDuty / Opsgenie webhooks
  4. Per-team MTTR breakdown (follows FDD-DSH-060)
  5. open_window_days configurable per team

Test plan

  • pytest tests/unit/test_mttr_calculation.py -q → 16/16 passing
  • Full unit regression → 182/182 passing
  • Live admin smoke against Webmotors data → 255 failures classified, P50=0.50h verified
  • Migration applied to live DB; ORM↔DB schema drift guard passing
  • Frontend follow-up (separate PR): remove pendingLabel="R1" on the MTTR card, render n_resolved / n_open next to value (see §13 of design doc)

🤖 Generated with Claude Code

…nts (FDD-DSH-050)

Resolves INC-005 (MTTR always null). DORA's 4th metric now has a source
without adding a new ingestion process — pairing is derived from data we
already collect (Jenkins prod deploys, INC-008 environment filter).

Schema (migration 013_mttr_incident_pairing):
- eng_deployments + recovered_by_deploy_id (UUID FK self, ON DELETE SET NULL)
- eng_deployments + superseded_by_deploy_id (UUID FK self)
- eng_deployments + incident_status VARCHAR(16) with CHECK
  ('open' | 'resolved' | 'superseded' | NULL)
- 2 partial indexes (mttr_pairing on prod, open_incidents)
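
For illustration only, the shape of these columns can be reproduced with Python's built-in `sqlite3` as a stand-in. The real migration targets the project's own database and UUID types, so the table layout, the `finished_at` column, the `'prod'` literal, and both index names below are assumptions.

```python
import sqlite3

# SQLite stand-in for the three new columns, the CHECK, and two partial indexes.
DDL = """
CREATE TABLE eng_deployments (
    id TEXT PRIMARY KEY,
    repo TEXT NOT NULL,
    environment TEXT NOT NULL,
    finished_at TEXT NOT NULL,  -- hypothetical timestamp column
    is_failure INTEGER NOT NULL,
    recovered_by_deploy_id TEXT REFERENCES eng_deployments(id) ON DELETE SET NULL,
    superseded_by_deploy_id TEXT REFERENCES eng_deployments(id),
    incident_status TEXT
        CHECK (incident_status IN ('open', 'resolved', 'superseded'))
);
CREATE INDEX ix_mttr_pairing
    ON eng_deployments (repo, environment, finished_at)
    WHERE environment = 'prod';
CREATE INDEX ix_open_incidents
    ON eng_deployments (repo, environment)
    WHERE incident_status = 'open';
"""


def make_db():
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(DDL)
    return conn


def insert_deploy(conn, deploy_id, status):
    """Insert one failed prod deploy with the given incident_status."""
    conn.execute(
        "INSERT INTO eng_deployments "
        "(id, repo, environment, finished_at, is_failure, incident_status) "
        "VALUES (?, 'api', 'prod', '2026-04-29T00:00:00', 1, ?)",
        (deploy_id, status),
    )
```

A NULL `incident_status` passes the CHECK because the `IN` expression evaluates to NULL rather than false, which matches the `| NULL` alternative listed above.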

Pairing (services/backfill_mttr.py):
- CTE classifies every prod failure: walk back to find chain anchor
  (LAG + correlated subquery), LATERAL join to next success in window.
- Anchor model: failure row owns recovery_time_hours; success stays
  untouched (single-row update site, idempotent).
- Back-to-back failures → 'superseded' pointing at first-in-chain.
- No recovery in 7d window → 'open'.
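
The classification rules above can be sketched in memory as follows. This is not the SQL CTE from services/backfill_mttr.py; it is a plain-Python rendering of the same rules, and the `Deploy` dataclass, its `finished_at` field, and the function name `pair_incidents` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from itertools import groupby
from typing import List, Optional

OPEN_WINDOW = timedelta(days=7)  # the 7d open-incident window


@dataclass
class Deploy:
    id: str
    repo: str
    environment: str
    finished_at: datetime  # hypothetical timestamp field
    is_failure: bool
    incident_status: Optional[str] = None
    recovered_by_deploy_id: Optional[str] = None
    superseded_by_deploy_id: Optional[str] = None
    recovery_time_hours: Optional[float] = None


def pair_incidents(deploys: List[Deploy]) -> List[Deploy]:
    """Classify failures as resolved / open / superseded, per (repo, environment)."""
    ordered = sorted(deploys, key=lambda d: (d.repo, d.environment, d.finished_at))
    for _, group in groupby(ordered, key=lambda d: (d.repo, d.environment)):
        anchor = None  # first failure of the current chain
        for d in group:
            if d.is_failure and anchor is None:
                anchor = d
                d.incident_status = "open"  # until a success proves otherwise
            elif d.is_failure:
                d.incident_status = "superseded"
                d.superseded_by_deploy_id = anchor.id
            elif anchor is not None:
                gap = d.finished_at - anchor.finished_at
                if gap <= OPEN_WINDOW:
                    # Only the failure row is written; the success stays untouched.
                    anchor.incident_status = "resolved"
                    anchor.recovered_by_deploy_id = d.id
                    anchor.recovery_time_hours = gap.total_seconds() / 3600
                anchor = None  # any success ends the chain
    return deploys
```

Because only failure rows are written and every field is recomputed from timestamps, running the function twice over the same input yields the same result, which is the idempotency property the anchor model relies on.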

Calculation (domain/dora.calculate_mttr):
- Flaky filter: recovery_time_hours >= 5/60h (5 min) — Jenkins re-trigger
  shouldn't deflate the median.
- Sample-size guard: n >= 5 resolved incidents required, else None.
- DoraMetrics exposes mttr_incident_count + mttr_open_incident_count
  for UI context ("P50 = 0.5h, n=73 resolved, 3 open").
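
A minimal sketch of the two guards, assuming the input is a flat list of `recovery_time_hours` values for resolved incidents. The function name `mttr_p50_hours` and its signature are hypothetical and do not mirror the real `calculate_mttr`.

```python
from statistics import median
from typing import Iterable, Optional

FLAKY_THRESHOLD_HOURS = 5 / 60  # 5 minutes, expressed in hours
MIN_SAMPLE = 5                  # n >= 5 resolved incidents required


def mttr_p50_hours(recovery_hours: Iterable[Optional[float]]) -> Optional[float]:
    """Median recovery time after the flaky filter, or None if the sample is too small."""
    real = [
        h for h in recovery_hours
        if h is not None and h >= FLAKY_THRESHOLD_HOURS
    ]
    if len(real) < MIN_SAMPLE:
        return None
    return median(real)


# One sub-5-minute re-trigger is dropped; five real incidents remain.
assert mttr_p50_hours([0.01, 0.5, 1.0, 2.0, 3.0, 4.0]) == 2.0
# Only four real incidents: below the sample guard, so no value is reported.
assert mttr_p50_hours([0.5, 1.0, 2.0, 3.0]) is None
```

Applying the flaky filter before the sample guard means near-instant re-triggers cannot help a thin dataset clear the n ≥ 5 threshold.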

Forward-hook (workers/devlake_sync._sync_deployments):
- Calls pair_recent_incidents(since=now()) after upsert; non-fatal.
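
"Non-fatal" here presumably means that a pairing error is logged and swallowed so the deployment sync itself still succeeds. A sketch under that assumption, with a hypothetical wrapper `run_non_fatal`:

```python
import logging

logger = logging.getLogger("mttr_hook_sketch")  # hypothetical logger name


def run_non_fatal(hook, *args, **kwargs):
    """Run `hook`; on any exception, log it and return None instead of raising."""
    try:
        return hook(*args, **kwargs)
    except Exception:
        logger.exception("post-upsert hook failed; continuing sync")
        return None


def broken_hook():
    raise RuntimeError("pairing query timed out")


assert run_non_fatal(broken_hook) is None   # failure is swallowed
assert run_non_fatal(lambda: 3) == 3        # success passes through
```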

Admin endpoint:
- POST /data/v1/admin/deployments/refresh-mttr (X-Admin-Token).
  scope: 'all' | 'stale' | 'last-90d'; dry_run; max_failures.
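
A hedged sketch of building such a call with the standard library. The path, header name, and the three parameters come from this commit; sending the parameters as a JSON body, the base URL, and the helper name are assumptions, and the request is only constructed, not sent.

```python
import json
import urllib.request

VALID_SCOPES = ("all", "stale", "last-90d")


def build_refresh_mttr_request(base_url, admin_token, scope="stale",
                               dry_run=True, max_failures=None):
    """Build (but do not send) a POST to the refresh-mttr admin endpoint."""
    if scope not in VALID_SCOPES:
        raise ValueError(f"scope must be one of {VALID_SCOPES}")
    payload = {"scope": scope, "dry_run": dry_run}
    if max_failures is not None:
        payload["max_failures"] = max_failures
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/data/v1/admin/deployments/refresh-mttr",
        data=json.dumps(payload).encode("utf-8"),  # assumption: JSON body
        headers={"X-Admin-Token": admin_token, "Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host and token; pass the result to urllib.request.urlopen to send it.
req = build_refresh_mttr_request("http://localhost:8000", "example-token")
assert req.get_method() == "POST"
```

Defaulting `dry_run` to True in the helper keeps an accidental invocation from rewriting incident rows.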

Anti-surveillance:
- Structural test source-greps calculate_mttr body for forbidden refs
  (author/assignee/reporter/user/committer). Fails build on regression.

Live Webmotors smoke (2026-04-29):
- 255 prod failures classified in 1.14s
  → 84 resolved + 148 superseded + 23 open
- After flaky filter: 73 real incidents, P50=0.50h (Elite), P90=16.58h.

Tests: 16 new unit tests (median, sample guard, flaky filter, open
incidents, DoraMetrics counts integration, anti-surveillance source-grep).
Full regression: 182/182 unit tests passing.

Phase 2 backlog (deferred): Jira Bug overlay (depends on INC-026/INC-027),
GitHub label enrichment, PagerDuty/Opsgenie webhooks, per-team MTTR
breakdown (follows FDD-DSH-060), open_window_days configurable per team.

See docs/fdd/FDD-DSH-050-mttr-design.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>