feat(metrics): MTTR Phase 1 — pairing failure→success in eng_deployments (FDD-DSH-050) by nascimentolimaandre-cloud · Pull Request #11 · nascimentolimaandre-cloud/pulse

nascimentolimaandre-cloud · 2026-05-01T00:20:41Z

Summary

Resolves INC-005 (MTTR always null). DORA's 4th metric now has a source: it is derived purely from data we already collect, with no new ingestion process.
Pairs each prod FAILURE deploy with the next SUCCESS on the same (repo, environment) within a 7d open-incident window. Failure row owns recovery_time_hours (anchor model, idempotent).
Adds DORA-canonical filters to calculate_mttr: 5-min flaky filter (Jenkins re-trigger guard) + minimum sample n ≥ 5.
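The pairing rule described above can be sketched in plain Python. This is an illustrative in-memory version, not the CTE in `services/backfill_mttr.py`: the `Deploy` dataclass, the `finished_at` field, and the `pair_incidents` name are assumptions made for the example, as is the choice to leave an anchor `open` when the first success arrives after the 7d window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

OPEN_WINDOW = timedelta(days=7)  # the 7d open-incident window from the summary


@dataclass
class Deploy:
    # Hypothetical row shape; only the three pairing columns and
    # recovery_time_hours are named in the PR.
    id: str
    repo: str
    environment: str
    finished_at: datetime
    is_failure: bool
    incident_status: Optional[str] = None
    recovered_by_deploy_id: Optional[str] = None
    superseded_by_deploy_id: Optional[str] = None
    recovery_time_hours: Optional[float] = None


def pair_incidents(deploys: list[Deploy]) -> None:
    """Classify every failure in place as resolved, open, or superseded."""
    groups: dict[tuple[str, str], list[Deploy]] = {}
    for d in deploys:
        groups.setdefault((d.repo, d.environment), []).append(d)

    for rows in groups.values():
        rows.sort(key=lambda d: d.finished_at)
        anchor: Optional[Deploy] = None  # first failure of the current chain
        for d in rows:
            if d.is_failure:
                # Reset derived fields so a re-run yields the same result.
                d.recovered_by_deploy_id = None
                d.superseded_by_deploy_id = None
                d.recovery_time_hours = None
                if anchor is None:
                    anchor = d
                    d.incident_status = "open"  # until a success shows up
                else:
                    d.incident_status = "superseded"
                    d.superseded_by_deploy_id = anchor.id
            elif anchor is not None:
                gap = d.finished_at - anchor.finished_at
                if gap <= OPEN_WINDOW:
                    # Only the failure (anchor) row is written; the success
                    # row stays untouched.
                    anchor.incident_status = "resolved"
                    anchor.recovered_by_deploy_id = d.id
                    anchor.recovery_time_hours = gap.total_seconds() / 3600
                anchor = None  # assumption: any success ends the chain
```

Because only failure rows are written and their derived fields are reset first, running the function twice over the same input gives the same classification, which is the idempotency property the anchor model is meant to provide.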

What ships

| Layer | Change |
|---|---|
| Schema | Migration `013_mttr_incident_pairing`: 3 columns on `eng_deployments` (`recovered_by_deploy_id`, `superseded_by_deploy_id`, `incident_status`) + CHECK + 2 partial indexes |
| Pairing | `services/backfill_mttr.py`: single CTE walks back to chain anchor, LATERAL joins next success in window, classifies `resolved`/`open`/`superseded` |
| Calc | `domain/dora.calculate_mttr`: flaky filter (≥ 5 min) + sample guard (`n ≥ 5`); `DoraMetrics` exposes `mttr_incident_count` + `mttr_open_incident_count` |
| Forward-hook | `_sync_deployments` calls `pair_recent_incidents` after upsert (non-fatal) |
| Admin | `POST /data/v1/admin/deployments/refresh-mttr` (X-Admin-Token, scope `all`/`stale`/`last-90d`, dry_run) |
| Tests | 16 unit tests across 5 classes (median, sample guard, flaky filter, open incidents, counts integration, anti-surveillance source-grep) |
| Docs | `docs/fdd/FDD-DSH-050-mttr-design.md` consolidates data-scientist + data-engineer specs; INC-005 and dashboard backlog updated |
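The two filters in the Calc row can be illustrated with a short sketch. It is not the body of `domain/dora.calculate_mttr`: the function name, the constants, and the input shape (a flat list of `recovery_time_hours` values) are assumptions, and so is the ordering, which applies the sample guard to the incidents that survive the flaky filter.

```python
from statistics import median
from typing import Optional, Sequence

FLAKY_THRESHOLD_HOURS = 5 / 60  # 5 minutes, expressed in hours
MIN_SAMPLE = 5                  # n >= 5 resolved incidents required


def calculate_mttr_sketch(
    recovery_hours: Sequence[Optional[float]],
) -> Optional[float]:
    """Return the median recovery time in hours, or None if the sample is too small."""
    # Drop open/superseded rows (no recovery time) and recoveries faster than
    # 5 minutes, which are treated as Jenkins re-triggers rather than incidents.
    real = [
        h for h in recovery_hours
        if h is not None and h >= FLAKY_THRESHOLD_HOURS
    ]
    if len(real) < MIN_SAMPLE:
        return None
    return median(real)
```

Returning `None` instead of a number for small samples keeps the dashboard from showing a median computed from a handful of incidents.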

Anti-surveillance

Structural test (test_calculate_mttr_only_reads_aggregable_fields) source-greps calculate_mttr body and fails the build if the function ever references author, assignee, reporter, user, or committer. MTTR operates only on (repo, environment, timestamps, is_failure) tuples.
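A structural guard of this kind might look like the sketch below. The helper names are hypothetical and the matching rule (case-insensitive substring search) is an assumption; the real test may match differently. `inspect.getsource` is the standard-library call for reading a function's source text.

```python
import inspect
from typing import Callable

# Identifiers the PR says must never appear in calculate_mttr.
FORBIDDEN = ("author", "assignee", "reporter", "user", "committer")


def forbidden_refs(source: str) -> list[str]:
    """Return every forbidden word found in the given source text."""
    lowered = source.lower()
    return [word for word in FORBIDDEN if word in lowered]


def assert_only_aggregable_fields(func: Callable) -> None:
    """Fail if the function's source mentions any person-level field."""
    hits = forbidden_refs(inspect.getsource(func))
    assert not hits, f"{func.__name__} references person-level fields: {hits}"
```

In a pytest module this would be called as `assert_only_aggregable_fields(calculate_mttr)`. A substring match is deliberately strict: it also trips on comments and on names such as `user_id`.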

Live smoke (Webmotors, 2026-04-29)

255 prod failures classified in 1.14s
   84 → resolved
  148 → superseded (back-to-back chains)
   23 → open

After flaky filter (recovery_time_hours ≥ 5 min):
   73 real incidents
   P50 = 0.50h   ← Elite (DORA 2023)
   P90 = 16.58h
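The counts above are internally consistent, as a quick check shows. The sketch assumes the 73 post-filter incidents are drawn from the 84 resolved rows, since only resolved failures carry a `recovery_time_hours` value.

```python
resolved, superseded, still_open = 84, 148, 23

# The three classes should account for every classified failure.
total = resolved + superseded + still_open

# Resolved incidents removed by the 5-minute flaky filter.
after_flaky_filter = 73
dropped_as_flaky = resolved - after_flaky_filter

print(total, dropped_as_flaky)
```

Under that assumption, 11 of the 84 resolved pairs recovered in under 5 minutes and were excluded from the P50 and P90.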

Phase 2 — deferred backlog

Phase 1 lands at the DevLake-equivalent tier with extra safeguards. Phase 2 (per user decision) goes to backlog:

Jira Bug / Incident overlay (depends on INC-026/INC-027)
GitHub label enrichment (hotfix, revert, P0, P1)
PagerDuty / Opsgenie webhooks
Per-team MTTR breakdown (follows FDD-DSH-060)
open_window_days configurable per team

Test plan

pytest tests/unit/test_mttr_calculation.py -q → 16/16 passing
Full unit regression → 182/182 passing
Live admin smoke against Webmotors data → 255 failures classified, P50=0.50h verified
Migration applied to live DB; ORM↔DB schema drift guard passing
Frontend follow-up (separate PR): remove pendingLabel="R1" on the MTTR card, render n_resolved / n_open next to value (see §13 of design doc)
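For reproducing the admin smoke, a request to the refresh endpoint could be built as follows. The path, the `X-Admin-Token` header, and the `scope` values come from this PR; the JSON body shape, the base URL, and the helper name are assumptions, so check the endpoint's actual contract before relying on this.

```python
import json
import urllib.request

VALID_SCOPES = {"all", "stale", "last-90d"}


def build_refresh_request(
    base_url: str,
    admin_token: str,
    scope: str = "stale",
    dry_run: bool = True,
) -> urllib.request.Request:
    """Build (but do not send) a POST to the refresh-mttr admin endpoint."""
    if scope not in VALID_SCOPES:
        raise ValueError(f"scope must be one of {sorted(VALID_SCOPES)}")
    # Assumption: parameters travel as a JSON body rather than a query string.
    body = json.dumps({"scope": scope, "dry_run": dry_run}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/data/v1/admin/deployments/refresh-mttr",
        data=body,
        method="POST",
        headers={
            "X-Admin-Token": admin_token,
            "Content-Type": "application/json",
        },
    )
```

Passing the result to `urllib.request.urlopen` sends it. Defaulting `dry_run` to `True` means a first call reports what would change without writing to `eng_deployments`.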

🤖 Generated with Claude Code

…nts (FDD-DSH-050)

Resolves INC-005 (MTTR always null). DORA's 4th metric now has a source without adding a new ingestion process: pairing is derived from data we already collect (Jenkins prod deploys, INC-008 environment filter).

Schema (migration 013_mttr_incident_pairing):
- eng_deployments + recovered_by_deploy_id (UUID FK self, ON DELETE SET NULL)
- eng_deployments + superseded_by_deploy_id (UUID FK self)
- eng_deployments + incident_status VARCHAR(16) with CHECK ('open' | 'resolved' | 'superseded' | NULL)
- 2 partial indexes (mttr_pairing on prod, open_incidents)

Pairing (services/backfill_mttr.py):
- CTE classifies every prod failure: walk back to find chain anchor (LAG + correlated subquery), LATERAL join to next success in window.
- Anchor model: failure row owns recovery_time_hours; success stays untouched (single-row update site, idempotent).
- Back-to-back failures → 'superseded' pointing at first-in-chain.
- No recovery in 7d window → 'open'.

Calculation (domain/dora.calculate_mttr):
- Flaky filter: recovery_time_hours >= 5/60h (5 min), so a Jenkins re-trigger doesn't deflate the median.
- Sample-size guard: n >= 5 resolved incidents required, else None.
- DoraMetrics exposes mttr_incident_count + mttr_open_incident_count for UI context ("P50 = 0.5h, n=73 resolved, 3 open").

Forward-hook (workers/devlake_sync._sync_deployments):
- Calls pair_recent_incidents(since=now()) after upsert; non-fatal.

Admin endpoint:
- POST /data/v1/admin/deployments/refresh-mttr (X-Admin-Token). scope: 'all' | 'stale' | 'last-90d'; dry_run; max_failures.

Anti-surveillance:
- Structural test source-greps calculate_mttr body for forbidden refs (author/assignee/reporter/user/committer). Fails build on regression.

Live Webmotors smoke (2026-04-29):
- 255 prod failures classified in 1.14s → 84 resolved + 148 superseded + 23 open
- After flaky filter: 73 real incidents, P50=0.50h (Elite), P90=16.58h.

Tests: 16 new unit tests (median, sample guard, flaky filter, open incidents, DoraMetrics counts integration, anti-surveillance source-grep). Full regression: 182/182 unit tests passing.

Phase 2 backlog (deferred): Jira Bug overlay (depends on INC-026/INC-027), GitHub label enrichment, PagerDuty/Opsgenie webhooks, per-team MTTR breakdown (follows FDD-DSH-060), open_window_days configurable per team.

See docs/fdd/FDD-DSH-050-mttr-design.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

nascimentolimaandre-cloud wants to merge 1 commit into main from feat/fdd-dsh-050-mttr-phase-1
