feat(metrics): MTTR Phase 1 — pairing failure→success in eng_deployments (FDD-DSH-050)#11
Open
nascimentolimaandre-cloud wants to merge 1 commit into main
Resolves INC-005 (MTTR always null). DORA's 4th metric now has a source
without adding a new ingestion process — pairing is derived from data we
already collect (Jenkins prod deploys, INC-008 environment filter).
Schema (migration 013_mttr_incident_pairing):
- eng_deployments + recovered_by_deploy_id (UUID FK self, ON DELETE SET NULL)
- eng_deployments + superseded_by_deploy_id (UUID FK self)
- eng_deployments + incident_status VARCHAR(16) with CHECK
('open' | 'resolved' | 'superseded' | NULL)
- 2 partial indexes (mttr_pairing on prod, open_incidents)
Pairing (services/backfill_mttr.py):
- CTE classifies every prod failure: walk back to find chain anchor
(LAG + correlated subquery), LATERAL join to next success in window.
- Anchor model: failure row owns recovery_time_hours; success stays
untouched (single-row update site, idempotent).
- Back-to-back failures → 'superseded' pointing at first-in-chain.
- No recovery in 7d window → 'open'.
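The classification rules above can be sketched in plain Python. This is an illustration only, not the SQL CTE in services/backfill_mttr.py: the Deploy dataclass and classify function are hypothetical, the input is assumed to be the deploys of a single (repo, environment), the 7d window is assumed to be measured from the chain anchor, and any success is assumed to close the current failure chain.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

OPEN_WINDOW = timedelta(days=7)  # 7d window from the commit message


@dataclass
class Deploy:  # hypothetical in-memory stand-in for an eng_deployments row
    id: str
    at: datetime
    is_failure: bool
    incident_status: Optional[str] = None
    recovered_by_deploy_id: Optional[str] = None
    superseded_by_deploy_id: Optional[str] = None
    recovery_time_hours: Optional[float] = None


def classify(deploys: list[Deploy]) -> list[Deploy]:
    """Classify failures of ONE (repo, environment); returns rows sorted by time."""
    ordered = sorted(deploys, key=lambda d: d.at)
    anchor: Optional[Deploy] = None  # first failure of the current chain
    for d in ordered:
        if d.is_failure:
            if anchor is None:
                anchor = d
                d.incident_status = "open"  # stays open unless a success arrives in time
            else:
                d.incident_status = "superseded"
                d.superseded_by_deploy_id = anchor.id  # points at first-in-chain
        else:
            if anchor is not None and d.at - anchor.at <= OPEN_WINDOW:
                # Only the failure (anchor) row is updated; the success row is untouched.
                anchor.incident_status = "resolved"
                anchor.recovered_by_deploy_id = d.id
                anchor.recovery_time_hours = (d.at - anchor.at).total_seconds() / 3600
            anchor = None  # assumption: any success ends the chain
    return ordered
```

Running classify twice on the same input yields the same statuses, which mirrors the idempotency claim for the single-row update site.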
Calculation (domain/dora.calculate_mttr):
- Flaky filter: recovery_time_hours >= 5/60h (5 min) — Jenkins re-trigger
shouldn't deflate the median.
- Sample-size guard: n >= 5 resolved incidents required, else None.
- DoraMetrics exposes mttr_incident_count + mttr_open_incident_count
for UI context ("P50 = 0.5h, n=73 resolved, 3 open").
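A minimal sketch of the two guards described above, assuming the sample-size guard is applied after the flaky filter (the commit message does not state the order). The function name calculate_mttr_sketch and the constant names are hypothetical; the real code lives in domain/dora.calculate_mttr.

```python
from statistics import median
from typing import Optional, Sequence

FLAKY_THRESHOLD_HOURS = 5 / 60  # 5 minutes, from the commit message
MIN_SAMPLE = 5                  # n >= 5 resolved incidents


def calculate_mttr_sketch(recovery_hours: Sequence[float]) -> Optional[float]:
    """Median recovery time in hours, or None when the sample is too small."""
    # Drop near-instant recoveries (e.g. a Jenkins re-trigger) so they
    # cannot pull the median down.
    real = [h for h in recovery_hours if h >= FLAKY_THRESHOLD_HOURS]
    if len(real) < MIN_SAMPLE:
        return None
    return median(real)
```

Returning None rather than a number keeps a card with too few incidents visibly empty instead of showing a misleadingly precise value.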
Forward-hook (workers/devlake_sync._sync_deployments):
- Calls pair_recent_incidents(since=now()) after upsert; non-fatal.
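"Non-fatal" here could look like the wrapper below: a pairing error is logged and swallowed so the deployment sync itself still succeeds. The wrapper name run_pairing_hook is hypothetical, and pair_recent_incidents is passed in as a callable because its real signature beyond the since keyword is not shown in this PR.

```python
import logging
from datetime import datetime, timezone
from typing import Callable

logger = logging.getLogger(__name__)


def run_pairing_hook(pair_recent_incidents: Callable[..., object]) -> bool:
    """Run MTTR pairing after the upsert; never raise into the sync worker."""
    try:
        pair_recent_incidents(since=datetime.now(timezone.utc))
    except Exception:
        # Pairing is derived data: log the failure and let the sync finish.
        logger.exception("MTTR pairing failed; deployment sync continues")
        return False
    return True
```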
Admin endpoint:
- POST /data/v1/admin/deployments/refresh-mttr (X-Admin-Token).
scope: 'all' | 'stale' | 'last-90d'; dry_run; max_failures.
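As an illustration of the request contract listed above, the sketch below validates the three parameters and compares the X-Admin-Token header in constant time. The class name, the defaults, and the rule that max_failures must be positive are assumptions, not taken from the PR.

```python
import hmac
from dataclasses import dataclass
from typing import Optional

ALLOWED_SCOPES = ("all", "stale", "last-90d")  # from the commit message


@dataclass(frozen=True)
class RefreshMttrRequest:  # hypothetical request model
    scope: str = "stale"                 # assumed default
    dry_run: bool = False                # assumed default
    max_failures: Optional[int] = None   # assumed: None means no cap

    def __post_init__(self) -> None:
        if self.scope not in ALLOWED_SCOPES:
            raise ValueError(f"scope must be one of {ALLOWED_SCOPES}")
        if self.max_failures is not None and self.max_failures <= 0:
            raise ValueError("max_failures must be a positive integer")


def is_authorized(header_value: Optional[str], expected_token: str) -> bool:
    """Constant-time comparison of the X-Admin-Token header."""
    if not header_value:
        return False
    return hmac.compare_digest(header_value.encode(), expected_token.encode())
```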
Anti-surveillance:
- Structural test source-greps calculate_mttr body for forbidden refs
(author/assignee/reporter/user/committer). Fails build on regression.
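A source-grep guard of this kind can be sketched as follows. The helper names are hypothetical and the actual test in this PR may differ; the sketch uses a case-insensitive substring match, which is deliberately strict (it also flags names such as user_id).

```python
import inspect
from typing import Callable

FORBIDDEN = ("author", "assignee", "reporter", "user", "committer")


def forbidden_refs_in(source: str) -> list[str]:
    """Return the forbidden words that occur anywhere in the source text."""
    lowered = source.lower()
    return [word for word in FORBIDDEN if word in lowered]


def assert_no_personal_fields(func: Callable[..., object]) -> None:
    """Fail if the function's source mentions a per-person field."""
    hits = forbidden_refs_in(inspect.getsource(func))
    assert not hits, f"{func.__name__} references forbidden fields: {hits}"
```

In a test module this would be called as assert_no_personal_fields(calculate_mttr), so a regression fails the build at test time rather than in review.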
Live Webmotors smoke (2026-04-29):
- 255 prod failures classified in 1.14s
→ 84 resolved + 148 superseded + 23 open
- After flaky filter: 73 real incidents, P50=0.50h (Elite), P90=16.58h.
Tests: 16 new unit tests (median, sample guard, flaky filter, open
incidents, DoraMetrics counts integration, anti-surveillance source-grep).
Full regression: 182/182 unit tests passing.
Phase 2 backlog (deferred): Jira Bug overlay (depends on INC-026/INC-027),
GitHub label enrichment, PagerDuty/Opsgenie webhooks, per-team MTTR
breakdown (follows FDD-DSH-060), open_window_days configurable per team.
See docs/fdd/FDD-DSH-050-mttr-design.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- Resolves INC-005 (MTTR always null). DORA's 4th metric now has a source: pure derivation from data we already collect, no new ingestion process.
- Pairing is per (repo, environment) within a 7d open-incident window. The failure row owns recovery_time_hours (anchor model, idempotent).
- calculate_mttr: 5-min flaky filter (Jenkins re-trigger guard) + minimum sample n ≥ 5.
What ships
- 013_mttr_incident_pairing: 3 columns on eng_deployments (recovered_by_deploy_id, superseded_by_deploy_id, incident_status) + CHECK + 2 partial indexes
- services/backfill_mttr.py: single CTE walks back to chain anchor, LATERAL joins next success in window, classifies resolved / open / superseded
- domain/dora.calculate_mttr: flaky filter (≥ 5 min) + sample guard (n ≥ 5); DoraMetrics exposes mttr_incident_count + mttr_open_incident_count
- _sync_deployments calls pair_recent_incidents after upsert (non-fatal)
- POST /data/v1/admin/deployments/refresh-mttr (X-Admin-Token, scope all / stale / last-90d, dry_run)
- docs/fdd/FDD-DSH-050-mttr-design.md consolidates data-scientist + data-engineer specs; INC-005 and dashboard backlog updated
Anti-surveillance
Structural test (test_calculate_mttr_only_reads_aggregable_fields) source-greps the calculate_mttr body and fails the build if the function ever references author, assignee, reporter, user, or committer. MTTR operates only on (repo, environment, timestamps, is_failure) tuples.
Live smoke (Webmotors, 2026-04-29)
Phase 2 — deferred backlog
Phase 1 lands at the DevLake-equivalent tier with extra safeguards. Phase 2 (per user decision) goes to backlog:
- GitHub label enrichment (hotfix, revert, P0, P1)
- open_window_days configurable per team
Test plan
- pytest tests/unit/test_mttr_calculation.py -q → 16/16 passing
- … pendingLabel="R1" on the MTTR card, render n_resolved / n_open next to value (see §13 of design doc)
🤖 Generated with Claude Code