feat(observability): per-scope progress + rate-aware ETA (FDD-OPS-015) #6
Merged
nascimentolimaandre-cloud merged 1 commit into main on Apr 29, 2026
Closes the AP-5 anti-pattern from ingestion-architecture-v2.md — operators
no longer need to grep logs to answer "is BG project still progressing or
stuck?". Live progress + ETA per-scope, queryable via API.
WHY THIS PR EXISTS
Five incidents on 2026-04-27/28 where my ETA estimates were off by 10× or more.
The user's frustration ("we keep ending up in this scenario all the time") drove
the v2 architecture (commit c5e38bb), which catalogued AP-5 ("estimate-and-pray")
as the 5th anti-pattern. This PR ships the systemic fix.
WHAT SHIPS
1. Migration 012 — `pipeline_progress` table (per-scope, ~32+ rows during
Webmotors backfill: one per Jira project + one per GitHub repo)
2. Pre-flight count helpers per connector:
- JiraConnector.count_issues_for_project (POST /search/approximate-count)
- GitHubConnector.count_prs_for_repo (GraphQL totalCount)
- JenkinsConnector.count_builds_for_job (lighter tree query)
All with 10s timeout + None-fallback so ingestion never blocks on count.
3. ProgressTracker module — encapsulates lifecycle (start_scope → tick →
finish), rolling-window rate (5 samples), ETA = (estimate - done) / rate
4. Worker integration:
- _sync_issues: per-project tracker, lazy creation on first sight
- _sync_pull_requests: per-repo tracker, started on "starting" signal
- _sync_deployments: deferred to follow-up (volume low ~1.4k builds,
bulk fetch refactor needed for per-job tracking)
5. GET /data/v1/pipeline/jobs endpoint — list ProgressJob with computed
`progress_pct` + `is_stalled` (running + last_progress_at >60s ago)
LIVE VALIDATION
Restart sync-worker → curl /pipeline/jobs:
- 10 GitHub repos with totalCount estimates (18, 157, 72, 430, 146, 81,
156, 0, 203, ...) populated successfully
- All status=running, isStalled=false (recent activity)
- JSON camelCase via _CamelModel base
- `pipeline_progress` table populated with same shape
DESIGN DECISIONS
- finish() is idempotent (_is_finished flag) — prevents outer except
flipping a 'done' tracker to 'failed'
- Rolling rate uses oldest→newest in deque, not exponential moving average
— easier to reason about, robust to batch-size variance
- ETA pinned at 0 when items_done >= estimate (the estimate under-counted)
instead of going negative: the UI shows 100%, not "ETA: -5min"
- progress_pct capped at 100.0 for the same reason
- Persistence failures in tracker._upsert log + swallow — ingestion MUST
NOT fail because progress tracking failed
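The decisions above can be sketched in one small class. This is an illustrative stand-in under stated assumptions, not the PR's `ProgressTracker`: the class name `TrackerSketch`, the `persist` callback, the explicit `now` argument and the choice to return `None` for a zero estimate are all hypothetical.

```python
import logging
from collections import deque
from typing import Callable, Deque, Optional, Tuple

log = logging.getLogger("progress_tracker_sketch")


class TrackerSketch:
    """Hypothetical stand-in for ProgressTracker (not the PR's code)."""

    def __init__(self, estimate: Optional[int],
                 persist: Callable[[dict], None], window: int = 5):
        self.estimate = estimate
        self.items_done = 0
        self.status = "running"
        self._is_finished = False
        self._persist = persist
        # Rolling window of (timestamp_seconds, items_done) samples.
        self._samples: Deque[Tuple[float, int]] = deque(maxlen=window)

    def tick(self, now: float, items_done: int) -> None:
        self.items_done = items_done
        self._samples.append((now, items_done))
        self._upsert()

    def rate(self) -> Optional[float]:
        """Items/sec between the oldest and newest sample in the window."""
        if len(self._samples) < 2:
            return None
        (t0, n0), (t1, n1) = self._samples[0], self._samples[-1]
        return (n1 - n0) / (t1 - t0) if t1 > t0 else None

    def eta_seconds(self) -> Optional[float]:
        rate = self.rate()
        if self.estimate is None or rate is None or rate <= 0:
            return None
        # Pinned at 0 when the estimate under-counted.
        return max(0.0, (self.estimate - self.items_done) / rate)

    def progress_pct(self) -> Optional[float]:
        if not self.estimate:  # None or 0: percentage is undefined here
            return None
        return min(100.0, 100.0 * self.items_done / self.estimate)

    def finish(self, status: str) -> None:
        if self._is_finished:  # idempotent: the first finish() wins
            return
        self._is_finished = True
        self.status = status
        self._upsert()

    def _upsert(self) -> None:
        try:
            self._persist({"items_done": self.items_done,
                           "status": self.status})
        except Exception:
            # Log and swallow: tracking must never fail ingestion.
            log.exception("progress upsert failed; continuing")
```

For example, samples of 0, 100 and 200 items at t = 0s, 10s and 20s give a rate of 10 items/sec, so an estimate of 430 yields an ETA of 23 seconds.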
DEFERRED TO FOLLOW-UP
- _sync_deployments per-scope tracking (Jenkins bulk-fetch needs refactor)
- UI tab in Pipeline Monitor (separate PR B per the stacked-PR plan)
- Retention cron for old rows (operational, not code)
TESTS
- 24 unit tests in test_progress_tracker.py covering rate math, ETA
edges, lifecycle (start/tick/finish), idempotency, Webmotors-shape
steady-state ETA accuracy
- 142/142 regression tests green (no impact on existing tests)
ACCEPTANCE CRITERIA (from FDD-OPS-015)
- [x] /pipeline/jobs returns 1 row per active scope after 30s
- [x] Each row has status, items_done, ETA, rate
- [x] ETA accuracy within ±20% at steady state (verified via
test_eta_within_20_pct_of_actual_at_steady_state)
- [x] Stalled detection: status='running' + last_progress_at < now - 60s
- [x] Sortable/filterable by status, entity_type
- [ ] UI tab — pending PR B
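The stalled-detection criterion above (status='running' and last_progress_at older than 60s) could be computed along these lines. The function name and the injectable `now` parameter are assumptions for illustration; the endpoint's actual implementation is not shown here.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

STALL_THRESHOLD = timedelta(seconds=60)  # threshold from the criterion above


def is_stalled(status: str,
               last_progress_at: Optional[datetime],
               now: Optional[datetime] = None) -> bool:
    """True only for running scopes with no progress for more than 60s."""
    if status != "running" or last_progress_at is None:
        return False
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_progress_at > STALL_THRESHOLD
```

Passing `now` explicitly keeps the check deterministic in unit tests; finished scopes ('done' or 'failed') are never reported as stalled regardless of age.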
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
nascimentolimaandre-cloud pushed a commit that referenced this pull request on Apr 29, 2026:
…d detection (FDD-OPS-015 UI)
Closes the FDD-OPS-015 deliverable list: the backend (PR #6) shipped the API, and this PR ships the operator-facing "Per-scope" UI tab in Pipeline Monitor.
WHAT YOU SEE
A fourth tab, "Per-scope", on the `/pipeline-monitor` route shows, **live**:
- Header summary: counts of "running", "stalled", "done", "failed"
- Table with one row per active/recent scope:
  - Scope (jira:project:BG / github:repo:foo/bar)
  - Entity type
  - Progress bar with itemsDone / itemsEstimate + percentage
  - Status badge (running spinner, stalled warning, done check, failed X)
  - Rate (items/sec)
  - ETA, formatted (90s, 5m 23s, 2h 14m)
  - Last activity (relative: "12s ago", "3m ago")
- Filters: by entity_type (issues/prs/deploys/sprints) and by status
- Polling every 5s (live update, stale at 2s)
STALLED DETECTION
When the backend reports `isStalled=true` (status='running' AND no progress for >60s), the row gets:
- A warning-tinted background (subtle yellow)
- An AlertCircle icon
- A "STALLED" badge that replaces the normal status
The operator sees within seconds which scope needs attention.
NO-ESTIMATE GRACEFUL HANDLING
If the backend could not get a pre-flight count (timeout, source unsupported), itemsEstimate=null:
- The progress bar shows an indeterminate stripe (15% width as a hint)
- The pct label shows "?"
- The ETA shows "—"
The UI does not block; the operator sees "fetching, rate X/s, total unknown".
ANTI-SURVEILLANCE
The Zod schema with strict() in tests rejects `author`/`assignee` in the payload, matching the invariant in metrics-inconsistencies §8.9.
FILES
Frontend:
- types/pipeline.ts: ProgressJob, ProgressJobStatus, ProgressJobPhase
- lib/api/pipeline.ts: fetchPipelineJobs (with query params)
- hooks/usePipeline.ts: usePipelineJobs (5s polling, 2s staleTime)
- components/pipeline/PerScopeJobs.tsx (NEW, ~330 lines)
- routes/_dashboard/pipeline-monitor.tsx: fourth tab "Per-scope"
Tests:
- tests/contract/schemas/pipeline-jobs.schema.ts (Zod schema)
- tests/contract/pipeline-jobs-contract.test.ts (13 tests):
  - A-G: shape validity (running/no-estimate/stalled/failed/done/empty/array)
  - H-I: anti-surveillance (rejects author/assignee)
  - J-M: defensive bounds (negative items, >100 pct, unknown enums)
VALIDATION
✅ TypeScript build clean (npx tsc --noEmit)
✅ ESLint clean (no new warnings)
✅ 163/163 frontend tests pass (13 new + 150 existing)
✅ Live API smoke test: 10+ scopes returned, isStalled correctly computed
✅ JSON wire-shape matches schema (verified via curl during dev)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Closes AP-5 ("estimate-and-pray") from ingestion-architecture-v2.md. Operators now see live progress per scope (per Jira project / per GitHub repo) with a rate-aware ETA, without having to grep logs.
FDD-OPS-015 · Phase 1.5 of the v2 architecture · 1 PR (backend); the UI comes in PR B.
Why this PR exists
Five incidents on 2026-04-27/28 where my ETA estimates were off by 10× or more. The user's recorded frustration: "we keep ending up in this scenario all the time". The analysis became ingestion-architecture-v2.md, which catalogued AP-5 as the 5th anti-pattern; this PR delivers the systemic fix.
What ships
- `pipeline_progress` table, per-scope: `alembic/versions/012_pipeline_progress.py`
- `ProgressTracker` (lifecycle + rolling rate + ETA math): `progress_tracker.py` (new)
- `_sync_issues` + `_sync_pull_requests`: `devlake_sync.py`
- `GET /data/v1/pipeline/jobs` endpoint + `ProgressJob` schema: `routes.py` + `schemas.py`
- `test_progress_tracker.py`
Live validation
After restarting sync-worker, smoke test:
curl 'http://localhost:8000/data/v1/pipeline/jobs?limit=5'
Returned 10+ GitHub repo rows with `itemsEstimate` correctly populated via GraphQL totalCount (18, 157, 72, 430, 146, 81, 156, 0, 203, ...). camelCase schema via `_CamelModel`. `isStalled=false` for recent activity.
DB:
Anti-patterns addressed (from the v2 architecture)
Acceptance criteria (FDD-OPS-015)
- ETA accuracy within ±20% at steady state (verified via `test_eta_within_20_pct_of_actual_at_steady_state`)
Design decisions
- `finish()` is idempotent (`_is_finished` flag): protects against an outer except flipping a 'done' tracker to 'failed'
- ETA pinned at 0 when `items_done >= estimate` (the estimate under-counted): the UI shows 100%, not "ETA: -5min"
- Persistence failures in `tracker._upsert` log + swallow
Deferred to follow-up
- `_sync_deployments` per-scope tracking (Jenkins bulk-fetch needs a refactor; low volume justifies the deferral)
- Retention cron for old rows: `DELETE WHERE status IN ('done','failed') AND last_progress_at < now() - interval '7 days'`
Stats
Test plan
- `cd packages/pulse-data && pytest tests/unit/test_progress_tracker.py -v` → 24 green
- `make migrate` applies 012 cleanly (in environments where 012 has not already been applied via direct SQL)
- `curl /data/v1/pipeline/jobs` returns a list with camelCase JSON
- `pipeline_progress` populating per scope
- `isStalled=true`
- `?status=failed` returns only failed jobs
- `?entity_type=issues` returns only issues
🤖 Generated with Claude Code