feat(observability): per-scope progress + rate-aware ETA (FDD-OPS-015)#6

Merged
nascimentolimaandre-cloud merged 1 commit into main from feat/ops-015-observable-ingestion
Apr 29, 2026

@nascimentolimaandre-cloud
Owner

Summary

Closes AP-5 ("estimate-and-pray") from ingestion-architecture-v2.md. Operators now see live progress per scope (per Jira project / per GitHub repo) with a rate-aware ETA, without having to grep logs.

FDD-OPS-015 · Phase 1.5 of the v2 architecture · 1 PR (backend); the UI comes in PR B.

Why this PR exists

5 incidents on 2026-04-27/28 where my ETA estimates were off by 10× or more. The user's recorded frustration: "toda hora estamos caindo nesse cenário" ("we keep ending up in this scenario"). The analysis became ingestion-architecture-v2.md, which catalogued AP-5 as the 5th anti-pattern; this PR delivers the systemic fix.

What ships

| # | Item | Files |
|---|------|-------|
| 1 | Migration 012: per-scope pipeline_progress table | alembic/versions/012_pipeline_progress.py |
| 2 | Pre-flight count helpers (Jira approximate-count, GitHub GraphQL totalCount, Jenkins lighter tree) | 3 connectors |
| 3 | ProgressTracker: lifecycle + rolling rate + ETA math | progress_tracker.py (new) |
| 4 | Worker integration: _sync_issues + _sync_pull_requests | devlake_sync.py |
| 5 | GET /data/v1/pipeline/jobs endpoint + ProgressJob schema | routes.py + schemas.py |
| 6 | 24 unit tests covering rate/ETA/lifecycle/idempotency | test_progress_tracker.py |

Live validation

After restarting sync-worker, smoke test:

curl 'http://localhost:8000/data/v1/pipeline/jobs?limit=5'

Returned 10+ GitHub repo rows with itemsEstimate correctly populated via GraphQL totalCount (18, 157, 72, 430, 146, 81, 156, 0, 203, ...). The schema is camelCase via _CamelModel. isStalled=false for scopes with recent activity.

DB:

 entity_type   | scope_key                                         | status  | items_estimate
---------------+---------------------------------------------------+---------+---------------
 pull_requests | github:repo:webmotors-private/portal-turbo-mobile | running |             18
 pull_requests | github:repo:webmotors-private/webmotors.zerokm.ui | running |            157
 ...

Anti-patterns addressed (from the v2 architecture)

| AP | Status |
|----|--------|
| AP-1 (Bulk-fetch-then-persist) | Already resolved in Phase 1 (FDD-OPS-012) |
| AP-2 (Redundant API calls) | Already resolved in FDD-OPS-013 |
| AP-3 (Sequential phases + global watermark) | Already resolved in FDD-OPS-014 (Phase 2A+2B) |
| AP-4 (No source isolation) | Pending: Step 2.6 docker-compose split |
| AP-5 (Estimate-and-pray) | This PR |

Acceptance criteria (FDD-OPS-015)

  • /pipeline/jobs returns 1 row per active scope after 30s
  • Each row has status, items_done, ETA, rate
  • ETA accuracy within ±20% at steady state (validated by the test test_eta_within_20_pct_of_actual_at_steady_state)
  • Stalled detection (running + last_progress >60s ago)
  • Sort/filter by status, entity_type
  • UI tab in Pipeline Monitor: PR B (next)

Design decisions

  • finish() is idempotent (_is_finished flag): this protects against an outer except flipping the tracker from 'done' to 'failed'
  • Rolling rate via deque (5 samples, oldest→newest) instead of an EMA: easier to reason about and robust to batch-size variation
  • ETA pinned at 0 when items_done >= estimate (the estimate undercounted): the UI shows 100%, not "ETA: -5min"
  • A persistence failure does NOT block ingestion: tracker._upsert logs the error and swallows it
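
The rolling-rate and ETA decisions above can be sketched roughly as follows. This is a minimal illustration, not the code in progress_tracker.py: the names RollingRate, eta_seconds and progress_pct are hypothetical helpers, and returning None when there is no usable estimate or rate is an assumption.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class RollingRate:
    """Items/sec over the last `window` (timestamp, items_done) samples."""

    def __init__(self, window: int = 5) -> None:
        # deque(maxlen=...) silently drops the oldest sample once full.
        self._samples: Deque[Tuple[float, int]] = deque(maxlen=window)

    def add(self, ts: float, items_done: int) -> None:
        self._samples.append((ts, items_done))

    def rate(self) -> Optional[float]:
        # Compare the oldest sample against the newest one in the window.
        if len(self._samples) < 2:
            return None
        (t0, n0), (t1, n1) = self._samples[0], self._samples[-1]
        if t1 <= t0:
            return None
        return (n1 - n0) / (t1 - t0)


def eta_seconds(estimate: Optional[int], done: int, rate: Optional[float]) -> Optional[float]:
    """ETA = (estimate - done) / rate, pinned at 0 so it never goes negative."""
    if estimate is None or rate is None or rate <= 0:
        return None  # assumption: unknown ETA is reported as None
    return max(0.0, (estimate - done) / rate)


def progress_pct(estimate: Optional[int], done: int) -> Optional[float]:
    """Percentage complete, capped at 100.0 when the estimate undercounted."""
    if estimate is None or estimate <= 0:
        return None  # assumption: no percentage without a positive estimate
    return min(100.0, 100.0 * done / estimate)
```

With samples (0s, 0 items), (10s, 50 items) and (20s, 100 items) the rate is 5 items/s, so a scope with an estimate of 430 and 100 items done would report an ETA of 66 seconds.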

Deferred to follow-up

  • _sync_deployments per-scope tracking (the Jenkins bulk-fetch needs a refactor; the low volume justifies deferral)
  • PR B (next): Pipeline Monitor UI tab "Per-scope progress"
  • Retention cron (operational, not code): DELETE WHERE status IN ('done','failed') AND last_progress_at < now() - interval '7 days'

Stats

  • 10 files, +1234 / -1 lines
  • 24 new unit tests
  • 142/142 regression tests green
  • 1 new DB table (with 3 indexes, including a partial index for stalled detection)

Test plan

  • CI runs all 4 gates green
  • cd packages/pulse-data && pytest tests/unit/test_progress_tracker.py -v → 24 green
  • make migrate applies 012 cleanly (in environments where 012 has not already been applied via direct SQL)
  • curl /data/v1/pipeline/jobs returns a list as camelCase JSON
  • Trigger a sync cycle → verify pipeline_progress is populated per scope
  • After >60s without activity on a running scope → isStalled=true
  • Filter ?status=failed returns only failed jobs
  • Filter ?entity_type=issues returns only issues
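
For scripting the filter checks in this test plan, a small URL builder may help. This is only a sketch: jobs_url is a hypothetical helper, and it assumes the status, entity_type and limit query parameters behave as described in this PR.

```python
from typing import Dict, Optional
from urllib.parse import urlencode


def jobs_url(
    base_url: str,
    status: Optional[str] = None,
    entity_type: Optional[str] = None,
    limit: Optional[int] = None,
) -> str:
    """Build a /data/v1/pipeline/jobs URL, including only the filters that are set."""
    params: Dict[str, str] = {}
    if status is not None:
        params["status"] = status
    if entity_type is not None:
        params["entity_type"] = entity_type
    if limit is not None:
        params["limit"] = str(limit)
    path = f"{base_url.rstrip('/')}/data/v1/pipeline/jobs"
    query = urlencode(params)
    return f"{path}?{query}" if query else path


# Example: the smoke-test URL and the two filter checks from the test plan.
smoke = jobs_url("http://localhost:8000", limit=5)
failed_only = jobs_url("http://localhost:8000", status="failed")
issues_only = jobs_url("http://localhost:8000", entity_type="issues")
```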

🤖 Generated with Claude Code

Closes the AP-5 anti-pattern from ingestion-architecture-v2.md — operators
no longer need to grep logs to answer "is BG project still progressing or
stuck?". Live progress + ETA per-scope, queryable via API.

WHY THIS PR EXISTS

5 incidents on 2026-04-27/28 where my ETA estimates were wrong by 10×+.
The user's frustration ("toda hora estamos caindo nesse cenário", i.e.
"we keep ending up in this scenario") drove the v2 architecture (commit
c5e38bb), which catalogued AP-5 ("estimate-and-pray") as the 5th
anti-pattern. This PR ships the systemic fix.

WHAT SHIPS

1. Migration 012 — `pipeline_progress` table (per-scope, ~32+ rows during
   Webmotors backfill: one per Jira project + one per GitHub repo)
2. Pre-flight count helpers per connector:
   - JiraConnector.count_issues_for_project (POST /search/approximate-count)
   - GitHubConnector.count_prs_for_repo (GraphQL totalCount)
   - JenkinsConnector.count_builds_for_job (lighter tree query)
   All with 10s timeout + None-fallback so ingestion never blocks on count.
3. ProgressTracker module — encapsulates lifecycle (start_scope → tick →
   finish), rolling-window rate (5 samples), ETA = (estimate - done) / rate
4. Worker integration:
   - _sync_issues: per-project tracker, lazy creation on first sight
   - _sync_pull_requests: per-repo tracker, started on "starting" signal
   - _sync_deployments: deferred to follow-up (volume low ~1.4k builds,
     bulk fetch refactor needed for per-job tracking)
5. GET /data/v1/pipeline/jobs endpoint — list ProgressJob with computed
   `progress_pct` + `is_stalled` (running + last_progress_at >60s ago)
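
The "10s timeout + None-fallback" behaviour of the pre-flight count helpers in item 2 could look roughly like the wrapper below. It is a sketch under assumptions: safe_count is a hypothetical name, the connectors are assumed to expose async count functions, and the real helpers may implement the timeout differently.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Optional

logger = logging.getLogger(__name__)


async def safe_count(
    count_fn: Callable[[], Awaitable[int]],
    timeout_s: float = 10.0,
) -> Optional[int]:
    """Run a pre-flight count; return None on timeout or any error.

    Ingestion should proceed without an estimate rather than block on it.
    """
    try:
        return await asyncio.wait_for(count_fn(), timeout=timeout_s)
    except asyncio.TimeoutError:
        logger.warning("pre-flight count timed out after %.1fs", timeout_s)
        return None
    except Exception:
        logger.warning("pre-flight count failed", exc_info=True)
        return None


async def _example_count() -> int:
    # Stand-in for a connector call such as count_prs_for_repo (hypothetical body).
    return 18


estimate = asyncio.run(safe_count(_example_count))
```

A None result would simply leave items_estimate empty for that scope while the sync continues.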

LIVE VALIDATION

Restart sync-worker → curl /pipeline/jobs:
  - 10 GitHub repos with totalCount estimates (18, 157, 72, 430, 146, 81,
    156, 0, 203, ...) populated successfully
  - All status=running, isStalled=false (recent activity)
  - JSON camelCase via _CamelModel base
  - `pipeline_progress` table populated with same shape

DESIGN DECISIONS

- finish() is idempotent (_is_finished flag) — prevents outer except
  flipping a 'done' tracker to 'failed'
- Rolling rate uses oldest→newest in deque, not exponential moving average
  — easier to reason about, robust to batch-size variance
- ETA pinned at 0 when items_done >= estimate (under-counted) instead of
  going negative — UI shows 100%, not "ETA: -5min"
- progress_pct capped at 100.0 for same reason
- Persistence failures in tracker._upsert log + swallow — ingestion MUST
  NOT fail because progress tracking failed
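
The idempotent finish() and the log-and-swallow persistence described above can be illustrated with a stripped-down tracker. This is a sketch, not the ProgressTracker in this PR: TrackerSketch and its persist callback are hypothetical, and only the _is_finished flag and the _upsert name come from the description.

```python
import logging
from typing import Callable, Dict

logger = logging.getLogger(__name__)


class TrackerSketch:
    def __init__(self, scope_key: str, persist: Callable[[Dict[str, str]], None]) -> None:
        self.scope_key = scope_key
        self.status = "running"
        self._is_finished = False
        self._persist = persist  # hypothetical storage callback

    def _upsert(self) -> None:
        # Progress tracking is best-effort: log the failure and carry on.
        try:
            self._persist({"scope_key": self.scope_key, "status": self.status})
        except Exception:
            logger.warning("progress upsert failed for %s", self.scope_key, exc_info=True)

    def finish(self, status: str) -> None:
        # The first call wins; later calls (e.g. from an outer except) are no-ops.
        if self._is_finished:
            return
        self._is_finished = True
        self.status = status
        self._upsert()


rows = []
tracker = TrackerSketch("github:repo:foo/bar", rows.append)
tracker.finish("done")
tracker.finish("failed")  # ignored: the tracker stays 'done'
```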

DEFERRED TO FOLLOW-UP

- _sync_deployments per-scope tracking (Jenkins bulk-fetch needs refactor)
- UI tab in Pipeline Monitor (separate PR B per the stacked-PR plan)
- Retention cron for old rows (operational, not code)

TESTS

- 24 unit tests in test_progress_tracker.py covering rate math, ETA
  edges, lifecycle (start/tick/finish), idempotency, Webmotors-shape
  steady-state ETA accuracy
- 142/142 regression tests green (no impact on existing tests)

ACCEPTANCE CRITERIA (from FDD-OPS-015)

- [x] /pipeline/jobs returns 1 row per active scope after 30s
- [x] Each row has status, items_done, ETA, rate
- [x] ETA accuracy within ±20% at steady state (verified via
      test_eta_within_20_pct_of_actual_at_steady_state)
- [x] Stalled detection: status='running' + last_progress_at < now - 60s
- [x] Sortable/filterable by status, entity_type
- [ ] UI tab — pending PR B

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nascimentolimaandre-cloud nascimentolimaandre-cloud merged commit b2f73c0 into main Apr 29, 2026
4 checks passed
@nascimentolimaandre-cloud nascimentolimaandre-cloud deleted the feat/ops-015-observable-ingestion branch April 29, 2026 19:20
nascimentolimaandre-cloud pushed a commit that referenced this pull request Apr 29, 2026
…d detection (FDD-OPS-015 UI)

Closes the FDD-OPS-015 deliverable list: the backend (PR #6) shipped the API,
and this PR ships the operator-facing "Per-scope" UI tab in the Pipeline Monitor.

WHAT YOU SEE

The 4th tab, "Per-scope", on the `/pipeline-monitor` route shows **live**:

  - Summary in the header: counts of "running", "stalled", "done", "failed"
  - Table with 1 row per active/recent scope:
      Scope (jira:project:BG / github:repo:foo/bar)
      Entity type
      Progress bar with itemsDone / itemsEstimate + percentage
      Status badge (running spinner, stalled warning, done check, failed X)
      Rate (items/sec)
      Formatted ETA (90s, 5m 23s, 2h 14m)
      Last activity (relative: "12s ago", "3m ago")
  - Filters: by entity_type (issues/prs/deploys/sprints) + by status
  - Polling every 5s (live updates; data considered stale after 2s)
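
The ETA display formats listed above (90s, 5m 23s, 2h 14m) can be reproduced with a small formatter. The sketch below is in Python for illustration only; the actual UI code is TypeScript, format_eta is a hypothetical name, and the 120-second and 1-hour thresholds are assumptions chosen to match the three examples.

```python
from typing import Optional


def format_eta(seconds: Optional[float]) -> str:
    """Format an ETA in seconds as '90s', '5m 23s' or '2h 14m'."""
    if seconds is None:
        return "\u2014"  # dash placeholder shown when the ETA is unknown
    total = max(0, int(round(seconds)))
    if total < 120:  # assumed threshold: short ETAs stay in seconds
        return f"{total}s"
    if total < 3600:
        minutes, secs = divmod(total, 60)
        return f"{minutes}m {secs}s"
    hours, rem = divmod(total, 3600)
    return f"{hours}h {rem // 60}m"


examples = [format_eta(90), format_eta(323), format_eta(2 * 3600 + 14 * 60)]
```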

STALLED DETECTION

When the backend reports `isStalled=true` (status='running' AND no progress
for >60s), the row gets:
  - Background tinted warning (subtle yellow)
  - AlertCircle icon
  - "STALLED" badge replaces normal status

The operator sees within seconds which scope needs attention.
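
On the backend side, the stalled condition quoted above reduces to a small predicate. The following is a sketch of that rule only: is_stalled as a free function and the STALL_THRESHOLD constant are hypothetical, and treating a missing last_progress_at as "not stalled" is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

STALL_THRESHOLD = timedelta(seconds=60)


def is_stalled(status: str, last_progress_at: Optional[datetime], now: datetime) -> bool:
    """True when a scope is still 'running' but has made no progress for >60s."""
    if status != "running" or last_progress_at is None:
        return False
    return now - last_progress_at > STALL_THRESHOLD


now = datetime(2026, 4, 29, 19, 0, 0, tzinfo=timezone.utc)
recent = is_stalled("running", now - timedelta(seconds=12), now)
stuck = is_stalled("running", now - timedelta(seconds=61), now)
```

Passing `now` in explicitly keeps the rule easy to unit-test without patching the clock.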

NO-ESTIMATE GRACEFUL HANDLING

If the backend could not obtain a pre-flight count (timeout, source unsupported),
itemsEstimate=null:
  - Progress bar shows an indeterminate stripe (15% width as a hint)
  - Pct label shows "?"
  - ETA shows "—"

The UI does not hang; the operator sees "fetching, rate X/s, total unknown".

ANTI-SURVEILLANCE

The Zod schema uses strict() in tests to reject `author`/`assignee` in the
payload, matching the invariant in metrics-inconsistencies §8.9.

FILES

Frontend:
  - types/pipeline.ts: ProgressJob, ProgressJobStatus, ProgressJobPhase
  - lib/api/pipeline.ts: fetchPipelineJobs (with query params)
  - hooks/usePipeline.ts: usePipelineJobs (5s polling, 2s staleTime)
  - components/pipeline/PerScopeJobs.tsx (NEW, ~330 lines)
  - routes/_dashboard/pipeline-monitor.tsx: 4th tab "Per-scope"

Tests:
  - tests/contract/schemas/pipeline-jobs.schema.ts (Zod schema)
  - tests/contract/pipeline-jobs-contract.test.ts (13 tests):
      A-G: shape validity (running/no-estimate/stalled/failed/done/empty/array)
      H-I: anti-surveillance (rejects author/assignee)
      J-M: defensive bounds (negative items, >100 pct, unknown enums)

VALIDATION

  ✅ TypeScript build clean (npx tsc --noEmit)
  ✅ ESLint clean (no new warnings)
  ✅ 163/163 frontend tests pass (13 new + 150 existing)
  ✅ Live API smoke test: 10+ scopes returned, isStalled correctly computed
  ✅ JSON wire-shape matches schema (verified via curl during dev)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>