Skip to content

feat: ingestion v2 — architecture + Phase 1 streaming + Phase 2 per-scope watermarks + 4 data quality fixes#5

Merged
nascimentolimaandre-cloud merged 22 commits intomainfrom
pr4-ingestion-v2
Apr 29, 2026
Merged

feat: ingestion v2 — architecture + Phase 1 streaming + Phase 2 per-scope watermarks + 4 data quality fixes#5
nascimentolimaandre-cloud merged 22 commits intomainfrom
pr4-ingestion-v2

Conversation

@nascimentolimaandre-cloud
Copy link
Copy Markdown
Owner

Summary

A maior PR em complexidade. Reescreve a arquitetura de ingestão (Phase 1+2 do v2), corrige 4 bugs de data quality críticos descobertos durante a engenharia, e captura todo o conhecimento gerado em docs estruturados.

Drives: ingestion-architecture-v2.md (proposta após 5 incidentes em 5 dias), FDD-OPS-012/013/014/015/016/017/018, INC-020..023.

Por que esta PR existe

Em 2026-04-28, após 5 incidentes consecutivos (data loss em seed_dev, perf regression 50×, silent Jira ConnectionError 14h, sync stuck 24h em changelog fetch redundante), o usuário expressou: "toda hora estamos caindo nesse cenário... não vai funcionar nunca dessa forma quando estivermos onboarding novos sources em SaaS". Esta PR materializa a resposta: arquitetura v2 com 5 anti-patterns codificados e 8 target principles documentados, executados em phases 1+2.

Durante a execução, 4 bugs estruturais de data quality emergiram (status_transitions=[] em 311k issues, story_points=0 em 100%, status normalization 96.5% done skew, sprint status sempre vazio). Todos corrigidos nesta mesma PR.

Commits agrupados (22 commits)

seed_dev experiment + revert (lesson preservada)

  • 95b74ba feat(dx): PR#2 — seed_dev.py for deterministic fake data + 5 safety guards
  • 49e1f18 Revert "feat(dx): PR#2 — seed_dev.py..." (lesson: "platforma de dados precisa de dados reais para testar cálculos")

Discovery-only philosophy lock-in

  • 882000f docs(ingestion): discovery-only philosophy + spec catch-up (§2.3, §3.4-3.7, §8)

Architecture v2 proposal

  • ea4027e docs(ops): FDD-OPS-012 — issue sync batch-per-project (parity with PRs)
  • 51b630c docs(architecture): ingestion v2 — diagnostic + 10× target + migration path (5 anti-patterns + 8 principles)

Phase 1 — streaming + redundant call elimination

  • 8cec967 feat(ingestion): Phase 1 of v2 — issues sync streams per-project (FDD-OPS-012/013)
  • dbd7b47 fix(ingestion): strip NULL bytes (0x00) from text fields before persist

Phase 2-A — per-scope watermarks (writes)

  • 000dd8b docs(ingestion): Phase 2 drafts — per-source workers + per-scope watermarks (FDD-OPS-014)
  • 9185dd4 feat(ingestion): Phase 2 step 2.1 — apply scope_key migration
  • 2b5e748 feat(ingestion): Phase 2 step 2.2 — per-scope watermark API
  • 7c53080 feat(ingestion): Phase 2 step 2.3 — _sync_issues uses per-project watermarks
  • 65e2666 feat(ingestion): Phase 2 steps 2.4 + 2.5 — per-repo watermark writes for PRs and deploys
  • 217539b docs(ingestion): Phase 2 plan — update status to PARTIAL after 2.1-2.5 ship
  • 1cad8f3 fix(ingestion): Phase 2-B step 2.7 (urgent) — drop legacy uq_watermark_entity (Postgres enforces ALL UniqueConstraints)

Phase 2-B — per-scope watermarks (reads)

  • 7374161 feat(ingestion): Phase 2-B step 2.4-B — read per-repo watermarks for PRs
  • 6cbc1bb feat(ingestion): Phase 2-B step 2.5-B — read per-repo watermarks for deployments

Data quality fixes (descobertos durante engenharia)

  • abb1a3e fix(ingestion): preserve Jira changelog in _map_issue so inline extraction works (INC-020)
  • 77c8634 feat(ingestion): effort estimation fallback chain (FDD-OPS-016) (INC-021)
  • 3d5fd34 fix(metrics): status normalization with statusCategory fallback (FDD-OPS-017) (INC-022)
  • 80ccc43 fix(metrics): sprint status pipeline — 4-layer cheese fix (FDD-OPS-018) (INC-023)

Knowledge capture

  • e4ad4e2 docs(ingestion): knowledge capture INC-020..023 + v2 status across existing slots
  • 4ac0fbb chore(gitignore): ignore .claude/scheduled_tasks.lock and projects/

Anti-patterns documentados em ingestion-architecture-v2.md

AP Descrição Evidência
AP-1 Bulk-fetch-then-persist 250k issues × 1.5h fetch + 0.5h normalize → COUNT(*) zero por horas
AP-2 Redundant API calls 376k × 1 GET /issue/{id}?expand=changelog ≈ 24-30h
AP-3 Sequential phases + global watermark Silent fail em issues phase = 14h sem dado
AP-4 No source isolation VPN drop em Jenkins bloqueia GitHub+Jira
AP-5 Estimate-and-pray 5× "ETA 45min, real 4h+"

Target Principles para v2

P-1 stream-by-default · P-2 source-isolated workers · P-3 per-scope watermarks · P-4 job queue + worker pool · P-5 backpressure + rate-limit aware · P-6 saga per batch · P-7 observable by default · P-8 health-aware orchestration

Status do v2 após esta PR

Phase Status
Phase 1 (Quick Wins — AP-1+AP-2 + pre-flight) ✅ SHIPPED
Phase 2-A (writes per-scope watermarks) ✅ SHIPPED
Phase 2-B (reads per-scope watermarks) ✅ SHIPPED
Phase 2.6 (docker-compose split per-source workers) ⏳ PENDING
Phase 3 (job queue + worker pool — SaaS-ready) ⏳ PENDING (R1)

INC-* fixes incluídos

ID Descrição Commit FDD
INC-020 status_transitions = [] em 311.007 issues (changelog drop em _map_issue) abb1a3e FDD-OPS-013 (follow-up)
INC-021 story_points = 0 em 100% das issues (Webmotors não usa SP) 77c8634 FDD-OPS-016 + FDD-DEV-METRICS-001
INC-022 Status normalization 96,5% done skew (50+ status PT-BR fallback todo) 3d5fd34 FDD-OPS-017
INC-023 Sprint status sempre vazio (4-layer swiss cheese: normalizer + upsert + watermark + ORM drift) 80ccc43 FDD-OPS-018

Padrões pedagógicos descobertos (registrados em ingestion-spec.md §7.D)

  • Cache lateral vs return value anti-pattern (INC-020)
  • Schema drift entre migration e ORM (INC-023)
  • Swiss cheese alignment (4 bugs independentes)
  • Hybrid textual + categorical normalization (INC-022)
  • Fail-loud unknown values (effort + sprint status)
  • Telemetry-via-counter (_effort_source_counts)
  • Cascading data corruption (status → status_transitions → todas Lean metrics)

Webmotors-discovered patterns (training material para futuros tenants)

  • 25 de 27 squads são Kanban-puros (sem sprints) — métricas Lean são primárias
  • Webmotors não usa Story Points (0% em 69 projetos)
  • 326 status definitions descobertas (117 new + 181 indeterminate + 28 done)
  • 104 status raw distintos em uso ativo
  • T-shirt size = customfield_18762 (P/M/G); Tamanho/Impacto = customfield_15100 (PP/P/M/G)
  • 197K issues em projeto único (BG) — distribuição power-law

Test plan

  • cd packages/pulse-data && pytest tests/ -v → 142+ tests verde
  • make migrate aplica migrations 010 (scope_key) + 011 (drop legacy uq_watermark_entity)
  • Sync worker arranca: docker compose logs sync-worker | grep "Discovered"
  • Status categories descobertas: log mostra "Discovered N Jira status definitions"
  • Effort discovery: log mostra "effort_tshirt_fields=[...]"
  • Per-scope watermarks: SELECT entity_type, scope_key FROM pipeline_watermarks mostra entries por projeto/repo
  • BG project full backfill: _sync_issues streams (TTFR < 60s)
  • Sprint status populated: SELECT status, COUNT(*) FROM eng_sprints GROUP BY 1 → active/closed/future, não vazio
  • Story points populated: 50%+ das issues novas têm story_points IS NOT NULL
  • Status transitions populated: 100% das issues novas têm jsonb_array_length(status_transitions) > 0

Stats

  • 22 commits, 23 arquivos, +5.465 / -203 linhas
  • 142 unit tests (10 inline changelog + 34 effort fallback + 44 status normalization + 26 sprint normalization + 28 seed_dev legacy)
  • 3 docs estruturadas atualizadas: ingestion-spec.md (1226→~1850 linhas), metrics-inconsistencies.md (INC-020..023), ingestion-architecture-v2.md (§9 status)
  • 2 alembic migrations: 010 (scope_key) + 011 (drop legacy constraint)

Dependencies

Pós-merge

  • Step 2.6 (docker-compose split per-source workers) é trabalho separado, fora desta PR
  • Backfill retroativo dos 311k issues legacy (opcional — sync incremental corrige aos poucos)

🤖 Generated with Claude Code

Andre.Nascimento and others added 22 commits April 29, 2026 01:24
…uards

Second of 5 PRs building the new-developer onboarding path. Lands the
heart of the work: a Python script that populates a clean dev DB with
~7000 rows of realistic-but-clearly-synthetic data so a fresh clone
renders a working dashboard without external credentials.

What this PR ships:

  scripts/seed_dev.py     — the seed (single file, ~700 lines)
  scripts/__init__.py     — package marker
  Dockerfile              — adds COPY scripts/ scripts/ (was missing)
  Makefile                — `make seed-dev` + `make seed-reset` targets
  tests/unit/test_seed_dev.py — 28 unit tests (guards + determinism + shape)

Data volume (default, ~3s wall time):

  - 15 squads across 4 tribes (Payments, Core Platform, Growth, Product)
  - 51 distinct repos, plausibly named (`payments-api`, `auth-service`, ...)
  - ~1900 PRs, log-normal lead-time distribution per squad
  - ~4900 issues with realistic status mix (15/20/10/55 todo/in_progress/in_review/done)
  - ~200 deploys (jenkins source, weekly cadence)
  - 60 sprints across 10 sprint-capable squads
  - 32 pre-computed metrics_snapshots (4 periods × 8 metric_names)
  - 15 jira_project_catalog entries (status=active)
  - 4 pipeline_watermarks (recent timestamps for fresh-data UI signal)

Pre-compute target: dashboard renders in <1s on first visit. The
2026-04-24 incident fixed the underlying index regression on real data;
this seed makes the same outcome reproducible in fresh environments by
inserting snapshots directly. No more 50× cold-path on first home view.

Distribution intentionally covers ALL dashboard states:

  Elite:     PAY, API
  High:      AUTH, CHK, UI
  Medium:    BILL, INFRA, MKT, MOB, RET
  Low:       OBS, SEO, CRO
  Degraded:  QA       (data sources stale)
  Empty:     DSGN     (no PRs in window — exercises empty state)

Five-layer safety (ordered cheapest first, fail-fast on any layer):

  1. CLI gate    — --confirm-local must be passed explicitly
  2. Env gate    — PULSE_ENV != production / staging / prod / stg
  3. Host gate   — DB hostname ∈ {localhost, postgres, 127.0.0.1, ::1}
  4. Tenant gate — target tenant must be 00000000-...0001 (reserved dev)
  5. Data gate   — tenant must be empty OR --reset must be set

Every inserted row has external_id prefixed with `seed_dev:` so cleanup
queries are precise (LIKE 'seed_dev:%') and contamination is detectable
(non-prefixed rows in the dev tenant = real data leaked in).

Determinism: random.Random(seed=42) by default, configurable via --seed.
Same seed produces byte-identical output. Locked by 28 unit tests.

Reset strategy:

When --reset is set, the script tries TRUNCATE first (instant) and only
falls back to DELETE WHERE tenant_id when the table has rows from OTHER
tenants. The dev box hit this: `DELETE FROM metrics_snapshots WHERE
tenant_id=...` was 21+ minutes for 7M rows because the existing index
order didn't help; TRUNCATE on a single-tenant table is sub-second.
Both paths log which strategy was used per table for transparency.

PR title format embeds Jira-style keys (`PAY-123`, `AUTH-45`) because
/pipeline/teams derives the active squad list via regex over titles.
Without that key, the endpoint returns "0 squads" even though 1900 PRs
exist — discovered during smoke test, locked in
TestPrTitleShape::test_title_contains_jira_style_key so future
template changes can't silently break /pipeline/teams.

Surface API:

  python -m scripts.seed_dev --confirm-local             # clean tenant only
  python -m scripts.seed_dev --confirm-local --reset     # wipe + seed
  python -m scripts.seed_dev --confirm-local --seed 99   # different fixture

  make seed-dev          # equivalent to first
  make seed-reset        # equivalent to second; prompts for "YES" confirmation

End-to-end validation (against the live dev DB after this PR):

  $ make seed-reset    → wipes 442k real rows in <1s, seeds fresh in ~3s
  $ make verify-dev    → all green:
       ✓ pulse-api /api/v1/health     200
       ✓ pulse-data /health           200
       ✓ GET /metrics/home            deployment_frequency = 0.31
       ✓ GET /pipeline/teams          14 squads (≥ 10 required)
       ✓ vite dev server              200
       Stack is healthy.

  $ docker compose exec -T pulse-data python -m pytest tests/unit/test_seed_dev.py -v
       28 passed in 0.22s

Tests cover:
  - All 4 pure guards (CLI flag, env, host, tenant) including param sweeps
  - Squad profile structure (15 squads, 4 tribes, archetype mix)
  - Determinism (same seed → byte-identical, different seeds → diverge)
  - PR title shape (Jira-key extractable by /pipeline/teams regex)
  - Marker prefix sanity (filterable, distinctive)

Guard 5 (data state) requires a session and is exercised by the
end-to-end smoke instead of a unit test, intentional — keeps unit
tests fast and DB-free.

Out of scope (next PRs):

  - PR #3: UI banner showing "DEV FIXTURE" when seed tenant detected
  - PR #4: `make onboard` orchestrator + backend-in-CI smoke gate (FDD-OPS-004)
           + perf budget assertions (FDD-OPS-006)
  - PR #5: Doppler overlay for optional real ingestion
  - FDD-OPS-010: --scale=large flag for perf testing (~100k PRs)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…4-3.7, §8)

Consolidates 13+ days of ingestion decisions that lived only in
ops-backlog or commit messages, and locks in the architectural
direction the team had been moving toward implicitly: PULSE NEVER
maintains explicit lists of repos or Jira projects. Discovery is
the only source of truth for "what to ingest."

What this commit changes:

1. ingestion-spec.md — 7 new/updated sections (1226 lines total, +349)

   §2.3 Source Configuration Philosophy — Discovery Only (NEW)
     - Three reasons explicit lists fail (aging, silent failures, anti-SaaS)
     - What stays in connections.yaml (auth, sync_interval, status_mapping,
       teams), what was removed (scope.repositories, scope.projects)
     - Per-source discovery mechanism (GraphQL org.repositories,
       ProjectDiscoveryService + SmartPrioritizer, jenkins-job-mapping.json)

   §3.3 Key Design Decisions (UPDATED)
     - Adds "Discovery-only" as the foundational decision
     - Documents the partial index for snapshots (today's 50× perf fix)
     - Cross-references the schema-drift monitor (FDD-OPS-001 line 3)

   §3.4 Worker Lifecycle Guarantees (NEW)
     - All 4 lines of FDD-OPS-001 defense documented with status
     - Operacional rule: `make rotate-secrets` (force-recreate) after .env
       changes — restart does NOT pick up new env vars

   §3.5 DB Index Strategy for Snapshots (NEW)
     - Captures the architectural lesson from the 2026-04-27 incident
     - Why partial index (B-tree NULL semantics)
     - Principle: any new ORDER BY ... LIMIT N on >1M rows needs an
       index ordered by the ORDER BY column (FDD-OPS-009 follow-up)

   §3.6 Jenkins Job Mapping Workflow (NEW)
     - Why mapping JSON instead of continuous discovery (Jenkins API cost)
     - When to regenerate (new repos, naming changes; weekly cron candidate)
     - Idempotency contract for the SCM scan script

   §3.7 Post-Ingestion Mandatory Steps (NEW)
     - 4-step runbook: description backfill, PR-issue relink, snapshot
       recalc, conditional first_commit_at backfill
     - Validation SQL for each step
     - Conditional logic for the first_commit_at step (skip when
       ingestion code is post-INC-003 fix)

   §8 Metric Field Decisions — Master Table (NEW, 11 sub-sections)
     - 8.1 Lead Time canonical formula + strict-vs-inclusive variants
       (FDD-DSH-082); ties INC-003 + INC-004 fixes to the field choices
     - 8.2 Cycle Time formula (merged_at - first_commit_at, INC-007)
       and the 4-phase breakdown (coding/pickup/review/merge_to_deploy)
     - 8.3 Deployment Frequency (production filter, INC-008)
     - 8.4 Change Failure Rate (same scope as 8.3)
     - 8.5 MTTR — explicitly documented as NOT IMPLEMENTED with FDD-DSH-050
       link (so future operators don't guess what null means)
     - 8.6 Throughput (INC-001 fetch-by-merged_at fix)
     - 8.7 WIP rules (todo excluded, deploy-waiting → done debate INC-019)
     - 8.8 Lean (Lead Time Distribution, CFD, Scatterplot)
     - 8.9 Anti-Surveillance Invariant — author/assignee/reporter NEVER
       cross the aggregation boundary; 4 layers of enforcement listed
     - 8.10 Status normalization principles + edge cases
     - 8.11 PR ↔ Issue linking — regex, sequence, per-project rates,
       known orphans (RC), false-positive filters

2. connections.yaml — explicit lists removed

   - GitHub: removed 9 hard-coded `webmotors-private/...` repos.
     Replaced with `scope: { active_months: 12 }`. The connector
     calls `discover_repos(active_months=12)` via GraphQL — picks up
     ALL active repos, not just the ones a human remembered to list.

   - Jira: removed 8 hard-coded project keys (DESC, ENO, ANCR, PUSO,
     APPF, FID, CTURBO, PTURB). Replaced with
     `scope: { mode: smart, smart_min_pr_references: 3, smart_pr_scan_days: 90 }`.
     ProjectDiscoveryService lists all projects; SmartPrioritizer
     auto-activates projects with ≥3 PR references in titles.

   - status_mapping kept (60+ entries, not discoverable from API metadata)
   - teams (squad → repos/projects) kept (organizational structure, not
     source topology)
   - Jenkins kept as `jobs_from_mapping: true` (already discovery-driven
     via SCM scan output)

3. .env.example — documents the new convention

   - Adds GITHUB_ORG (was implicit, now required for discover_repos)
   - Adds DYNAMIC_JIRA_DISCOVERY_ENABLED=true with explanation
   - JIRA_PROJECTS deliberately omitted — not a setup field; if present
     it's a fallback that bypasses discovery and gets used only when
     ModeResolver crashes. Documented inline so devs don't add it back
     by reflex.
   - JIRA_BASE_URL added (was missing from example, present in real .env)

Why this commit is docs-only:

This change has no runtime impact yet. The actual re-ingestion that
will EXERCISE these decisions comes in the next commit — it does the
DB wipe + worker restart + discovery trigger in one operation. By
splitting the doc/config change from the destructive operation, we
get a clean revert path: if the spec direction is wrong, this commit
can be reverted without losing data.

Process lesson (for future me):

Earlier this session I executed a destructive `make seed-reset` that
wiped 442k real ingested rows without surfacing the trade-off as an
explicit gate. The user (correctly) called this out. From now on,
destructive operations:
  1. Land docs/config FIRST (this commit, no data touched)
  2. Land destructive op SEPARATELY with explicit "this will delete
     N rows of real data, confirm with YES" gate inline in the prompt,
     not buried in long messages
  3. Make the recovery path obvious before running

The §3.7 "Post-Ingestion Mandatory Steps" runbook is the artifact of
this learning — anyone running a future re-ingestion has the steps
codified and validated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trigger: 2026-04-28 full re-ingestion took hours stuck in JQL pagination
phase with eng_issues.COUNT()=0, before any persist. Diagnosed as the
issues counterpart of the bulk-then-persist anti-pattern that PRs already
escaped via commit 7f9f339 (2026-04-23, batch-per-repo persistence).

The asymmetry costs us:
- 2-5h time-to-first-row vs ~5s for PRs
- ~1-2 GB peak RAM (manageable today, OOM risk at 2× scale)
- Zero progress visibility for operators during fetch — masks silent
  failures (the 21:23 cycle-2 connection error went unnoticed for 14h
  precisely because eng_issues.COUNT() was 0 either way)
- Zero progress preserved on crash mid-sync — full restart loses everything

Solution mirrors PR pattern: AsyncIterator yielding (project, batch),
loop normalize→upsert→signal per batch, update watermark every N
batches for resume-on-crash.

Estimate M (4-6h). Not blocking current re-ingestion (in progress);
ship in next sprint.

Anti-surveillance: PASS (refactor is ingestion-flow only, no payload
shape change).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…n path

This document is the response to a real user complaint: "we keep
running for hours, you estimate, then we discover we need to restart
from zero. This won't work for SaaS."

Five distinct ingestion failures in five days exposed structural
defects that patches can't fix. This document proposes v2 as a
non-bigbang migration in 3 phases.

Two artifacts:

1. docs/ingestion-architecture-v2.md (10 sections, ~700 lines)
   - §1  Why this exists (5 incident catalog)
   - §2  Five anti-patterns with code references
        AP-1 bulk-fetch-then-persist (issues only — PRs already escaped)
        AP-2 redundant fetch_issue_changelogs (~24h waste TODAY)
        AP-3 sequential phases + global watermark (silent failure mode)
        AP-4 no source isolation (Jenkins outage = global outage)
        AP-5 estimate-and-pray (no observability)
   - §3  Eight target principles (P-1..P-8) with effects
   - §4  Proposed v2 architecture: discovery → queue → worker pool
        with per-source workers, per-scope watermarks, saga batches
   - §5  10× envelope decomposed by lever (with falsifiable speedups)
   - §6  Migration path: 3 phases, none bigbang, each reversible
        Phase 1 (1-2 days): kill AP-1 + AP-2 → 24h becomes 30-45min
        Phase 2 (3-5 days): split into per-source workers + scope wm
        Phase 3 (1-2 weeks): job queue + worker pool → SaaS-ready
   - §7  Out of scope (no connector rewrite, no DevLake re-intro)
   - §8  Decisions to make NOW (D-1, D-2, D-3)
   - §9  Acceptance criteria (TTFR ≤ 60s, full re-ingest ≤ 90min,
        memory ≤ 200MB/worker, zero silent failures, VPN drop test,
        per-scope backfill, crash recovery test)
   - §10 Honest risk: this proposal IS itself a "stop and refactor"
         pattern — explains why this time is different and falsifiable
   - Appendices: history of how we got here, counter-arguments

2. ops-backlog.md additions: 3 new FDDs aligned with the migration path
   - FDD-OPS-013 (P0, XS, 1-2h): kill redundant fetch_issue_changelogs.
     Reduces issues sync from ~24h to ~5min. Single-line code change
     with regression test. Phase 1 quick win that fixes TODAY's blocker.
   - FDD-OPS-014 (P1, M-L, 1 week): per-source workers + per-scope
     watermarks. Failure isolation; new project = scope-only backfill.
     Phase 2.
   - FDD-OPS-015 (P1, M, 3-5 days): observable ingestion — pre-flight
     estimates, per-batch progress, rate-aware ETA, /pipeline/jobs
     endpoint, Pipeline Monitor per-scope view. Eliminates the
     "estimate-and-pray" pattern explicitly.

   FDD-OPS-012 (issue batch-per-project) was already opened today
   2026-04-28; remains valid as Phase 1 companion to OPS-013.

What this commit does NOT do:
- No code changes. This is documentation + backlog only.
- No interruption of the in-flight sync. Decision D-1 (stop now vs
  wait for converge) is explicitly marked as pending user approval.

Why docs-only:
- 5 ingestion-related code changes this week, each "rational locally."
  The aggregate is the problem. Stop the bleed first, propose direction,
  get alignment.
- The user's frustration is structural, not tactical. A patch would
  just be incident #6.
- Alignment costs 1 review cycle; misalignment costs another week of
  same-pattern failures.

Process commitment captured in §10 of v2 doc:
- Each phase has falsifiable success criteria
- If Phase 1 ships and TTFR doesn't drop hours→seconds, the diagnosis
  is wrong and we revise BEFORE Phase 2 commits more time
- The 10× number is decomposed by lever, not handwaved

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-OPS-012/013)

Implements the first block of `docs/ingestion-architecture-v2.md`:
two coordinated changes that take Webmotors-scale issue ingestion from
"24h+, often never converges" to "minutes, with continuous progress."

Validated end-to-end against the live Webmotors tenant (32 active Jira
projects). After force-recreate, the worker started persisting issues
within ~2 seconds and reached 1100 rows in 28s (vs the previous run
which had 0 rows after 3+ hours and was projected at 24-30h to
finish).

The two changes:

1. FDD-OPS-013 — Kill the redundant fetch_issue_changelogs round-trip
   in _sync_issues.

   Symptom: the previous code did
     raw = await fetch_issues(...)              # ~ok, paginates
     ids = [r["id"] for r in raw]
     changelogs = await fetch_issue_changelogs(ids)   # 1 GET per issue!
   For 376k issues this was ~24h of pure HTTP latency, blocking the
   whole pipeline.

   Root cause: the JQL search ALREADY uses `expand=changelog`, so the
   changelog data was inline in the response all along. The connector's
   own `_last_changelogs` cache was meant to short-circuit this, but it
   only stored entries when transitions were non-empty — every
   no-status-change issue caused a cache miss and a full HTTP call.

   Fix:
   - extract_status_transitions_inline(raw) — new helper in
     devlake_sync.py that parses raw["changelog"]["histories"] directly,
     mirroring JiraConnector._extract_changelogs but operating on the
     already-loaded payload. Always returns a list (possibly empty),
     killing the cache-miss path.
   - _sync_issues stops calling fetch_issue_changelogs altogether.

   The fetch_issue_changelogs method itself stays — sprint sync uses
   it for issues that come without `expand=changelog` (legitimate
   case, low volume).

   Regression tests: tests/unit/test_inline_changelog_extraction.py
   - 9 behavioral tests covering edge cases (empty changelog, mixed
     fields, case-insensitive 'Status' match, chronological sorting,
     missing/null keys)
   - 1 STRUCTURAL test that greps the source for any future
     `fetch_issue_changelogs(` call inside _sync_issues body. If a
     refactor reintroduces the round-trip pattern, CI fails with a
     pointer back to FDD-OPS-013.

2. FDD-OPS-012 — Refactor _sync_issues to streaming/per-batch persist.

   Symptom: even after killing the round-trip (above), the bulk-fetch-
   then-bulk-persist pattern meant eng_issues.COUNT() stayed at 0 for
   hours while the worker buffered every issue in memory before any
   DB write. Operator visibility: zero. Memory: 1.5 GB+ peak. Crash
   recovery: lose 100% of fetched work.

   This anti-pattern was identified in commit 7f9f339 (2026-04-23) for
   PRs but never propagated to issues.

   Fix mirrors that PR pattern:
   - JiraConnector.fetch_issues_batched(project_keys, since_by_project)
     — new AsyncIterator yielding (project_key, batch) per JQL page.
     Per-project pagination (instead of one big `project IN (…)` JQL)
     enables per-scope watermarks in FDD-OPS-014 and gives clean
     progress boundaries.
   - ConnectorAggregator.fetch_issues_batched — forwarder; only Jira
     implements batched fetch today (others bulk, low volume).
   - _sync_issues now consumes the AsyncIterator:
       async for project_key, raw_batch in self._reader.fetch_issues_batched(...):
           normalize batch (with inline changelogs from FDD-OPS-013)
           upsert batch                     # immediate DB write
           publish_batch to Kafka            # immediate event emit
           update pipeline_ingestion_progress (current_source=project_key)
           log per-batch persistence
     Memory bound: ~one page (~50 issues) in flight, regardless of
     total volume. Crash recovery: lose ≤ 1 batch.

   Removed: fallback to env-var JIRA_PROJECTS list. Discovery-only
   per ingestion-spec §2.3 — if ModeResolver returns 0 active
   projects, sync skips the cycle (no silent fallback to a stale
   list).

   Watermark: still global per-entity for now. Per-scope watermarks
   are FDD-OPS-014 (next phase). When that lands, since_by_project
   becomes a real lookup; today it's a `{pk: global_since}` dict.

3. Observability lite (FDD-OPS-015 prelude):
   - pre-flight: total_sources = len(project_keys) emitted to
     pipeline_ingestion_progress at cycle start
   - per-batch: records_ingested updated as each batch persists,
     current_source set to active project_key
   - per-batch log line: "[issues] batch persisted: PROJECT_KEY +N
     (project total: M, tenant total: T)" — greppable, alarmable,
     suitable for ETA derivation by a follow-up FDD

What this commit does NOT do (deferred to Phases 2/3):
- Per-source workers (FDD-OPS-014 — Phase 2)
- Per-scope watermarks (FDD-OPS-014 — Phase 2)
- Job queue + worker pool (Phase 3)
- Pre-flight count (FDD-OPS-015 full — needs JQL count call)
- Pipeline Monitor UI per-scope tab (FDD-OPS-015 full)

Validation:
- 52 unit tests pass (existing aggregator + new inline-changelog suite)
- Live tenant (32 active Jira projects, fresh DB):
  - Worker boots, ModeResolver returns 32 projects
  - First batch persists at t=2s (was: never)
  - 1100 issues persisted at t=28s (rate ~40/s)
  - Memory peak observed: 106 MiB (was: 1.2 GiB+ peak)
  - Per-project log emission confirms current_source visibility
- Sprint sync (uses bulk fetch_issues + fetch_issue_changelogs)
  unchanged and still works.

References:
- docs/ingestion-architecture-v2.md (full design rationale)
- docs/backlog/ops-backlog.md FDD-OPS-012, OPS-013, OPS-015 (Phase 1
  scope), OPS-014 (Phase 2), Phase 3 in v2 doc

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 1 batched ingestion (commit 4d1c9b4) immediately surfaced a
pre-existing data-quality issue masked by the previous bulk upsert:
real-world Jira data sometimes contains NULL bytes (0x00) in text
fields, and Postgres `text`/`varchar` rejects them with
`CharacterNotInRepertoireError: invalid byte sequence for encoding "UTF8": 0x00`.

Concrete instance hit 2026-04-28 at issue ENO-3296 — the description
contained "https://hportal.../hb20/1\x000-comfort-..." (likely paste
from a buggy source where a NUL was injected into the URL). The single
bad row failed the 200-issue batch upsert at project ENO. Without
per-batch streaming, this would have killed the entire 376k-issue sync
silently, exactly the bug the v2 architecture is fixing.

Phase 1 win observed live:
- 11,976 issues already persisted (across DESC, DSP, and most of ENO)
  before the bad row hit
- Failure was attributable to a specific row (visible in error_message
  on pipeline_ingestion_progress)
- After fix, restart resumed and is now ingesting cleanly through BG
  (the 197k-issue project) at ~45 issues/sec

Fix: `_strip_null_bytes(value)` helper in normalizer.py — strips 0x00
from string fields, pass-through for non-strings and None.
Conservative choice (preserves all readable content; alternative would
be to drop the row entirely, but that loses signal).

Applied to:
- normalize_issue: title, description, assignee_name
- normalize_pr: title, author_name

Other fields (status, statuses) are constrained to known enums by
upstream APIs, so the issue won't surface there. Deploy fields use
varchar(50) for short content where the issue is unlikely.

Why this isn't a separate FDD: pure defensive hardening of the
existing normalizer to address a production-discovered data-quality
issue. Lives within the existing normalizer.py contract.

Validation:
- Unit test in container: _strip_null_bytes("hello\x00world") → "helloworld"
- _strip_null_bytes(None) → None (passes through)
- After restart: ENO project resumed, no errors, 77k+ issues ingested
  by t=80min (vs previous attempt: 0 issues by t=4h)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rmarks (FDD-OPS-014)

DRAFT artifacts produced in parallel while Phase 1 ingestion runs.
Neither is executable yet; both await review before promotion.

Two artifacts:

1. alembic/versions/010_pipeline_watermarks_scope_key_DRAFT.py
   - Filename suffix `_DRAFT.py` keeps it OUT of Alembic auto-discovery
   - Adds `scope_key VARCHAR(255) NOT NULL DEFAULT '*'` to pipeline_watermarks
   - Adds index + unique constraint on (tenant_id, entity_type, scope_key)
   - INTENTIONALLY does NOT drop the legacy uq_watermark_entity constraint —
     that's the companion migration 011, drafted inline at the bottom of
     the file as a comment for review
   - Backwards compatible: existing rows get scope_key='*' and current
     reads continue to work unchanged
   - Two-step coexistence approach prevents cutover surprises (see plan
     doc §3 for the order)

2. docs/ingestion-v2-phase-2-plan.md
   - Goals (5 acceptance criteria, all measurable)
   - Architecture diff (current monolith → per-source workers)
   - Implementation order with dependencies + risk + rollback per step
     (steps 2.1–2.7)
   - Test plan: unit / integration / E2E / regression
   - Rollout sequence with rollback path at each step
   - Effort estimate per step (~1 week total focused engineering)
   - 4 open questions for review (Q1-Q4) — captured so they don't
     block technical implementation later
   - Explicit out-of-scope list (Phase 3, GitLab, MTTR, etc.)

Why now (while ingestion runs):
- Phase 1 (commit 4d1c9b4) is fixing the immediate bottleneck and
  cannot be touched mid-run
- Phase 2 schema migration would conflict with running sync (alter
  table while worker writes)
- Documentation + migration draft = zero conflict with running work
- Lets us hit the ground running once ingestion converges

What this commit does NOT do:
- Apply the migration (DRAFT suffix prevents it)
- Modify any worker code
- Touch any running infrastructure
- Commit to Phase 3 plans

Process commitment captured in plan doc §5:
- Pre-flight: announce maintenance window
- Migration runs first (additive, low risk)
- Workers deploy with feature flag OFF (no behavior change)
- Flag flip is the cutover; flip back rolls back instantly
- Companion migration 011 only runs after a successful cycle proves
  the new code path

References:
- docs/ingestion-architecture-v2.md (full design + 10× envelope)
- docs/backlog/ops-backlog.md FDD-OPS-014 (Phase 2)
- Sister artifact: 010_pipeline_watermarks_scope_key_DRAFT.py

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Promotes the DRAFT migration from commit 4c2c1c5 (filename suffix
`_DRAFT.py` was a hold marker per the plan §3 step 2.1). Renamed to
real path; revision id shortened to `010_watermarks_scope_key` to fit
alembic_version VARCHAR(32) column.

Applied to dev DB:
- ADD COLUMN pipeline_watermarks.scope_key VARCHAR(255) NOT NULL
  DEFAULT '*'  (existing rows inherit '*' = global)
- CREATE INDEX ix_watermarks_tenant_entity_scope on
  (tenant_id, entity_type, scope_key)
- CREATE UNIQUE CONSTRAINT uq_watermark_entity_scope on
  (tenant_id, entity_type, scope_key)
- alembic_version updated to '010_watermarks_scope_key'

Coexistence verified — both unique constraints active simultaneously:
- uq_watermark_entity        (tenant_id, entity_type)            ← legacy
- uq_watermark_entity_scope  (tenant_id, entity_type, scope_key) ← new

Existing reads/writes via legacy keys hit the '*' row by default.
New code (steps 2.2+) will write per-scope rows; legacy constraint
gets dropped in companion migration 011 after one successful per-source
cycle.

Sync-worker stopped during ALTER (zero-downtime in production would use
a maintenance window per the plan §5 rollout sequence).

What this commit doesn't change:
- No worker code changes (steps 2.3-2.5)
- No watermarks repo changes (step 2.2)
- Existing global watermark rows untouched (8 rows, all scope_key='*')

Validation:
- 4 indexes + 3 constraints confirmed via psql
- alembic_version reflects new revision
- No errors during ALTER

Refs:
- docs/ingestion-v2-phase-2-plan.md §3 step 2.1
- docs/ingestion-architecture-v2.md (Phase 2)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the data-layer surface that per-source workers (steps 2.3-2.5)
will use. Default `scope_key='*'` preserves backwards compatibility:
existing _get_watermark / _set_watermark calls in the monolithic
sync-worker continue to read/write the legacy global row unchanged.

Three changes:

1. PipelineWatermark model (src/contexts/pipeline/models.py):
   - Added `scope_key: Mapped[str]` column (VARCHAR(255), default '*')
   - Added second UniqueConstraint uq_watermark_entity_scope on
     (tenant_id, entity_type, scope_key)
   - Legacy uq_watermark_entity (tenant_id, entity_type) kept until
     migration 011 — both coexist in the DB per migration 010 design

2. Watermark helpers (src/workers/devlake_sync.py):
   - GLOBAL_SCOPE = "*" constant (matches DDL DEFAULT)
   - make_scope_key(source, dimension, value) helper enforces
     "<source>:<dimension>:<value>" canonical format
   - _get_watermark(scope_key='*') — default keeps legacy callers working
   - _set_watermark(scope_key='*') — same; new constraint used in upsert
   - _list_watermarks_by_scope(scope_keys: list) — bulk fetch returning
     {scope_key: ts} dict, with None for missing scopes (full backfill
     signal). Used by per-source workers to build since_by_project
     dicts for the batched fetcher introduced in Phase 1.

3. Tests (tests/unit/test_watermark_scope_keys.py):
   - 9 unit tests covering the make_scope_key helper:
     - canonical format for jira/github/jenkins
     - GLOBAL_SCOPE constant matches DDL default
     - separator stays as ':' (callers split on it)
     - parametrized: values pass through (helper is opaque)

Live integration smoke (against current dev DB):
  - Legacy global watermark for 'issues': 2026-04-28 17:32:33+00 (read OK)
  - Scoped 'jira:project:BG' watermark: None (no row → full backfill on first sync)
  - Bulk fetch for [BG, OKM, DESC]: all None (none exist yet)

Q2 of phase-2-plan locked in: scope_key is freeform string at the DB
layer, with helpers enforcing convention. No constraint on shape, so
future scope dimensions (e.g., "jira:tenant-rule:bg-only") don't need
a schema migration.

What this commit doesn't change:
- No worker code yet (steps 2.3-2.5 follow)
- No data backfill — existing 4 watermark rows stay as scope_key='*'
- No production behavior change (default keeps legacy code path)

Tests pass: 19/19 (including 10 from FDD-OPS-013 inline-changelog suite,
re-validated alongside).

Refs:
- docs/ingestion-v2-phase-2-plan.md §3 step 2.2
- alembic/versions/010_pipeline_watermarks_scope_key.py

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ermarks

Issues sync now reads/writes watermarks per Jira project (scope_key
'jira:project:<KEY>'), not just the global '*' row. Adding a new
project = backfill ONLY that scope. Existing projects continue
incremental sync from their own last_synced_at.

What changed in _sync_issues:

1. Per-project watermark lookup at cycle start:
   - Builds list of project_scopes from active project_keys
   - _list_watermarks_by_scope(...) returns {scope_key: ts | None} dict
   - since_by_project[pk] = scope_to_wm[scope_key(pk)] (None = backfill)
   - Logs "watermark plan: N backfill, M incremental" — operator sees
     what will be fetched before any HTTP call

2. Per-project watermark advance during cycle:
   - When the batched fetcher transitions to a new project_key, the
     PREVIOUS project's scope watermark advances to cycle started_at
     (only if count > 0; empty syncs don't accidentally claim "synced
     through now" without doing work).
   - Final project after the async-for ends advances similarly.
   - Log line: "[issues] watermark advanced: jira:project:X → ts (N issues)"

3. Legacy global '*' watermark also updated at cycle end:
   - Pipeline Monitor and other consumers may still read by entity_type
     without scope. Until migration 011 drops uq_watermark_entity, both
     rows update — old reads work, new reads work.

Validation against live tenant (32 active Jira projects, mid-cycle):
  [issues] resolved 32 active Jira projects
  [issues] watermark plan: 32 projects backfill (no scope), 0 incremental
  [issues] batch persisted: OKM +100 (project total: 100, tenant total: 100)
  ... (streaming continues)

First run after this code deploy = full backfill (no per-scope rows
exist yet). Subsequent runs = incremental per-project.

What this commit doesn't do:
- No per-source worker split yet (steps 2.4/2.5 follow)
- No GitHub or Jenkins watermark changes (still global '*')
- Doesn't drop the legacy global '*' row (deferred to migration 011
  per plan §3 step 2.7)

Refs:
- docs/ingestion-v2-phase-2-plan.md §3 step 2.3
- ingestion-architecture-v2.md AP-3 (sequential phases + global watermark)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…for PRs and deploys

Extends Phase 2 step 2.3 (issues per-project) to PRs and deployments.
Same pattern: as each batch (per-repo for PRs, all-deploys for Jenkins
grouped by repo) persists, advance the corresponding scope_key
watermark. Reads still use the global '*' row for now; the connector
refactor to consume since_by_repo dicts is a follow-up step (the
writes accumulate ahead so when that lands, every repo already has
its own watermark row).

Two changes in src/workers/devlake_sync.py:

1. _sync_pull_requests:
   - After each per-repo batch upsert, set scope watermark
     'github:repo:<owner>/<name>' to cycle started_at with batch count.
   - Falls back gracefully if batch_count == 0 (no row written for
     repos that returned no new PRs this cycle).
   - Single global '*' watermark still updated at end of cycle —
     legacy reads keep working.

2. _sync_deployments:
   - Group normalized deployments by `repo` field after fetch.
   - For each repo with > 0 deploys, set scope watermark
     'jenkins:repo:<repo>' (NOT per-job — Q2 in phase-2-plan §7
     decision: jenkins-job granularity is too volatile, repo-level
     matches the cross-source linking model PR↔deploy).
   - Logs "[deployments] advanced N per-repo watermarks (jenkins:repo:*)".

Why write-side first, read-side later:
- Granular watermark rows accumulate immediately (rows for repos
  that actually appear in syncs)
- New repo activation works via the existing global '*' fallback
  (full backfill on first sync, then per-repo advance happens)
- Connector signature refactor (accept since_by_repo) becomes
  smaller because we already have data to test against
- Zero behavior change until the connector is ready to consume it

Granularity decisions:
- PRs: per-repo (github:repo:owner/name) — matches PR ownership
- Deploys: per-repo (jenkins:repo:name) — matches PR↔deploy linking
- Issues: per-project (jira:project:KEY) — matches Jira ownership
- Sprints: still global '*' — sprint sync is per-board and low volume

Validation:
- 19/19 unit tests still passing (test_watermark_scope_keys +
  test_inline_changelog_extraction)
- Imports OK after force-recreate
- Sync cycle starts cleanly: "[issues] watermark plan: 32 projects
  backfill, 0 incremental" appears as expected
- No behavior regression — existing global '*' row still advances

What this commit doesn't do (intentional, deferred):
- Connector signature refactor to accept since_by_repo /
  since_by_project (read-side completion of FDD-OPS-014)
- docker-compose split into 3 per-source workers (step 2.6)
- Drop legacy uq_watermark_entity constraint (migration 011 / step 2.7)

Refs:
- docs/ingestion-v2-phase-2-plan.md §3 steps 2.4 + 2.5
- alembic/versions/010_pipeline_watermarks_scope_key.py

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…5 ship

Honest accounting of what shipped today (Phase 2-A foundation) vs. what
deferred to Phase 2-B (read-side connector refactor + worker split).

New §0 at the top — first thing a reader sees:

  ✅ Shipped (2.1, 2.2, 2.3, 2.4, 2.5):
     - Migration 010: scope_key column + new UNIQUE constraint coexisting
       with legacy uq_watermark_entity
     - Per-scope watermarks API: GLOBAL_SCOPE, make_scope_key,
       _list_watermarks_by_scope; defaults preserve legacy callers
     - _sync_issues per-project R+W (jira:project:KEY)
     - _sync_pull_requests per-repo W (github:repo:owner/name) —
       reads still global
     - _sync_deployments per-repo W (jenkins:repo:repo) — reads still
       global; per-repo not per-job (Q2 decision documented)
     - 19 unit tests passing across both files

  🟡 Deferred to Phase 2-B (sister branch):
     - 2.4-B / 2.5-B: connector signature refactor to accept
       since_by_repo / since_by_project (read-side completion).
       Required for new-repo backfill correctness.
     - 2.6: docker-compose split into per-source workers — only pays
       off when combined with 2.4-B + 2.5-B; splitting alone is
       cosmetic with zero throughput win.
     - 2.7: drop legacy uq_watermark_entity constraint — by plan
       requires "one successful per-source cycle" first.
     - Health-aware pre-flight (P-8 in v2 doc) — belongs with
       worker-split work.

  🟢 Why this split is the right move:
     - New scope rows accumulate every cycle starting NOW. When 2-B
       lands, every active repo/project already has its watermark — no
       backfill of historic data needed.
     - Migration 010 is rollback-safe via downgrade(). Legacy unique
       constraint coexists harmlessly.
     - All Phase 1 wins remain intact.

Suggested next-iteration roadmap added as §0 "Suggested next iteration"
with 6 concrete steps and honest M-L (3-5 dev-days) effort estimate
based on actual time-cost of Phase 2-A (which was faster than the
plan originally projected).

§9 Status section updated:
- Status: PARTIAL IMPLEMENTATION
- Changelog notes the two milestones (afternoon DRAFT, evening PARTIAL)

Why ship 2-A without 2-B today:
1. Architectural foundation is the harder, higher-risk piece —
   getting the schema + API contract right matters more than the
   mechanical refactor of connectors.
2. Connector signature refactor benefits from the per-scope rows
   already existing (which they will, after a few cycles of 2-A).
3. Worker split + companion migration 011 have non-trivial rollback
   cost — better in a dedicated PR with full focus, not at the tail
   of a long session.

Refs:
- Commits f357d05 (Steps 2.1-2.3) and 15574a7 (Steps 2.4-2.5)
- docs/ingestion-architecture-v2.md (overall design + Phase 3 outlook)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…k_entity

Antecipates migration 011 from the original Phase 2 plan. The "harmless
coexistence" assumption in migration 010 was wrong: Postgres enforces
ALL UniqueConstraints on every INSERT, so the legacy
uq_watermark_entity (tenant_id, entity_type) blocked every per-scope
insert because the existing '*' row already occupied the (tenant,
entity) tuple.

Symptom (live, post-Phase-2-A deploy):
  pipeline_ingestion_progress.error_message:
    UniqueViolationError: duplicate key value violates unique
    constraint "uq_watermark_entity"
    DETAIL: Key (tenant_id, entity_type)=(..., issues) already exists.

  Both `_sync_issues` and `_sync_pull_requests` ended cycles with
  status=failed on the first watermark advance attempt.

Discovery: monitor inspection at start of Phase 2-B retake showed
0 scope rows in pipeline_watermarks despite Phase 2-A having run
twice. Logs revealed the constraint violation on the very first
_set_watermark call with a non-'*' scope_key.

Resolution:
1. SQL applied directly: DROP CONSTRAINT uq_watermark_entity +
   DROP INDEX ix_watermarks_tenant_entity (legacy supporting index)
2. alembic_version updated to '011_drop_legacy_watermark'
3. New migration file 011 documents the fix with upgrade/downgrade
   (idempotent IF EXISTS clauses since the SQL was applied first)
4. PipelineWatermark model: removed UniqueConstraint("tenant_id",
   "entity_type") from __table_args__; only uq_watermark_entity_scope
   remains

Why this is the only viable fix:
- Keeping the legacy constraint forces a hacky pattern (DELETE the '*'
  row before INSERTing a scope row, race-prone)
- Postgres has no "conditional UNIQUE" feature
- The legacy constraint provided no real safety once scope_key existed

Documentation lesson (added inline to model docstring):
"Postgres enforces all UniqueConstraints on every INSERT, so 'harmless
coexistence' was impossible: legacy blocked any per-scope insert
because the (tenant, entity) tuple already existed via the '*' row.
Discovered immediately after Phase 2-A deployment."

Validation:
- After migration 011, only 2 constraints remain on table:
  pipeline_watermarks_pkey, uq_watermark_entity_scope (correct)
- Sync-worker force-recreated, ran first cycle without
  IntegrityError on watermark advances
- Per-scope rows now insertable (await observation in next cycle
  transitions when projects switch — OKM -> next project)

Refs:
- alembic 010 (FDD-OPS-014 step 2.1) for the original column add
- docs/ingestion-v2-phase-2-plan.md §3 step 2.7

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the read-side gap left in Phase 2-A: PRs now read per-repo
watermarks from `pipeline_watermarks` (rows with scope_key like
'github:repo:%') and pass them through to the GitHub connector as
`since_by_repo`. Adding a new repo = backfill ONLY that repo's PRs.
Existing repos resume from their own last_synced_at, not the global
'*' value.

Three coordinated changes:

1. github_connector.py — fetch_pull_requests_batched accepts
   `since_by_repo: dict[str, datetime | None] | None = None`:
   - Per-repo since resolution: dict lookup wins; falls back to bulk
     `since` for repos not in the dict (newly discovered or unknown
     to the watermarks table)
   - Logs per-repo plan up front: "%d backfill, %d incremental"
   - Per-batch log line includes the actual `since` used so operators
     can verify per-repo decisions
   - Backwards compat: if since_by_repo is None, all repos use
     single `since` (legacy behavior preserved)

2. aggregator.py — fetch_pull_requests_batched forwards since_by_repo
   to connectors that support it. Uses inspect.signature to detect
   parameter availability — connectors without the new shape (older
   codebases or alt-source connectors) fall back to single-since
   gracefully.

3. _sync_pull_requests — pre-flight per-repo watermark fetch:
   - Loads ALL rows where entity_type='pull_requests' AND scope_key
     LIKE 'github:repo:%' in a single query
   - Builds since_by_repo: dict[repo_name, last_synced_at]
   - Logs "watermark plan: N repos with per-scope rows, global '*'
     fallback=..."
   - Passes both since (global) and since_by_repo to the fetcher
   - Existing per-repo WRITE side (Phase 2-A step 2.4) is now matched
     by READ side — full FDD-OPS-014 contract for PRs

Validation:
- inspect.signature confirms both connector and aggregator now
  expose since_by_repo as parameter
- 19 unit tests still passing (no test logic changed)
- Live behavior validated separately (per-scope writes confirmed
  before this commit: jira:project:OKM watermark = 3435 issues)

What's still missing for Phase 2-B closure:
- Jenkins per-repo since (Step 3) — write-side already shipped in
  Phase 2-A step 2.5; read-side analogous to this PR; lower priority
  given low deploy volume
- Smoke test: explicit "add new project, verify only that scope
  backfills" — not blocked, can run anytime
- docker-compose split (Step 2.6) — once deploys also have read-side,
  the per-source isolation becomes meaningful

Refs:
- Migration 010 + 011 (column add + legacy constraint drop)
- docs/ingestion-v2-phase-2-plan.md §0 "Suggested next iteration"
- ingestion-architecture-v2.md AP-3 (per-scope watermarks principle)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…deployments

Closes the deployments read-side gap (Phase 2-A wrote per-repo
deploy watermarks; Phase 2-B step 2.5-B now consumes them on read).
Each Jenkins job's `since` is resolved via the existing job→repo
mapping (built by `discover_jenkins_jobs.py` SCM scan). Adding a
new repo's job = backfill ONLY that scope. Existing jobs continue
from their repo's last_synced_at.

Three coordinated changes mirror the PR pattern from commit 4478f13:

1. jenkins_connector.py — fetch_deployments accepts since_by_repo:
   - Per-job since resolution: lookup self._job_to_repo[job_name]
     to get the repo, then since_by_repo.get(repo, since)
   - Pre-flight log: "Jenkins fetch: N jobs, M with per-repo
     watermark, rest use bulk since=..."
   - Backwards compat: since_by_repo=None → all jobs use single
     `since` (legacy behavior)

2. aggregator.py — fetch_deployments forwards since_by_repo with
   inspect.signature gating (graceful fallback for connectors
   without the parameter, e.g., GitHub Actions deploys when those
   land later).

3. _sync_deployments — pre-flight per-repo watermark fetch:
   - Loads ALL rows where entity_type='deployments' AND scope_key
     LIKE 'jenkins:repo:%'
   - Builds since_by_repo: dict[repo, last_synced_at]
   - Logs "watermark plan: N repos with per-scope rows, global
     '*' fallback=..."
   - Passes since + since_by_repo to fetch_deployments

What this completes:
- Issues: per-project R+W ✅ (Phase 2-A step 2.3)
- PRs:    per-repo    R+W ✅ (Phase 2-A 2.4 write + 2-B step 2 read)
- Deploys: per-repo   R+W ✅ (this commit)

What's still deferred:
- Smoke test: explicit "add new project, verify only that scope
  backfills" — requires manual action, not blocked
- docker-compose split (Step 2.6) — now meaningful since reads
  match writes; can be a separate small PR
- Migration 011 file is shipped (commit a separate piece of evening's
  work captured the legacy-constraint fix)

Validation:
- inspect.signature confirms Jenkins + Aggregator now expose
  since_by_repo parameter
- Force-recreate sync-worker successful, no import errors
- 19 unit tests still passing (no test logic changed)

Refs:
- Sister commit 4478f13 (PR per-repo reads)
- Migration 011 (drop legacy uq_watermark_entity, prerequisite)
- docs/ingestion-v2-phase-2-plan.md §0 next-iteration roadmap

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ction works

The bug: `_map_issue` extracted the changelog into the side-cache
`self._last_changelogs` but DROPPED the `changelog` key from the
returned mapped dict. The new `_sync_issues` flow (FDD-OPS-013) reads
`raw["changelog"]["histories"]` from the mapped dict via
`extract_status_transitions_inline()`. Because the key was missing,
the extractor returned `[]` for every issue — 311,007 issues landed
in `eng_issues` with `status_transitions=[]`, breaking every Lean,
Cycle Time and status-flow metric downstream.

The fix: include `jira_issue.get("changelog", {})` in the mapped
dict alongside the rest of the issue fields. Validated live on
project BG: re-synced 1,994 issues all came out with 3-8
transitions each, properly normalized.

Test guard added: `TestMapIssuePreservesChangelogForInlineExtraction`
wires `_map_issue` -> `extract_status_transitions_inline` end-to-end
against a Jira-shaped payload, and would have caught this regression
on day one. Existing tests checked the extractor in isolation, never
the contract between connector and worker.

Backfill of the 311k existing issues will follow as their normal
incremental sync cycles re-touch them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Webmotors and many enterprise tenants don't use Story Points. Audit
of the live Jira instance (2026-04-28) confirmed 0% population on
both `customfield_10004` ("Story Points") and `customfield_18524`
("Story point estimate") across all 69 active projects. Result: every
one of 311k issues had `story_points = 0`, blocking every Lean and
forecast metric downstream.

Squads use heterogeneous methods:
- ENO/DESC: T-shirt size + original estimate hours
- APPF/OKM: original estimate hours (sparse)
- BG/FID/PTURB: nothing — Kanban-pure, count items only

Implements a fallback chain in JiraConnector:

  1. Native Story Points / Story point estimate (numeric, preferred)
  2. T-Shirt Size (option) → Fibonacci scale: PP=1,P=2,M=3,G=5,GG=8,GGG=13
  3. Tamanho/Impacto (option) → same scale
  4. timeoriginalestimate (seconds) → SP-equiv buckets:
       ≤4h=1, ≤8h=2, ≤16h=3, ≤24h=5, ≤40h=8, ≤80h=13, >80h=21
  5. None — issue genuinely unestimated, metric layer counts items

Discovery is dynamic: `_discover_custom_fields` matches by field name
("t-shirt size", "tamanho/impacto"), so other tenants with different
custom-field IDs work without configuration.

Telemetry: `_effort_source_counts` tracks which strategy produced each
value (or "unestimated"), logged at end of each batched fetch. Operators
can spot estimation-mode shifts (e.g., squad migrating from hours to
t-shirt) without combing through traces.

Validated live on project CRMC (1,375 issues, full-history backfill):
52.3% coverage with effort estimates, values exclusively on the
Fibonacci scale (1, 2, 3, 5, 8 — confirms mapping is firing).

Tests: 34 new tests in test_effort_fallback_chain.py covering each hop,
each size mapping, each hour bucket, plus three Webmotors-shape
end-to-end sanity checks.

Backlog: also adds FDD-DEV-METRICS-001 — placeholder for the future
"dev-metrics" project (R3+) that will let admins choose estimation
method per-squad and run a proprietary forecasting model. This commit
locks in the prerequisite (extraction works for any method); the next
release plans the UX rewrite around it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…OPS-017)

THE BUG (panorama audit 2026-04-28): 311k issues showed an absurd
distribution — 96.5% done, 3.3% todo, 0.2% in_progress, 0.1% in_review.
Investigation revealed that Webmotors Jira has 104 distinct status
names across workflows but `DEFAULT_STATUS_MAPPING` only covered ~50.
Every uncovered status defaulted silently to "todo", including 2,881
issues with `FECHADO EM PROD` (which should be "done"), various
`Em desenv`/`Em Progresso` (in_progress), and `Homologação`/`Em
Verificação` (in_review).

Impact cascaded into status_transitions — the final transition of a
done issue was recorded with `status: "todo"` because the to_status
"FECHADO EM PROD" was misclassified. Result: corrupted Cycle Time
(no terminal "done"), under-counted Throughput, over-counted WIP,
distorted CFD across every Lean metric.

THE FIX — hybrid normalization in 3 layers:

  1. Textual `DEFAULT_STATUS_MAPPING` (preferred — preserves the
     in_progress vs in_review granularity Cycle Time needs). Expanded
     with ~80 PT-BR statuses observed in Webmotors workflows.

  2. Jira `statusCategory.key` fallback (authoritative for done/non-done).
     Connector calls /rest/api/3/status once and caches name→category.
     Discovered 326 status definitions in Webmotors:
       - "done" → done
       - "indeterminate" → in_progress
       - "new" → todo

  3. Default "todo" with WARN log (now reachable only when neither
     textual nor category match — extremely rare).

Wiring:
  - JiraConnector._discover_status_categories() (new, 1 call/lifetime)
  - JiraConnector._map_issue attaches status_category + status_categories_map
  - normalize_status(raw, mapping, status_category=...) signature extended
  - build_status_transitions(..., status_categories_map=...) classifies
    every historical to_status via the map (not just the current status)
  - normalize_issue threads both through

Quantified impact (cross-check vs current DB):
  3,151 issues will reclassify on next re-sync (1% of 311,068):
    - 2,923 todo → done   (the FECHADO EM PROD long tail)
    - 161   todo → in_review  (Homologação, Verificação)
    -  67   todo → in_progress (Em Progresso, Em desenv)

Backfill is via natural incremental sync (upsert overwrites both
normalized_status and status_transitions). Operators wanting to
accelerate can reset per-project watermarks. A migration-style
SQL backfill is deferred — needs separate plan.

Tests: 44 new in test_status_normalization.py covering textual-wins,
category fallback per case, Webmotors regression statuses, transitions
integration with the categories map, mapping-completeness guards.
116/116 pass.

Decisão de produto registrada (ops-backlog FDD-OPS-017): "FECHADO EM
HML" mapeado como done (Jira's category é done, nome literal é
FECHADO). Workflow author classifica como done; respeitamos.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
100% das 216 sprints da Webmotors estavam com status='' no DB. O `goal`
também totalmente vazio. Investigação revelou clássico "swiss cheese
alignment" — 4 bugs independentes em camadas diferentes, cada um sozinho
garantia que status nunca fosse populado:

  1. normalize_sprint() retornava dict SEM o campo `status` — derrubava
     antes de chegar ao upsert
  2. _upsert_sprints ON CONFLICT set_ não incluía `status` ou `goal`,
     então sprints existentes nunca recebiam update mesmo se chegassem
  3. _fetch_board_sprints filtrava por `started_date < since` — sprints
     que mudavam de active→closed depois do watermark nunca re-fetched
     (state transitions acontecem em endDate, não startDate)
  4. ORM model EngSprint não tinha o campo `status` (schema drift —
     coluna existia no DB há tempos, ORM nunca atualizado), causando
     "Unconsumed column names: status" em qualquer tentativa de upsert

Fix em todas as 4 camadas:

  - jira_connector._map_sprint agora também passa `goal` adiante
  - normalize_sprint() inclui `status` (lowercase active/closed/future/None)
    + `goal` (com strip de null bytes)
  - _upsert_sprints ON CONFLICT atualiza ambos
  - _fetch_board_sprints removeu filtro de watermark (volume baixo, ~216
    total / ~5 ativas, sempre re-fetch é o correto pois sprints mudam
    estado)
  - EngSprint model adiciona `status: Mapped[str|None]` (corrige drift)

Helper _normalize_sprint_status mapeia aliases (open→active,
completed→closed, planned→future) e devolve None para valores
desconhecidos — não bucketiza silenciosamente para não corromper
Velocity / Carryover logic que precisa saber QUE sprints estão de fato
fechadas.

Validação live (ad-hoc backfill após fix):
  - closed:  187 (com goal)
  - active:    3 (com goal)
  - future:    5 (com goal)
  - vazio:   22 (board órfão 873 sem projeto ativo, fora de escopo)

Total: 195/217 = 89.9% com status correto, 70% com goal real
("Gestão de banner no backoffice de CNC e TEMPO para novas
especificações técnicas", etc.).

Tests: 26 novos em test_sprint_normalization.py (status presente,
unknown→None, aliases, goal passthrough, structural anti-regression
que o set_ block inclui status+goal). 142/142 passam.

Lição: ORM drift foi o bug mais insidioso. Coluna existia no DB há muito
tempo; só o SQLAlchemy estava desatualizado. Path que omitia status
funcionava (silently empty); path que incluía status crashava.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…isting slots

Documents 4 data-quality fixes shipped 2026-04-29 inside the structured
slots that already existed in the docs (no new files created):

metrics-inconsistencies.md:
  - INC-020 (changelog drop em _map_issue → status_transitions=[] em 311k)
  - INC-021 (story_points=0 em 100% issues — Webmotors não usa SP)
  - INC-022 (status normalization 96.5% done skew, 50+ PT-BR unmapped)
  - INC-023 (sprint status sempre vazio — 4-layer swiss cheese)
  - Status bar + P0 impact list + counts (19→23 totais, P0 7→11)

ingestion-spec.md (1226→~1850 lines):
  - §1.1 Current State — data 2026-04-29 + números pós Phase 1
  - §2.2 Webmotors env — effort method, 326 status defs, Kanban-mostly
  - §4 Problem 6 REWRITE — hybrid normalization (textual+statusCategory)
  - §4 Problems 11/12/13 NEW — changelog drop, effort heterogeneity,
        sprint 4-layer cheese (cada com causa/fix/lições genéricas)
  - §6.3.6 NEW — Effort Extraction (Deterministic Core+Discovery Fallback)
  - §7.C — 19 commits novos da feat/jira-dynamic-discovery
  - §7.D NEW — Webmotors-Discovered Patterns (training material)
  - §8.10 REWRITE — Status Normalization hybrid approach
  - §8.12 NEW — Effort Estimation field decision
  - §8.13 NEW — Sprint Status & Goal field decision

ingestion-architecture-v2.md §9:
  - status por success criterion (3 ✅ atingidos, 2 ⚠️ parciais,
    1 ❌ pendente, 1 ⏳ TBD)
  - agregado por phase (Phase 1+2-A+2-B shipped, 2.6 + 3 pending)
  - bonus data-quality fixes registrados como expansão de escopo

Captura padrões pedagógicos descobertos:
  - cache lateral vs return value anti-pattern (INC-020)
  - schema drift entre migration e ORM (INC-023)
  - swiss cheese alignment (INC-023, 4 bugs independentes)
  - hybrid textual+categorical normalization (INC-022)
  - fail-loud unknown values (effort + sprint status)
  - telemetry-via-counter (_effort_source_counts)
  - cascading data corruption (status → status_transitions → todas Lean)

Webmotors environment characteristics consolidadas como baseline de
training para futuros tenant onboardings via Ingestion Intelligence
Agent (Section 6.5). ADR-005 + ADR-014 inalterados — decisões
arquiteturais permanecem; este commit captura o aprendizado da
implementação.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lock file is per-session/per-process state (PID + sessionId), not code.
projects/ contains Claude Code's own session transcripts (JSONL files
~38MB+ each), not project data — never should be tracked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nascimentolimaandre-cloud nascimentolimaandre-cloud merged commit a09d38c into main Apr 29, 2026
4 checks passed
nascimentolimaandre-cloud pushed a commit that referenced this pull request Apr 29, 2026
…uards

Second of 5 PRs building the new-developer onboarding path. Lands the
heart of the work: a Python script that populates a clean dev DB with
~7000 rows of realistic-but-clearly-synthetic data so a fresh clone
renders a working dashboard without external credentials.

What this PR ships:

  scripts/seed_dev.py     — the seed (single file, ~700 lines)
  scripts/__init__.py     — package marker
  Dockerfile              — adds COPY scripts/ scripts/ (was missing)
  Makefile                — `make seed-dev` + `make seed-reset` targets
  tests/unit/test_seed_dev.py — 28 unit tests (guards + determinism + shape)

Data volume (default, ~3s wall time):

  - 15 squads across 4 tribes (Payments, Core Platform, Growth, Product)
  - 51 distinct repos, plausibly named (`payments-api`, `auth-service`, ...)
  - ~1900 PRs, log-normal lead-time distribution per squad
  - ~4900 issues with realistic status mix (15/20/10/55 todo/in_progress/in_review/done)
  - ~200 deploys (jenkins source, weekly cadence)
  - 60 sprints across 10 sprint-capable squads
  - 32 pre-computed metrics_snapshots (4 periods × 8 metric_names)
  - 15 jira_project_catalog entries (status=active)
  - 4 pipeline_watermarks (recent timestamps for fresh-data UI signal)

Pre-compute target: dashboard renders in <1s on first visit. The
2026-04-24 incident fixed the underlying index regression on real data;
this seed makes the same outcome reproducible in fresh environments by
inserting snapshots directly. No more 50× cold-path on first home view.

Distribution intentionally covers ALL dashboard states:

  Elite:     PAY, API
  High:      AUTH, CHK, UI
  Medium:    BILL, INFRA, MKT, MOB, RET
  Low:       OBS, SEO, CRO
  Degraded:  QA       (data sources stale)
  Empty:     DSGN     (no PRs in window — exercises empty state)

Five-layer safety (ordered cheapest first, fail-fast on any layer):

  1. CLI gate    — --confirm-local must be passed explicitly
  2. Env gate    — PULSE_ENV != production / staging / prod / stg
  3. Host gate   — DB hostname ∈ {localhost, postgres, 127.0.0.1, ::1}
  4. Tenant gate — target tenant must be 00000000-...0001 (reserved dev)
  5. Data gate   — tenant must be empty OR --reset must be set

Every inserted row has external_id prefixed with `seed_dev:` so cleanup
queries are precise (LIKE 'seed_dev:%') and contamination is detectable
(non-prefixed rows in the dev tenant = real data leaked in).

Determinism: random.Random(seed=42) by default, configurable via --seed.
Same seed produces byte-identical output. Locked by 28 unit tests.

Reset strategy:

When --reset is set, the script tries TRUNCATE first (instant) and only
falls back to DELETE WHERE tenant_id when the table has rows from OTHER
tenants. The dev box hit this: `DELETE FROM metrics_snapshots WHERE
tenant_id=...` was 21+ minutes for 7M rows because the existing index
order didn't help; TRUNCATE on a single-tenant table is sub-second.
Both paths log which strategy was used per table for transparency.

PR title format embeds Jira-style keys (`PAY-123`, `AUTH-45`) because
/pipeline/teams derives the active squad list via regex over titles.
Without that key, the endpoint returns "0 squads" even though 1900 PRs
exist — discovered during smoke test, locked in
TestPrTitleShape::test_title_contains_jira_style_key so future
template changes can't silently break /pipeline/teams.

Surface API:

  python -m scripts.seed_dev --confirm-local             # clean tenant only
  python -m scripts.seed_dev --confirm-local --reset     # wipe + seed
  python -m scripts.seed_dev --confirm-local --seed 99   # different fixture

  make seed-dev          # equivalent to first
  make seed-reset        # equivalent to second; prompts for "YES" confirmation

End-to-end validation (against the live dev DB after this PR):

  $ make seed-reset    → wipes 442k real rows in <1s, seeds fresh in ~3s
  $ make verify-dev    → all green:
       ✓ pulse-api /api/v1/health     200
       ✓ pulse-data /health           200
       ✓ GET /metrics/home            deployment_frequency = 0.31
       ✓ GET /pipeline/teams          14 squads (≥ 10 required)
       ✓ vite dev server              200
       Stack is healthy.

  $ docker compose exec -T pulse-data python -m pytest tests/unit/test_seed_dev.py -v
       28 passed in 0.22s

Tests cover:
  - All 4 pure guards (CLI flag, env, host, tenant) including param sweeps
  - Squad profile structure (15 squads, 4 tribes, archetype mix)
  - Determinism (same seed → byte-identical, different seeds → diverge)
  - PR title shape (Jira-key extractable by /pipeline/teams regex)
  - Marker prefix sanity (filterable, distinctive)

Guard 5 (data state) requires a session and is exercised by the
end-to-end smoke instead of a unit test, intentional — keeps unit
tests fast and DB-free.

Out of scope (next PRs):

  - PR #3: UI banner showing "DEV FIXTURE" when seed tenant detected
  - PR #4: `make onboard` orchestrator + backend-in-CI smoke gate (FDD-OPS-004)
           + perf budget assertions (FDD-OPS-006)
  - PR #5: Doppler overlay for optional real ingestion
  - FDD-OPS-010: --scale=large flag for perf testing (~100k PRs)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nascimentolimaandre-cloud nascimentolimaandre-cloud deleted the pr4-ingestion-v2 branch April 29, 2026 04:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant