From 0a27dc691be3be2210779bbd72dc124de36e014a Mon Sep 17 00:00:00 2001
From: "Andre.Nascimento"
Date: Thu, 23 Apr 2026 10:28:08 -0300
Subject: [PATCH 01/18] =?UTF-8?q?feat(ops):=20FDD-OPS-001=20lines=201+2=20?=
 =?UTF-8?q?=E2=80=94=20eliminate=20stale-code-in-workers=20drift?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Addresses the recurring "workers run old bytecode in memory after commits"
problem that caused 3 documented incidents in a 3-day span (16-18/04):

- 16/04: INC-001/002 throughput identical across periods (worker had
  pre-fix _PERIODS in memory)
- 17/04: Metrics zero-valued after INC-003/004 fix applied on disk
- 18/04: Lead Time card blank (tenant-wide DORA snapshot missing strict
  fields because worker was running pre-strict code)

Pattern: commit domain/service code → worker keeps running old in-memory
bytecode until an explicit `docker compose restart`. Reactive fixes cost
5-30 min each; multi-tenant SaaS (R1) would expose this as a customer
incident.

═══════════════════════════════════════════════════════════════════════════
LINE 1: Hot-reload in dev via `docker compose watch`
═══════════════════════════════════════════════════════════════════════════

Added `develop.watch` blocks to 4 Python services in pulse/docker-compose.yml:
  - pulse-data (FastAPI)
  - metrics-worker (Kafka consumer → snapshot writer)
  - sync-worker (DevLake → Kafka producer)
  - discovery-worker (Jira dynamic discovery)

Each watch block:
    action: sync+restart
    path: ./packages/pulse-data/src
    target: /app/src

Usage:
    cd pulse && docker compose watch

Any edit under packages/pulse-data/src/ triggers an automatic sync + restart
of the affected containers. Docker Compose 5.1.0 (local) supports this
natively; no plugin needed.
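The drift pattern described above (code changes on disk while the process
keeps its in-memory copy) can be reproduced outside the stack. The following
is a minimal sketch, not code from this patch: `demo_periods` and
`PERIOD_DAYS` are hypothetical stand-ins for a domain module and a constant
such as `_PERIODS`, and bytecode writing is disabled only so that two quick
rewrites of the same file cannot be masked by a cached .pyc.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def demo_stale_module() -> tuple[int, int, int]:
    """Return (value at import, value after on-disk edit, value after reload)."""
    previous_flag = sys.dont_write_bytecode
    sys.dont_write_bytecode = True  # keep .pyc caching out of the picture
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "demo_periods.py"  # hypothetical module
        source.write_text("PERIOD_DAYS = 30\n")
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            import demo_periods

            at_import = demo_periods.PERIOD_DAYS

            # Simulate a commit or `git pull`: only the file on disk changes.
            source.write_text("PERIOD_DAYS = 60\n")
            import demo_periods as same_module  # answered from sys.modules

            after_edit = same_module.PERIOD_DAYS

            # Re-executing the module body is what finally picks up the edit.
            importlib.reload(demo_periods)
            after_reload = demo_periods.PERIOD_DAYS
            return at_import, after_edit, after_reload
        finally:
            # Leave the interpreter state as we found it.
            sys.path.remove(tmp)
            sys.modules.pop("demo_periods", None)
            sys.dont_write_bytecode = previous_flag
```

Calling `demo_stale_module()` should yield the import-time value twice and
the edited value only after the reload. A container restart has the same
effect for every module at once, which is what `sync+restart` automates.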
═══════════════════════════════════════════════════════════════════════════
LINE 2: Admin force-reload (80% ROI, validated)
═══════════════════════════════════════════════════════════════════════════

POST /data/v1/admin/metrics/recalculate now calls importlib.reload() on
8 domain/service modules BEFORE running the recalculation, so the recalc
runs against the code currently on disk regardless of what the serving
process had cached (best effort: a failed reload is logged and skipped).

Modules force-reloaded:
  - src.contexts.metrics.domain.dora
  - src.contexts.metrics.domain.cycle_time
  - src.contexts.metrics.domain.lean
  - src.contexts.metrics.domain.throughput
  - src.contexts.metrics.domain.sprint
  - src.contexts.metrics.services.recalculate
  - src.contexts.metrics.services.home_on_demand
  - src.contexts.metrics.services.flow_health_on_demand

Key implementation detail: after importlib.reload() of the
...services.recalculate module, the top-level `_recalc_service` reference
still points to the OLD function object. The endpoint now re-resolves the
function via `sys.modules[...].recalculate` before calling, with a fallback
to the original import for safety.

The response of /admin/metrics/recalculate gained a
`reloaded_modules: list[str]` field; this is backward-compatible (field
added, none removed).

Validation (runtime against local stack):
  POST /data/v1/admin/metrics/recalculate?metric_type=dora&period=60d&dry_run=true
  → status: completed, duration: 170ms, reloaded_modules: [8 modules]

═══════════════════════════════════════════════════════════════════════════
WHY THIS IS 80% OF THE PROBLEM
═══════════════════════════════════════════════════════════════════════════

All 3 documented incidents had the same resolution pattern: user reports
weird numbers → operator hits /admin/metrics/recalculate. With line 2, that
same action now also reloads the fresh code, so there is no separate
"restart then recalc" dance. Line 1 covers the dev-time loop (editing code
locally).
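The stale-binding detail under LINE 2 can be illustrated in isolation. This
is a minimal sketch under stated assumptions: `demo_recalc` is a throwaway
module created for the demonstration (hypothetical, not part of the
codebase), and the lookup only mirrors the `sys.modules.get(...)` plus
`getattr(..., fallback)` shape the endpoint uses.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def demo_rebinding() -> dict[str, str]:
    """Show that a `from x import f` name survives a reload unchanged."""
    previous_flag = sys.dont_write_bytecode
    sys.dont_write_bytecode = True  # avoid stale .pyc hits on fast rewrites
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "demo_recalc.py"  # hypothetical stand-in module
        source.write_text("def recalculate():\n    return 'old'\n")
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            # Equivalent of the route module's top-level import.
            from demo_recalc import recalculate as bound_at_import

            # New code lands on disk, then the module is force-reloaded.
            source.write_text("def recalculate():\n    return 'fresh code'\n")
            importlib.reload(sys.modules["demo_recalc"])

            # Re-resolve through sys.modules, falling back to the old binding.
            module = sys.modules.get("demo_recalc")
            resolved = (
                getattr(module, "recalculate", bound_at_import)
                if module
                else bound_at_import
            )
            return {
                "stale_binding": bound_at_import(),
                "re_resolved": resolved(),
            }
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("demo_recalc", None)
            sys.dont_write_bytecode = previous_flag
```

The reload rebinds names inside the module object, but a name copied out with
`from ... import ...` keeps pointing at the old function, which is why the
lookup has to go back through the module at call time.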
Lines 3 (snapshot contract monitor + Prometheus metric) and 4 (CI/CD restart
on deploy) are the defensive perimeter for the remaining 20%, scheduled for
follow-up once the team has hardened the rollout pipeline. Tracked in
FDD-OPS-001.

═══════════════════════════════════════════════════════════════════════════
RISKS / NON-REGRESSIONS
═══════════════════════════════════════════════════════════════════════════

- Backward compat: endpoint signature unchanged; response adds 1 field
- Defensive: if importlib.reload fails on any module, logs WARN and
  continues; recalc still executes (worst case: runs with stale code,
  which was pre-existing behavior anyway)
- Only 8 pure-function modules reloaded. SQLAlchemy models, Kafka consumer,
  repositories, Pydantic schemas left intact (reloading those would break
  FastAPI validation in-flight)
- Module identity: dataclasses reconstructed per-call; no persistent
  instances cross the reload boundary. isinstance() checks stay valid

Files changed:
  pulse/docker-compose.yml
  pulse/packages/pulse-data/src/contexts/metrics/routes.py

Co-Authored-By: Claude Opus 4.6
---
 pulse/docker-compose.yml                      | 29 +++++++
 .../pulse-data/src/contexts/metrics/routes.py | 79 ++++++++++++++++++-
 2 files changed, 107 insertions(+), 1 deletion(-)

diff --git a/pulse/docker-compose.yml b/pulse/docker-compose.yml
index a01df8f..599f692 100644
--- a/pulse/docker-compose.yml
+++ b/pulse/docker-compose.yml
@@ -74,6 +74,17 @@ services:
       kafka:
         condition: service_healthy
     restart: unless-stopped
+    # FDD-OPS-001 Line 1: auto-reload the FastAPI container when Python
+    # source changes. The pulse-data Dockerfile does NOT launch uvicorn with
+    # `--reload`, so without this block a `git pull` or file edit leaves the
+    # HTTP process running stale code until a manual `docker compose restart`.
+    # `sync+restart` rewrites the files inside the container and then restarts
+    # the container so Python reimports from disk.
+    develop:
+      watch:
+        - action: sync+restart
+          path: ./packages/pulse-data/src
+          target: /app/src
 
   # --------------------------------------------------------------------------
   # Workers
@@ -116,6 +127,12 @@ services:
       retries: 3
       start_period: 60s
     restart: unless-stopped
+    # FDD-OPS-001 Line 1: hot-reload (see pulse-data block above).
+    develop:
+      watch:
+        - action: sync+restart
+          path: ./packages/pulse-data/src
+          target: /app/src
 
   metrics-worker:
     build:
@@ -141,6 +158,12 @@ services:
       retries: 3
       start_period: 60s
     restart: unless-stopped
+    # FDD-OPS-001 Line 1: hot-reload (see pulse-data block above).
+    develop:
+      watch:
+        - action: sync+restart
+          path: ./packages/pulse-data/src
+          target: /app/src
 
   discovery-worker:
     build:
@@ -167,6 +190,12 @@ services:
       redis:
         condition: service_healthy
     restart: unless-stopped
+    # FDD-OPS-001 Line 1: hot-reload (see pulse-data block above).
+    develop:
+      watch:
+        - action: sync+restart
+          path: ./packages/pulse-data/src
+          target: /app/src
 
   # --------------------------------------------------------------------------
   # Infrastructure
diff --git a/pulse/packages/pulse-data/src/contexts/metrics/routes.py b/pulse/packages/pulse-data/src/contexts/metrics/routes.py
index 3cdab97..927134f 100644
--- a/pulse/packages/pulse-data/src/contexts/metrics/routes.py
+++ b/pulse/packages/pulse-data/src/contexts/metrics/routes.py
@@ -7,8 +7,10 @@
 
 from __future__ import annotations
 
+import importlib
 import logging
 import re
+import sys
 from datetime import datetime, timedelta, timezone
 from typing import Any
 from uuid import UUID
@@ -1085,6 +1087,65 @@ async def get_flow_health(
 
 admin_router = APIRouter(prefix="/data/v1/admin/metrics", tags=["metrics-admin"])
 
+# Modules whose latest-on-disk bytecode should be picked up by every recalc.
+# Domain modules are pure functions (no global state, no singletons) so
+# `importlib.reload` is safe.
+# Service modules orchestrate the domain calls; also safe to reload because they hold no
+# in-process caches or background workers; each request constructs a fresh service call tree.
+#
+# FDD-OPS-001 (Line 2): closes the "code deployed vs runtime in memory"
+# drift gap. Python caches imported modules in `sys.modules`, so after a
+# file edit or `git pull` the worker process still executes the OLD code
+# until restart. Admin recalcs are the place where ops users already go
+# when data looks wrong, so giving them a guaranteed-fresh code path there
+# fixes 80% of the documented drift incidents without requiring a full
+# container restart.
+_RELOAD_TARGETS: tuple[str, ...] = (
+    "src.contexts.metrics.domain.dora",
+    "src.contexts.metrics.domain.cycle_time",
+    "src.contexts.metrics.domain.lean",
+    "src.contexts.metrics.domain.throughput",
+    "src.contexts.metrics.domain.sprint",
+    "src.contexts.metrics.services.recalculate",
+    "src.contexts.metrics.services.home_on_demand",
+    "src.contexts.metrics.services.flow_health_on_demand",
+)
+
+
+def _force_reload_metrics_modules() -> list[str]:
+    """Force-reload metrics domain/service modules to pick up freshest bytecode.
+
+    Python doesn't hot-reload by default; once a module is imported, subsequent
+    `import` statements return the cached version in `sys.modules`. After a
+    `git pull` or file edit, the worker process still executes the OLD module
+    version until restart. `importlib.reload()` re-executes the module body,
+    refreshing function definitions and module-level constants.
+
+    Safe for pure domain modules and stateless service modules; would NOT be
+    safe for modules that hold singletons, background workers, or mutate
+    registries at import time. The targets listed in `_RELOAD_TARGETS` have
+    been audited for this.
+
+    Returns the list of modules that were successfully reloaded. Failures are
+    logged as WARN and skipped: a partial reload is better than an aborted
+    recalc.
+    """
+    reloaded: list[str] = []
+    for mod_name in _RELOAD_TARGETS:
+        mod = sys.modules.get(mod_name)
+        if mod is None:
+            # Not yet imported; the next `import` will load fresh code anyway.
+            continue
+        try:
+            importlib.reload(mod)
+            reloaded.append(mod_name)
+        except Exception as exc:  # noqa: BLE001  # defensive: never abort recalc
+            logger.warning(
+                "[admin] importlib.reload failed for %s: %s", mod_name, exc
+            )
+    return reloaded
+
+
 def _check_admin_token(x_admin_token: str | None) -> None:
     """Validate the admin token using constant-time comparison.
 
@@ -1129,8 +1190,23 @@ async def admin_recalculate_metrics(
         tenant_id, metric_type, period, team_id, dry_run,
     )
 
+    # FDD-OPS-001 Line 2: force-reload domain/service modules so the recalc
+    # executes the freshest bytecode on disk regardless of what the worker
+    # process had cached in `sys.modules`.
+    reloaded_modules = _force_reload_metrics_modules()
+    logger.info(
+        "[admin] Force-reloaded %d metric modules: %s",
+        len(reloaded_modules), reloaded_modules,
+    )
+
+    # Re-resolve the recalculate function from the freshly reloaded module:
+    # the top-level `_recalc_service` import still points to the previous
+    # function object after `importlib.reload()`, so bypass the stale binding.
+    recalc_module = sys.modules.get("src.contexts.metrics.services.recalculate")
+    recalc_fn = getattr(recalc_module, "recalculate", _recalc_service) if recalc_module else _recalc_service
+
     try:
-        result = await _recalc_service(
+        result = await recalc_fn(
             tenant_id=tenant_id,
             metric_type=metric_type,
             period=period,
@@ -1153,4 +1229,5 @@ async def admin_recalculate_metrics(
         "snapshots_written": result.snapshots_written,
         "scanned": result.scanned,
         "errors": result.errors,
+        "reloaded_modules": reloaded_modules,
     }
From a05b37034f7c4a3b65ca23ddbc3bb62eb54a64cb Mon Sep 17 00:00:00 2001
From: "Andre.Nascimento"
Date: Thu, 23 Apr 2026 10:33:11 -0300
Subject: [PATCH 02/18] =?UTF-8?q?fix(sec):=20FDD-SEC-001=20=E2=80=94=20rej?=
 =?UTF-8?q?ect=20squad=5Fkey=20with=20invalid=20chars=20(HTTP=20422)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Security finding discovered during QW-2 test implementation
(testing-foundation-v1.0, 20/04): /metrics/home accepted squad_key with
arbitrary special characters (e.g. 'FID;DROP' returned HTTP 200). Backend
was safe from actual SQL injection thanks to sqlalchemy bindparams, but:

1. Should reject malformed input at the FastAPI validation layer, not
   silently treat it as a harmless filter
2. Defense-in-depth: catching bad input upfront reduces blast radius
3. Consistency: /pipeline/routes.py already had the correct pattern

Fix:
- Added constant `_SQUAD_KEY_PATTERN = r"^[A-Za-z][A-Za-z0-9]{1,31}$"`
  in pulse-data/src/contexts/metrics/routes.py, following the same
  convention as pipeline/routes.py
- Applied `pattern=_SQUAD_KEY_PATTERN` to the squad_key Query param on all
  7 metrics endpoints: /dora, /cycle-time, /throughput, /lean, /sprints,
  /home, /flow-health (unified the inline pattern /flow-health had)
- Regex allows 2-32 chars starting with a letter, rest alphanumeric. Covers
  every real Jira project key observed (min 2 chars per Atlassian
  convention). Rejects: FID;DROP, FID', FID UNION,