diff --git a/test-env-delay/test-env-delay-investigation.md b/test-env-delay/test-env-delay-investigation.md new file mode 100644 index 0000000..09516fa --- /dev/null +++ b/test-env-delay/test-env-delay-investigation.md @@ -0,0 +1,260 @@ +# Test Environment Delay: 11-Hour Incident Report + +**Author:** Willis Kirkham (wkirkham@mural.co) +**Date:** March 11, 2026 + +## Summary + +On March 11, 2026, the `muralmember-lastaccessdate` test environment (PR #15619, mural-api) took approximately 11 hours to reach Ready status. The environment was created at 03:11 UTC but did not become available until 14:25 UTC. During this time, the API pod was never scheduled on any cluster node — it sat in Pending/FailedScheduling state for 9.5 hours with zero API logs, zero Redis writes, and no container ever started. + +The root cause was two compounding failures. First, a series of force pushes and intermediate commits caused the in-namespace operator to cycle through 10+ image tags in 98 minutes, creating ReplicaSets whose pods could not pull images that did not yet exist in ACR. This rapid cycling inflated the pod count across the namespace. Second, the cumulative pod pressure from this namespace and all other test environment namespaces exhausted the cluster's scheduling capacity — 19 of 20 nodes were at their pod limit and 1 had insufficient CPU. With no cluster autoscaler enabled, there was no mechanism to recover. The environment finally reached Ready when an unrelated mural-api master merge (PR #15600) at 14:19 UTC triggered rolling updates across dozens of namespaces, which terminated old pods and freed scheduling slots. + +This incident exposed a systemic risk. The operator routinely deploys images before they exist in ACR, creating pods that consume scheduling slots while stuck in ImagePullBackOff. The cluster runs at capacity with no autoscaler. 
These conditions will produce similar delays again without targeted fixes to the operator's deployment logic and cluster capacity management. + +| Field | Value | +|-------|-------| +| **PR** | mural-api #15619 | +| **Branch** | `migrate/muralmember-lastaccessdate` | +| **Env name** | `muralmember-lastaccessdate` | +| **Namespace** | `tenv-muralmember-lastaccessdate` | +| **Cluster** | `testing-envs-v2-aks` | + +--- + +## Root Cause + +Two compounding issues created the 11-hour delay. + +### 1. Force Push Image Cycling (03:09--04:47 UTC, ~98 minutes) + +Between 03:09 and 04:47 UTC, the developer made multiple force pushes and intermediate commits to the branch. Each time the branch HEAD changed, the in-namespace operator detected the new SHA via effort-center and immediately patched the API deployment with the new image tag. This created 10+ ReplicaSets in 98 minutes. Most of these image tags did not exist in ACR because CI had been cancelled by the subsequent force push or had not yet completed. The pods from each ReplicaSet entered ImagePullBackOff and occupied scheduling slots without ever running. + +The operator settled on the final SHA `7b2cfa3e` at 04:47 UTC. CI published that image to ACR at 04:57 UTC — 10 minutes after the ReplicaSet was created. However, by that point, the cumulative pod count from the image cycling had contributed to saturating the cluster's scheduling capacity. + +### 2. Cluster Capacity Exhaustion (04:47--14:22 UTC, ~9.5 hours) + +Both final API pods (`api-546c59447b` and `api-7d75b456c6`) were stuck in Pending state from 04:47 UTC onward. They were never placed on any node. At 08:24 UTC, FailedScheduling events confirmed the cluster was full: + +``` +0/20 nodes are available: 19 Too many pods, 1 Insufficient cpu. +preemption: 0/20 nodes are available: 20 No preemption victims found. +``` + +A total of 19 FailedScheduling events fired at 08:24 UTC for pods in this namespace alone (api, api-eu, api-au, worker-exports, and others). 
The cluster autoscaler is disabled on `testing-envs-v2-aks`, so there was no mechanism to add capacity or free slots overnight. It is unclear why the cluster did not naturally recover capacity during the ~9 hours between 05:00 and 14:00 UTC — pods from other namespaces may have been in similar stuck states, maintaining the pressure.
+
+### Resolution
+
+At 14:19:51 UTC, mural-api PR #15600 merged to master. The operator's `getHashToDeploy()` function falls back to master HEAD when no branch override is set, so this triggered deployment patches across dozens of test environment namespaces simultaneously. The resulting wave of rolling updates terminated old pods and freed scheduling slots. Both Pending API pods were scheduled for the first time at 14:22 UTC:
+
+- `api-7d75b456c6-7zm6z` landed on `vmss000007` and immediately hit ImagePullBackOff (the `0b19eca1` image never existed in ACR).
+- `api-546c59447b-qq74z` landed on `vmss00000a`, pulled the `7b2cfa3e` image instantly, and started the API container.
+
+The API served its first `GET /ping 200` at 14:23:20 UTC. The backing operator's health check passed, and the environment was marked Ready at 14:25 UTC.
+
+### Condensed Timeline
+
+| Time (UTC) | Event |
+|------------|-------|
+| 03:09 | Namespace created, operator bootstraps with `7af064b3` (image does not exist in ACR) |
+| 03:37 | Force push #1 -- operator begins cycling through SHAs |
+| 04:45 | Force push #2 |
+| 04:47 | Operator settles on final SHA `7b2cfa3e` (10th+ ReplicaSet) |
+| 04:57 | CI publishes `7b2cfa3e` image to ACR |
+| 08:24 | FailedScheduling events confirm cluster full: `19 Too many pods, 1 Insufficient cpu` |
+| 14:19 | mural-api PR #15600 merges to master, triggering mass deployment patches |
+| 14:22 | Both API pods scheduled for the first time; `7b2cfa3e` pod pulls image successfully |
+| 14:23 | API starts, health check passes |
+| 14:25 | Env marked Ready |
+
+### Open Questions
+
+The overnight capacity situation remains unclear.
With 19 of 20 nodes at their pod limit and no autoscaler, there was no automatic recovery mechanism. Old stuck pods from other namespaces may have maintained the pressure, but we lack the audit logs to confirm this — the AKS cluster has zero diagnostic settings configured. + +An effort-center Redis outage occurred at approximately 13:43--13:56 UTC (80 timeout errors). Its relationship to the 14:22 recovery events is uncertain. + +--- + +## Recommendations + +1. **Verify image existence before deploying.** The operator currently patches deployments with a new image tag the moment it detects a new branch HEAD SHA, regardless of whether CI has finished building. This guarantees ImagePullBackOff for every intermediate commit, creating pods that consume scheduling slots without ever running. The operator should check ACR for the image tag before patching the deployment. This is the highest-impact fix because it directly reduces both the image cycling problem and the cluster capacity pressure. + +2. **Limit ReplicaSet churn from rapid SHA changes.** During this incident, 10+ ReplicaSets were created in 98 minutes. Each created new pods that occupied scheduling slots even when they could not pull their images. The operator should debounce deployment patches (wait N seconds after detecting a new SHA before patching) or clean up old ReplicaSets and pods when a newer SHA arrives. + +3. **Address cluster capacity.** The cluster was saturated overnight and again at 14:22 UTC when the master merge triggered mass deployment patches. With autoscaling disabled, there is no safety valve. Options include enabling the cluster autoscaler, increasing `maxPods` per node, adding more nodes, or reducing per-namespace pod counts (for example, skipping multi-geo pods that are not needed in test environments). + +4. **Enable Kubernetes audit logging.** The AKS cluster `testing-envs-v2-aks` has zero diagnostic settings configured. 
No `kube-audit` or `kube-audit-admin` logs are being collected. Enabling `kube-audit-admin` (write operations only, lower volume) would capture all pod deletions, evictions, and deployment patches. This requires adding an `azurerm_monitor_diagnostic_setting` resource to `terraform-mural-testing-envs/aks.tf` pointing to a Log Analytics workspace. + +5. **Improve kubelet event collection in Datadog.** Pod-level events such as ImagePullBackOff, CrashLoopBackOff, and FailedScheduling from the kubelet are only partially visible in Datadog today. More comprehensive collection would make diagnosing scheduling and image pull issues significantly faster. + +6. **Detect FailedScheduling and alert.** The operator has no logic for detecting stuck pods. If a pod remains in Pending/FailedScheduling for more than N minutes, the operator could alert or take remediation action such as cleaning up old ReplicaSets to free scheduling slots. The `triggerRollout()` method exists but is only invoked for secret override changes. + +--- + +## Detailed Timeline + +All times are UTC. + +| Time (UTC) | Event | +|------------|-------| +| Mar 11 ~03:07 | PR #15619 opened. Branch HEAD = `7af064b3` | +| Mar 11 ~03:09 | Bootstrap creates namespace. `access-initialization` Job + `seeder` Job created. Seeder completes. | +| Mar 11 ~03:11 | Env marked **Created**. Operator bootstraps with image tag `7af064b3`. | +| Mar 11 ~03:29 | CI run starts for original commit | +| Mar 11 03:37 | **Force push #1** -> `e584b0f`. CI cancelled. 
| +| Mar 11 03:38 | Operator patches api deployment -> `e584b0f` | +| Mar 11 03:40 | Operator patches api -> `c01f5c7d` (fixup commit, brief HEAD) | +| Mar 11 04:12 | Operator patches api -> `88e72b07` (fixup commit) | +| Mar 11 04:22 | Operator patches api -> `7cac457a` (fixup commit) | +| Mar 11 04:35 | Operator patches api -> `7f59a4ff` (pre-force-push #2) | +| Mar 11 04:45 | **Force push #2** -> `0b19eca` | +| Mar 11 04:46 | Operator patches api -> `0b19eca` | +| Mar 11 04:47 | Operator patches api -> `7b2cfa3` (fixup commit pushed to branch) | +| Mar 11 04:57 | CI publishes `7b2cfa3` image to `mural.azurecr.io` | +| Mar 11 05:03 | CI run fully completes | +| *~9.5 hour gap* | API pods in Pending/FailedScheduling. No API logs, no Redis writes. | +| Mar 11 08:24 | FailedScheduling events confirm `0/20 nodes available: 19 Too many pods, 1 Insufficient cpu` | +| Mar 11 14:19 | mural-api PR #15600 merges to master (`a3bac2d730`) | +| Mar 11 14:22 | Both API pods scheduled for the first time. `api-546c59447b-qq74z` -> `vmss00000a`, image pulled successfully. `api-7d75b456c6-7zm6z` -> `vmss000007`, ImagePullBackOff (`0b19eca1` not in ACR). | +| Mar 11 14:23 | API starts, serves `GET /ping 200` | +| Mar 11 14:25 | Env marked **Ready** | +| Mar 11 15:27 | Operator patches api -> `162ff979` (new push by kevinmckinley-mural) | +| Mar 11 16:35 | Operator patches api -> `04e14761` (force push #3) | +| Mar 11 19:23 | Operator patches api -> `084ff2a6` | +| Mar 11 20:29 | Operator patches api -> `68366513` | + +### ReplicaSet History + +``` +api-5d5dcd6d99 2026-03-11T03:38:08Z mural.azurecr.io/tactivos/mural-api:e584b0f6... +api-6776dc65b 2026-03-11T03:40:07Z mural.azurecr.io/tactivos/mural-api:c01f5c7d... +api-87b6d8477 2026-03-11T04:12:07Z mural.azurecr.io/tactivos/mural-api:88e72b07... +api-5444f5cdf7 2026-03-11T04:22:07Z mural.azurecr.io/tactivos/mural-api:7cac457a... +api-988db68fb 2026-03-11T04:35:06Z mural.azurecr.io/tactivos/mural-api:7f59a4ff... 
+api-7d75b456c6 2026-03-11T04:46:06Z mural.azurecr.io/tactivos/mural-api:0b19eca1... +api-546c59447b 2026-03-11T04:47:08Z mural.azurecr.io/tactivos/mural-api:7b2cfa3e... +api-b84fb67b9 2026-03-11T15:27:06Z mural.azurecr.io/tactivos/mural-api:162ff979... +api-5bfcc999bc 2026-03-11T16:35:06Z mural.azurecr.io/tactivos/mural-api:04e14761... +api-5d4b685 2026-03-11T19:23:07Z mural.azurecr.io/tactivos/mural-api:084ff2a6... +api-64c94cd95b 2026-03-11T20:29:07Z mural.azurecr.io/tactivos/mural-api:68366513... +``` + +### Deployment-Controller Scale-Up Events (from Datadog) + +``` +03:09:57 Scale up api-86769d55ff (initial bootstrap) +03:17:24 Scale up api-58b7559868 (2nd SHA) +03:30:08 Scale up api-5c9994599c (3rd SHA) +03:38:08 Scale up api-5d5dcd6d99 (e584b0f — force push #1) +03:40:07 Scale up api-6776dc65b (c01f5c7d) +04:12:07 Scale up api-87b6d8477 (88e72b07) +04:22:07 Scale up api-5444f5cdf7 (7cac457a) +04:35:06 Scale up api-988db68fb (7f59a4ff) +04:46:06 Scale up api-7d75b456c6 (0b19eca1) +04:47:08 Scale up api-546c59447b (7b2cfa3e) — FINAL +``` + +--- + +## Appendices + +### Appendix A: ACR Image Tag Verification + +| SHA | Exists in `mural.azurecr.io`? | Created (UTC) | Notes | +|-----|-------------------------------|---------------|-------| +| `7af064b3` | No | -- | Original bootstrap SHA. Never published (CI cancelled). `access-initialization` Job stuck on this for 19h. | +| `e584b0f` | Not checked | -- | Force push #1 SHA. CI cancelled. | +| `0b19eca` | No | -- | Force push #2 HEAD. CI cancelled ~33s after start. | +| `7b2cfa3` | Yes | 2026-03-11T04:57:42Z | Successful CI run. Tags: `7b2cfa3...`, `tree-16f2c05d...` | +| `162ff97` | Yes | Not checked | Next morning push. Published by successful CI run. 
| + +### Appendix B: Operator Architecture + +**Backing namespace operator** (`backing` namespace): +- Controllers: TestEnvsBootstrap, TestEnvsDeprovision, TestEnvsOverrides +- Runs `reconcile()` periodically, checking `urlAvailable()` for all "Created" environments +- `urlAvailable()` makes a `superagent.get(url).timeout({ deadline: 5000, response: 5000 })` call +- If the URL responds, the environment is marked Ready. Otherwise, no action is taken until the next reconcile cycle. +- Has no dependency on Job completion -- only checks URL accessibility. + +**In-namespace operator** (`tenv-muralmember-lastaccessdate` namespace): +- Controllers: Platform, Frontend, Configuration +- Runs every ~40 seconds +- Uses `getHashToDeploy()` with effort-center as primary source and GitHub as fallback +- Patches deployments when branch HEAD changes + +**effort-center:** +- Polls GitHub GraphQL every ~30 seconds for branch HEAD commit SHAs +- Returns `ref.target.oid` (branch HEAD commit SHA) + +**Health check URL:** +The readiness check URL is `https://muralmember-lastaccessdate.mural.engineering/` (domain hardcoded in `config/defaults.json`). This is an external URL that routes through DNS, the external ingress controller, and TLS termination before reaching the API pod. 
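The hash-resolution order described above, effort-center first with GitHub as fallback, and master HEAD when no branch override is set, can be sketched as follows. Only the `getHashToDeploy()` name and the fallback chain come from the operator; the `HashSource` interface and every other name here are hypothetical illustration, not the operator's actual code.

```typescript
// Hypothetical sketch of the hash-resolution order described in Appendix B.
// Only the getHashToDeploy() name and the fallback chain are from the source;
// the HashSource interface and all other names are invented for illustration.
interface HashSource {
  // Returns the branch HEAD commit SHA, or undefined if the branch is unknown.
  branchHead(branch: string): string | undefined;
}

function getHashToDeploy(
  effortCenter: HashSource,
  github: HashSource,
  branchOverride?: string
): string | undefined {
  if (branchOverride) {
    // effort-center is the primary source; GitHub is the fallback.
    return (
      effortCenter.branchHead(branchOverride) ??
      github.branchHead(branchOverride)
    );
  }
  // No branch override: fall back to master HEAD. This is why the PR #15600
  // master merge patched dozens of test environment namespaces at once.
  return effortCenter.branchHead("master") ?? github.branchHead("master");
}
```

The master fallback in the final branch is the mechanism by which an unrelated master merge became a cluster-wide wave of rolling updates at 14:19 UTC.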
+
+### Appendix C: Image Publishing Pipeline
+
+The CI pipeline uses two registries in two phases:
+
+**Phase 1 -- Build** (target: `muralci.azurecr.io`):
+- Images tagged with `{run_id}-{run_number}` and branch name
+- All 4 production builds must complete (api, worker, migrations, 8ball)
+
+**Phase 2 -- Publish** (target: `mural.azurecr.io`):
+- Takes images from `muralci.azurecr.io` and retags them for `mural.azurecr.io`
+- Tags applied: branch name, commit SHA (`pull_request.head.sha` for PRs), tree hash
+- Requires all 4 builds to succeed -- if any is cancelled, publish is skipped
+- The retag shortcut (`tree-image-exists`) only works for `push`/`workflow_dispatch` events
+
+### Appendix D: Datadog Log Excerpt
+
+Interlaced API and Redis logs from the incident period (times in UTC):
+
+```
+03:24:07 [redis] Redis is starting... Ready to accept connections (primary Redis up)
+03:34:07 [mural-api] ECONNREFUSED 10.43.190.15:6379 (redis-eu, not primary)
+03:34:58 [mural-api] ECONNREFUSED 10.43.190.15:6379
+03:35:50 [mural-api] ECONNREFUSED 10.43.190.15:6379
+03:36:56 [mural-api] ECONNREFUSED 10.43.190.15:6379
+03:37:38 [mural-api] ECONNREFUSED 10.43.190.15:6379 (last API log)
+03:39:11 [redis] Redis is starting... Ready to accept connections (redis-eu starts)
+04:24:08 [redis] 1 changes in 3600 seconds. Saving... (primary hourly save)
+04:39:12 [redis] 1 changes in 3600 seconds. Saving... (redis-eu hourly save)
+======== 9.5 HOUR GAP — NO LOGS FROM ANY CONTAINER ========
+14:23:19 [redis] 1 changes in 3600 seconds. Saving... (first redis log after the 9.5-hour gap)
+14:23:20 [mural-api] statsd: count data.readps (API starts)
+14:23:20 [mural-api] GET /ping 200 124.942 ms (health check passes)
+```
+
+The ECONNREFUSED errors targeted `10.43.190.15` (the `redis-eu` ClusterIP). The redis-eu pod did not start until 03:39 -- after the errors. These are non-fatal startup noise from the multigeo configuration. Primary Redis (`10.43.46.37`) was healthy the entire time.
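Recommendation 1 targets the window this pipeline creates between SHA detection and publish: the operator patches deployments as soon as a new SHA appears, before Phase 2 has pushed the tag to `mural.azurecr.io`. A minimal sketch of the proposed check follows; every name here is hypothetical, and the registry lookup is injected so the decision logic stands alone (a real lookup would be an authenticated HEAD request against the registry's v2 manifest endpoint).

```typescript
// Hypothetical sketch of recommendation 1: keep the current image until the
// new tag actually exists in the registry. The lookup is injected (and
// synchronous) for illustration; nothing here is the operator's real code.
type TagLookup = (repo: string, tag: string) => boolean;

function resolveDeployableTag(
  tagExists: TagLookup,
  repo: string,
  detectedSha: string,
  currentSha: string
): string {
  if (tagExists(repo, detectedSha)) {
    return detectedSha; // CI has published the image; safe to patch.
  }
  // Not published yet (CI still running, or cancelled by a force push):
  // keep the last good image instead of creating an ImagePullBackOff pod
  // that occupies a scheduling slot.
  return currentSha;
}
```

Under this scheme, the `0b19eca1` patch at 04:46 would have been a no-op, and the `7b2cfa3e` patch would simply have been deferred from 04:47 until the 04:57 publish.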
+ +### Appendix E: Evidence Summary + +| Evidence | Source | What it proves | +|----------|--------|----------------| +| 10 FailedScheduling events at 08:24 UTC (5 per api RS) | Datadog Events | Both API pods Pending all night -- never scheduled on any node | +| 19 total FailedScheduling events at 08:24 UTC in namespace | Datadog Events | Cluster full: api, api-eu, api-au, worker-exports all unschedulable | +| `0/20 nodes: 19 Too many pods, 1 Insufficient cpu` | FailedScheduling event text | Cluster capacity exhausted -- not a node-specific issue | +| Kubelet event at 14:22: `Failed to pull 0b19eca1` on vmss000007 | Datadog Events | Pod `7d75b456c6` scheduled for the first time at 14:22 (previously Pending) | +| Containerd event at 14:22:27: `api-546c59447b-qq74z` on vmss00000a | Datadog Events | Pod `546c59447b` scheduled for the first time at 14:22, pulled successfully | +| mural-api PR #15600 merged to master at 14:19:51 UTC | GitHub API | Master merge triggered mass deployment patches, freeing scheduling slots | +| `integrations-58f47d478` RS created 14:22:07 in namespace | kubectl RS list | Confirms deployments were patched as part of master merge wave | +| Operator code: `getHashToDeploy()` falls back to master | Source code | Test envs without a branch override get master HEAD -- confirms mass patch mechanism | +| effort-center Redis timeouts ~13:43--13:56 UTC (80 errors) | Datadog Logs | effort-center degraded ~30 min before recovery. Relationship to 14:22 events uncertain. 
| +| FailedScheduling at 14:22: `19 Too many pods, 2 Insufficient cpu` | Datadog Events | Mass deployment patches from master merge saturated cluster | +| Zero API logs for 9.5 hours | Datadog Logs | Container never started -- pod was never placed on a node | +| Zero Redis writes for 9.5 hours | Datadog Logs | No API client ever connected | +| Image `7b2cfa3e` in ACR since 04:57 UTC | ACR query | Image was pullable -- scheduling, not image availability, was the problem | + +### Appendix F: Investigation Checklist + +``` +K8s ReplicaSet history -> complete deployment timeline (10+ image changes) +Datadog FailedScheduling -> both API pods Pending all night (never scheduled) +Datadog containerd event -> first successful schedule on vmss00000a at 14:22 +Datadog kubelet event -> 7d75b456c6 first scheduled on vmss000007 at 14:22 (ImagePullBackOff) +Datadog API/Redis logs -> confirmed 9.5h gap (pod was never on a node) +ACR tags -> image available since 04:57 UTC +Cluster autoscaling -> DISABLED (no capacity relief mechanism) +Operator code review -> no stuck-pod detection logic exists +Current state -> ImagePullBackOff recurring for latest image (systemic) +mural-api master merge -> 14:19 merge triggered mass patches, freed slots at 14:22 +K8s audit logs -> NOT collected (zero diagnostic settings on AKS cluster) +```
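The missing stuck-pod detection noted in the checklist (recommendation 6) could start as a simple Pending-age filter run on each reconcile cycle. The `PodInfo` shape, the 15-minute threshold, and the function name below are assumptions for illustration; the real operator would read pod status from the Kubernetes API and could escalate to ReplicaSet cleanup or `triggerRollout()`.

```typescript
// Hypothetical sketch of recommendation 6: flag pods that have sat in
// Pending longer than a threshold. The PodInfo shape and the 15-minute
// threshold are assumptions, not existing operator code.
interface PodInfo {
  name: string;
  phase: "Pending" | "Running" | "Succeeded" | "Failed";
  createdAt: Date;
}

const STUCK_THRESHOLD_MS = 15 * 60 * 1000;

function findStuckPods(pods: PodInfo[], now: Date): PodInfo[] {
  return pods.filter(
    (p) =>
      p.phase === "Pending" &&
      now.getTime() - p.createdAt.getTime() > STUCK_THRESHOLD_MS
  );
}
```

Run alongside the existing ~40-second reconcile loop, a check like this would have surfaced the Pending API pods within minutes rather than after 9.5 hours.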