[Stability] Messaging E2E: chromium-msg shard fails on real-time-delivery cross-window propagation #57

@TortoiseWolfe

Description

Summary

The chromium-msg 1/1 shard of E2E Tests failed on main commit `00edd23`. Investigation of the merged-blob report (artifact ID 6651581858) shows one hard failure plus three flaky retries, all in the messaging-realtime surface. None of them are caused by the three PRs that just merged (#54 RLS test fix, #55 docs, #56 AdminGate test) — the failure surface is tests/e2e/messaging/, which those PRs do not touch.

This issue tracks it as a real symptom rather than dismissing it as flake. STATUS.md already calls this surface out as the dominant flake hotspot ("9 rounds of mitigation. Cause: stale closures, unstable hook refs, hydration timing"). The retry helpers in `real-time-delivery.spec.ts` have already been bumped from 3 attempts to 5 (a 5 × 60s = 5-minute budget per assertion), and they are still hitting the wall.

Hard failure

Test: `tests/e2e/messaging/real-time-delivery.spec.ts:342` — "Real-time Message Delivery (T098) > should show delivery status (sent → delivered → read)"

Error: `expect(locator).toBeVisible()` failed after 60s.

  • Locator: `getByText(testMessage)` on page2
  • Expected: visible
  • Received: `<element(s) not found>`

Where it fails: line 85 of the `waitForMessageOnPage2` helper. The helper navigates and reloads page2 up to 5 times, waiting for the message to appear in the second window; after 5 attempts (~5 min) it gives up. Both runs (the initial run and Playwright's retry-on-failure, each logged as `attempt 0`) failed at the same line.
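
For reference, the helper's retry shape can be sketched as follows. This is a hedged reconstruction, not the repo's actual code: the function name, parameters, and poll interval are all illustrative, assuming a fixed per-attempt budget and a page reload between attempts.

```typescript
// Hypothetical reconstruction of the reload-and-poll pattern described above.
// Every name and timing constant here is illustrative, not copied from the repo.
async function retryWithReload<T>(
  attempts: number,
  perAttemptMs: number,
  check: () => Promise<T | null>,  // e.g. query page2 for the message text
  reload: () => Promise<void>,     // e.g. page2.reload() between attempts
): Promise<T> {
  for (let attempt = 0; attempt < attempts; attempt++) {
    const deadline = Date.now() + perAttemptMs;
    while (Date.now() < deadline) {
      const found = await check();
      if (found !== null) return found;
      await new Promise((r) => setTimeout(r, 250)); // poll, don't busy-spin
    }
    await reload(); // fresh page load is the last resort for each attempt
  }
  throw new Error(`message not found after ${attempts} attempts`);
}
```

With `attempts = 5` and `perAttemptMs = 60_000` this reproduces the ~5-minute ceiling seen in the report. Note that a reload between attempts also tears down and recreates the realtime subscription, so the retry itself interacts with any subscribe-timing bug.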

This is the real-time subscription on page2 not receiving the message that page1 sent. Candidate causes:

  1. Supabase Realtime broadcast didn't deliver to the second client's subscription
  2. The subscription was unsubscribed between page1's send and page2's render
  3. The auth session on page2 lost permission to read the conversation mid-test
  4. Hydration timing on page2 missed the message that arrived before subscribe()
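
Hypothesis 4 has a standard mitigation: subscribe first, buffer live events, then fetch the backlog and merge with de-duplication. A minimal sketch of the merge step, with all type and function names invented for illustration (not the app's actual code):

```typescript
type Msg = { id: string; body: string; created_at: string };

// Merge the history fetch with events buffered while the fetch was in flight.
// A message can legitimately arrive on both paths, so de-dupe by id and
// re-sort so arrival order cannot reorder the conversation.
function mergeBacklogAndLive(backlog: Msg[], buffered: Msg[]): Msg[] {
  const seen = new Set<string>();
  const merged: Msg[] = [];
  for (const m of [...backlog, ...buffered]) {
    if (!seen.has(m.id)) {
      seen.add(m.id);
      merged.push(m);
    }
  }
  return merged.sort((a, b) => a.created_at.localeCompare(b.created_at));
}
```

If page2 instead fetches history before subscribing, any INSERT that lands in that gap is silently lost, which would produce exactly the `<element(s) not found>` symptom above.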

Flaky (passed on retry)

  • `message-delete-placeholder.spec.ts:341`: "should show [Message deleted] placeholder and preserve adjacent messages" (initial fail, passed on retry)
  • `message-editing.spec.ts:337` (fails at line 394): "T115: should edit message within 15-minute window" (initial fail, passed on retry)
  • `real-time-delivery.spec.ts:310` (fails at line 85): "should deliver message in <500ms between two windows" (initial fail, passed on retry)

The fact that two of the three flaky failures are at line 85 (same helper) and the hard failure is also at line 85 strongly suggests one root cause across all four: cross-window realtime propagation under CI.

What's been ruled out

The recent merges to main (#54 RLS test fix, #55 docs, #56 AdminGate test) do not touch this surface; none of them modify anything under tests/e2e/messaging/.

The previous main commit (`62f8a40`) had a green E2E run, so the hard failure either appeared recently or is intermittent enough to slip through. Given STATUS.md's flake history, intermittent is more likely than regression-from-merge.

Plan

  1. Confirm flake-vs-regression by rerunning the failed shard once the in-flight run completes. If rerun is green, this is in the existing flake budget. If rerun is also red, escalate to root-cause analysis.
  2. Pull the actual trace.zip for the specific failure (the merged blob doesn't include per-test traces; need the raw shard artifact). Inspect the page2 timeline: does Realtime ever connect? Does it receive the INSERT event? Does it filter the event out?
  3. Cross-reference `useConversationRealtime.ts` and `useTypingIndicator.ts` — STATUS.md / tracking-doc lists this as the resolved-but-watch-for-regression hotspot. Verify the `useMemo(() => createClient(), [])` wrapper is still in place and the subscription teardown isn't leaking between the two test windows.
  4. Decide whether the helper's 5-attempt × 60s retry budget is the right answer or whether it's masking a real subscription bug. The comment at lines 72-76 implies the bump from 3 to 5 attempts was already a mitigation, not a fix; we may have hit the wall on what retries can absorb.
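
For step 3, the invariant to verify can be modeled simply: at most one live channel per conversation per window, removed on effect cleanup. A toy model of that contract (nothing here is the repo's actual hook; all names are invented for illustration):

```typescript
type Handler = (payload: unknown) => void;

// Toy model of the subscribe/teardown contract the realtime hook is expected
// to honor: the latest subscriber wins, and cleanup removes only itself.
class ChannelBook {
  private live = new Map<string, Handler>();

  subscribe(conversationId: string, handler: Handler): () => void {
    this.live.set(conversationId, handler); // latest subscriber wins
    return () => {
      // effect cleanup: remove only if we are still the registered handler,
      // so a teardown racing a resubscribe cannot delete the newer channel
      if (this.live.get(conversationId) === handler) {
        this.live.delete(conversationId);
      }
    };
  }

  emit(conversationId: string, payload: unknown): boolean {
    const h = this.live.get(conversationId);
    if (h === undefined) return false; // hypothesis 2: no listener at delivery time
    h(payload);
    return true;
  }

  get liveCount(): number {
    return this.live.size;
  }
}
```

In this model, `emit` returning false is hypothesis 2 in miniature: the send happened while no handler was registered. That window (between teardown and resubscribe) is what to look for in the page2 trace timeline.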

Acceptance

Either:

  • (a) The rerun is green AND the hard failure is reproducible <30% of the time AND we add a Watch entry in docs/STABILITY-TRACKING.md Family A. Stays open as a watch.
  • (b) Root cause identified, fix landed, hard failure reproduces 0/10 reruns. Closes.

Metadata

Labels: `bug` (Something isn't working), `gap-audit` (identified during the 2026-04-25 planned-vs-shipped audit), `priority:p1` (High, fix soon: stability hotspot, low-hanging fruit, single-decision unlocks)
