From b43b9cb2c37847e22c19672d4b166b62212ae50f Mon Sep 17 00:00:00 2001
From: TurtleWolfe
Date: Mon, 27 Apr 2026 10:44:52 +0000
Subject: [PATCH] ci(e2e): fail loud when wait-on times out, capture serve diagnostics
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

CI run 24970006226 on commit 00edd23 had 14/26 jobs fail. All 7 firefox
shards and all 6 webkit-gen shards failed with NS_ERROR_CONNECTION_REFUSED
to http://localhost:3000/account in their first beforeEach. The static
server (npx serve out -l 3000) was not responding when Playwright tried
to connect, but the workflow ran the tests anyway.

Root cause of the silent failure: the prior 'Start server' step ran
'npx serve ... &', 'sleep 5', then 'npx wait-on ... --timeout 60000'.
Although GitHub Actions runs bash steps with '-e', wait-on's failure was
not reliably propagating as a step error, likely because of an
interaction between '-e' and the backgrounded job. So the tests ran,
every test cascaded to ECONNREFUSED, and 50 minutes of CI per shard
produced no actionable signal beyond 'connection refused.'

Fix: replace all 6 identical 'Start server' blocks with a defensive
version that:

1. Captures serve's PID and redirects its output to serve.log
2. Verifies serve didn't exit immediately
3. Explicitly checks wait-on's exit code via 'if ! ... ; then ...'
4. On failure, dumps serve.log, listening sockets (ss/netstat), serve
   process state (ps), and a direct curl probe, then exits 1
5. On success, prints a confirmation line for grep-friendly logs

Affected jobs: smoke, rate-limiting, auth-setup, e2e (chromium-gen +
chromium-msg), e2e-firefox, e2e-webkit. Six identical blocks updated in
place; preserves all existing 'env: CI: true' attachments on the
e2e/firefox/webkit jobs.

After this lands, the next firefox/webkit cascade will fail in ~90s with
captured diagnostics pointing at the actual root cause, instead of a
50-minute silent ECONNREFUSED storm.
That signal will tell us whether to bump the wait-on timeout, switch the
static server, or fix something else entirely. Currently the cascade
noise drowns out the cause.

Out of scope:
- Cross-shard test-user collisions in messaging shards (#50, #57)
- The webkit-gen test failures themselves (theme-switching, etc.); those
  are separate investigations once we know whether they're real or a
  cascade symptom of the serve-died problem this fixes.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .github/workflows/e2e.yml | 210 ++++++++++++++++++++++++++++++++++----
 1 file changed, 192 insertions(+), 18 deletions(-)

diff --git a/.github/workflows/e2e.yml b/.github/workflows/e2e.yml
index 98d60e77..c77482ca 100644
--- a/.github/workflows/e2e.yml
+++ b/.github/workflows/e2e.yml
@@ -111,9 +111,38 @@ jobs:
 
       - name: Start server
         run: |
-          npx serve out -l 3000 &
-          sleep 5
-          npx wait-on http://localhost:3000 --timeout 60000
+          set -euo pipefail
+          npx serve out -l 3000 > serve.log 2>&1 &
+          SERVE_PID=$!
+          echo "serve started, PID=$SERVE_PID"
+
+          # Brief grace period for the child to spawn; wait-on does the real polling.
+          sleep 2
+
+          # Verify serve didn't already exit (port in use, missing out/, etc.).
+          if ! kill -0 "$SERVE_PID" 2>/dev/null; then
+            echo "::error::serve exited immediately after launch"
+            echo "--- serve.log ---"
+            cat serve.log || true
+            exit 1
+          fi
+
+          # Explicit exit-code propagation. Don't rely on -e interacting cleanly
+          # with backgrounded jobs.
+          if ! npx wait-on http://localhost:3000 --timeout 60000 --verbose; then
+            echo "::error::wait-on timed out: serve never responded on port 3000 within 60s"
+            echo "--- serve.log ---"
+            cat serve.log || true
+            echo "--- listening sockets ---"
+            ss -tlnp 2>/dev/null || netstat -tlnp 2>/dev/null || true
+            echo "--- serve process state ---"
+            ps -fp "$SERVE_PID" 2>/dev/null || echo "serve process $SERVE_PID is gone"
+            echo "--- direct curl attempt ---"
+            curl -v --max-time 5 http://localhost:3000/ 2>&1 || true
+            exit 1
+          fi
+
+          echo "serve is responding on http://localhost:3000"
 
       - name: Run smoke tests (sign-up only - auth tests run after auth-setup)
         run: |
@@ -183,9 +212,38 @@ jobs:
 
       - name: Start server
         run: |
-          npx serve out -l 3000 &
-          sleep 5
-          npx wait-on http://localhost:3000 --timeout 60000
+          set -euo pipefail
+          npx serve out -l 3000 > serve.log 2>&1 &
+          SERVE_PID=$!
+          echo "serve started, PID=$SERVE_PID"
+
+          # Brief grace period for the child to spawn; wait-on does the real polling.
+          sleep 2
+
+          # Verify serve didn't already exit (port in use, missing out/, etc.).
+          if ! kill -0 "$SERVE_PID" 2>/dev/null; then
+            echo "::error::serve exited immediately after launch"
+            echo "--- serve.log ---"
+            cat serve.log || true
+            exit 1
+          fi
+
+          # Explicit exit-code propagation. Don't rely on -e interacting cleanly
+          # with backgrounded jobs.
+          if ! npx wait-on http://localhost:3000 --timeout 60000 --verbose; then
+            echo "::error::wait-on timed out: serve never responded on port 3000 within 60s"
+            echo "--- serve.log ---"
+            cat serve.log || true
+            echo "--- listening sockets ---"
+            ss -tlnp 2>/dev/null || netstat -tlnp 2>/dev/null || true
+            echo "--- serve process state ---"
+            ps -fp "$SERVE_PID" 2>/dev/null || echo "serve process $SERVE_PID is gone"
+            echo "--- direct curl attempt ---"
+            curl -v --max-time 5 http://localhost:3000/ 2>&1 || true
+            exit 1
+          fi
+
+          echo "serve is responding on http://localhost:3000"
 
       - name: Run rate-limiting tests (ordered)
         run: pnpm test:e2e --project=rate-limiting --project=brute-force --project=signup --reporter=list --trace=on-first-retry
@@ -245,9 +303,38 @@ jobs:
 
       - name: Start server
         run: |
-          npx serve out -l 3000 &
-          sleep 5
-          npx wait-on http://localhost:3000 --timeout 60000
+          set -euo pipefail
+          npx serve out -l 3000 > serve.log 2>&1 &
+          SERVE_PID=$!
+          echo "serve started, PID=$SERVE_PID"
+
+          # Brief grace period for the child to spawn; wait-on does the real polling.
+          sleep 2
+
+          # Verify serve didn't already exit (port in use, missing out/, etc.).
+          if ! kill -0 "$SERVE_PID" 2>/dev/null; then
+            echo "::error::serve exited immediately after launch"
+            echo "--- serve.log ---"
+            cat serve.log || true
+            exit 1
+          fi
+
+          # Explicit exit-code propagation. Don't rely on -e interacting cleanly
+          # with backgrounded jobs.
+          if ! npx wait-on http://localhost:3000 --timeout 60000 --verbose; then
+            echo "::error::wait-on timed out: serve never responded on port 3000 within 60s"
+            echo "--- serve.log ---"
+            cat serve.log || true
+            echo "--- listening sockets ---"
+            ss -tlnp 2>/dev/null || netstat -tlnp 2>/dev/null || true
+            echo "--- serve process state ---"
+            ps -fp "$SERVE_PID" 2>/dev/null || echo "serve process $SERVE_PID is gone"
+            echo "--- direct curl attempt ---"
+            curl -v --max-time 5 http://localhost:3000/ 2>&1 || true
+            exit 1
+          fi
+
+          echo "serve is responding on http://localhost:3000"
 
       - name: Run auth setup
         run: pnpm exec playwright test --project=setup --reporter=list --timeout=180000
@@ -352,9 +439,38 @@ jobs:
 
       - name: Start server
         run: |
-          npx serve out -l 3000 &
-          sleep 5
-          npx wait-on http://localhost:3000 --timeout 60000
+          set -euo pipefail
+          npx serve out -l 3000 > serve.log 2>&1 &
+          SERVE_PID=$!
+          echo "serve started, PID=$SERVE_PID"
+
+          # Brief grace period for the child to spawn; wait-on does the real polling.
+          sleep 2
+
+          # Verify serve didn't already exit (port in use, missing out/, etc.).
+          if ! kill -0 "$SERVE_PID" 2>/dev/null; then
+            echo "::error::serve exited immediately after launch"
+            echo "--- serve.log ---"
+            cat serve.log || true
+            exit 1
+          fi
+
+          # Explicit exit-code propagation. Don't rely on -e interacting cleanly
+          # with backgrounded jobs.
+          if ! npx wait-on http://localhost:3000 --timeout 60000 --verbose; then
+            echo "::error::wait-on timed out: serve never responded on port 3000 within 60s"
+            echo "--- serve.log ---"
+            cat serve.log || true
+            echo "--- listening sockets ---"
+            ss -tlnp 2>/dev/null || netstat -tlnp 2>/dev/null || true
+            echo "--- serve process state ---"
+            ps -fp "$SERVE_PID" 2>/dev/null || echo "serve process $SERVE_PID is gone"
+            echo "--- direct curl attempt ---"
+            curl -v --max-time 5 http://localhost:3000/ 2>&1 || true
+            exit 1
+          fi
+
+          echo "serve is responding on http://localhost:3000"
         env:
           CI: true
 
@@ -482,9 +598,38 @@ jobs:
           path: tests/e2e/fixtures/
       - name: Start server
         run: |
-          npx serve out -l 3000 &
-          sleep 5
-          npx wait-on http://localhost:3000 --timeout 60000
+          set -euo pipefail
+          npx serve out -l 3000 > serve.log 2>&1 &
+          SERVE_PID=$!
+          echo "serve started, PID=$SERVE_PID"
+
+          # Brief grace period for the child to spawn; wait-on does the real polling.
+          sleep 2
+
+          # Verify serve didn't already exit (port in use, missing out/, etc.).
+          if ! kill -0 "$SERVE_PID" 2>/dev/null; then
+            echo "::error::serve exited immediately after launch"
+            echo "--- serve.log ---"
+            cat serve.log || true
+            exit 1
+          fi
+
+          # Explicit exit-code propagation. Don't rely on -e interacting cleanly
+          # with backgrounded jobs.
+          if ! npx wait-on http://localhost:3000 --timeout 60000 --verbose; then
+            echo "::error::wait-on timed out: serve never responded on port 3000 within 60s"
+            echo "--- serve.log ---"
+            cat serve.log || true
+            echo "--- listening sockets ---"
+            ss -tlnp 2>/dev/null || netstat -tlnp 2>/dev/null || true
+            echo "--- serve process state ---"
+            ps -fp "$SERVE_PID" 2>/dev/null || echo "serve process $SERVE_PID is gone"
+            echo "--- direct curl attempt ---"
+            curl -v --max-time 5 http://localhost:3000/ 2>&1 || true
+            exit 1
+          fi
+
+          echo "serve is responding on http://localhost:3000"
         env:
           CI: true
       - name: Prime Supabase connection pool
@@ -596,9 +741,38 @@ jobs:
           path: tests/e2e/fixtures/
       - name: Start server
         run: |
-          npx serve out -l 3000 &
-          sleep 5
-          npx wait-on http://localhost:3000 --timeout 60000
+          set -euo pipefail
+          npx serve out -l 3000 > serve.log 2>&1 &
+          SERVE_PID=$!
+          echo "serve started, PID=$SERVE_PID"
+
+          # Brief grace period for the child to spawn; wait-on does the real polling.
+          sleep 2
+
+          # Verify serve didn't already exit (port in use, missing out/, etc.).
+          if ! kill -0 "$SERVE_PID" 2>/dev/null; then
+            echo "::error::serve exited immediately after launch"
+            echo "--- serve.log ---"
+            cat serve.log || true
+            exit 1
+          fi
+
+          # Explicit exit-code propagation. Don't rely on -e interacting cleanly
+          # with backgrounded jobs.
+          if ! npx wait-on http://localhost:3000 --timeout 60000 --verbose; then
+            echo "::error::wait-on timed out: serve never responded on port 3000 within 60s"
+            echo "--- serve.log ---"
+            cat serve.log || true
+            echo "--- listening sockets ---"
+            ss -tlnp 2>/dev/null || netstat -tlnp 2>/dev/null || true
+            echo "--- serve process state ---"
+            ps -fp "$SERVE_PID" 2>/dev/null || echo "serve process $SERVE_PID is gone"
+            echo "--- direct curl attempt ---"
+            curl -v --max-time 5 http://localhost:3000/ 2>&1 || true
+            exit 1
+          fi
+
+          echo "serve is responding on http://localhost:3000"
         env:
           CI: true
       - name: Prime Supabase connection pool