fix(gateway): improve UX during gateway cold-start initialization #43

@FL-AntoineDurand

Problem

When a user navigates to a project editor page and no gateway is currently allocated for their organization, there is a noticeable delay (potentially 30-60+ seconds) during which the frontend appears broken: the WebSocket connection for YJS collaboration fails repeatedly until the gateway is fully initialized. The user gets no meaningful feedback during this window and may conclude that the application has failed.

Root Cause Analysis

The gateway cold-start involves a multi-step pipeline that creates a window where WebSocket connections fail:

  1. Frontend calls POST /gateway/start → Ganymede allocates a gateway container
  2. Ganymede creates nginx config for org-{uuid}.domain.local → reloads nginx
  3. Ganymede health-checks GET /collab/ping (polls every 200ms, 5s timeout)
  4. Ganymede calls POST /collab/start → gateway loads all backend modules (~1s)
  5. Frontend receives gateway hostname → immediately tries WebSocket connection
  6. First WebSocket triggers lazy project init (room creation, project:init event, permissions fetch)

Timing gaps that cause failures

The following are hypotheses that need confirmation:

  • H1: DNS propagation delay — After nginx reload, org-{uuid}.domain.local may not be immediately resolvable or routable. The health check passes (Ganymede resolves DNS), but the browser may have cached a DNS failure from an earlier attempt.

  • H2: Race between frontend and gateway init — The frontend receives the gateway hostname from GET /orgs/{org_id}/gateway and immediately connects WebSocket. If the gateway's POST /collab/start hasn't completed yet (module loading, WebSocket handler grafting), the connection fails.

  • H3: y-websocket retry behavior — The y-websocket library has its own reconnection logic with backoff. If the first few connection attempts fail during the init window, the backoff delay adds to the perceived wait time.

  • H4: Browser WebSocket connection caching — After a failed WebSocket connection, the browser may cache the failure or apply backoff before retrying.

Confirmation Strategy

For H1 (DNS propagation)

  • Add timing logs in the frontend between receiving gateway_hostname and first successful WebSocket connection
  • Compare with direct IP-based connection to isolate DNS delay
  • Check if the issue reproduces less frequently on subsequent page loads (where DNS is cached)
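
The timing logs for H1 could be collected with a small helper like the one below. This is a minimal sketch, not existing project code: `ColdStartTimer` and the mark names are hypothetical, and the clock is injectable only so the helper can be exercised deterministically.

```typescript
// Hypothetical helper: records named timestamps and reports the gap between two of them.
type Clock = () => number;

class ColdStartTimer {
  private readonly marks = new Map<string, number>();
  private readonly now: Clock;

  // Date.now() is the default clock; in a browser, () => performance.now()
  // would give a monotonic reading instead.
  constructor(now: Clock = () => Date.now()) {
    this.now = now;
  }

  // Record a timestamp under `name`. Only the first call per name is kept,
  // so later retries do not overwrite the original timestamp.
  mark(name: string): void {
    if (!this.marks.has(name)) {
      this.marks.set(name, this.now());
    }
  }

  // Milliseconds between two marks, or undefined if either mark is missing.
  between(from: string, to: string): number | undefined {
    const start = this.marks.get(from);
    const end = this.marks.get(to);
    if (start === undefined || end === undefined) return undefined;
    return end - start;
  }
}
```

In the editor page, `mark("hostname_received")` would be called when gateway_hostname arrives and `mark("ws_connected")` when the provider first reports a connected status; logging `between(...)` for both hostname-based and IP-based connections gives the comparison described above.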

For H2 (Race condition)

  • Add a /collab/ready endpoint to the gateway that returns 200 only after full initialization
  • Have the frontend poll this endpoint before attempting WebSocket connection
  • Log timestamps: when frontend gets hostname vs when gateway finishes /collab/start
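
On the gateway side, the /collab/ready check could be backed by a small state holder such as the following. This is a framework-agnostic sketch: `ReadinessGate` is a hypothetical name, and answering 503 before initialization is an assumption (the proposal only specifies 200 after full initialization).

```typescript
// Hypothetical readiness tracker for the gateway process.
type ReadyResponse = {
  status: 200 | 503;
  body: { ready: boolean; finishedAt: string | null };
};

class ReadinessGate {
  private finishedAt: Date | null = null;

  // Call once, as the very last step of the /collab/start handler.
  markReady(at: Date = new Date()): void {
    if (this.finishedAt === null) {
      this.finishedAt = at;
    }
  }

  // What a GET /collab/ready handler would send back.
  // 503 before initialization is an assumption, not part of the proposal.
  respond(): ReadyResponse {
    if (this.finishedAt === null) {
      return { status: 503, body: { ready: false, finishedAt: null } };
    }
    return {
      status: 200,
      body: { ready: true, finishedAt: this.finishedAt.toISOString() },
    };
  }
}
```

Returning the completion timestamp in the body makes it possible to compare, from the frontend logs alone, when the hostname was received against when /collab/start finished.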

For H3 (y-websocket backoff)

  • Log y-websocket connection attempts and their timing
  • Check WebsocketProvider configuration for maxBackoffTime and initial retry intervals
  • Test with aggressive retry settings (short backoff, max retries)
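
To estimate how much of the perceived wait H3 could explain, the cumulative backoff can be computed for a given schedule. The sketch below assumes a capped exponential backoff (delay doubles per failed attempt, up to a maximum); the base delay and cap are placeholders and should be checked against the y-websocket version actually in use.

```typescript
// Delay before the n-th reconnect attempt (n starts at 1), capped at maxBackoffMs.
// The doubling schedule is an assumption to verify against the library source.
function backoffDelayMs(attempt: number, baseMs: number, maxBackoffMs: number): number {
  return Math.min(Math.pow(2, attempt) * baseMs, maxBackoffMs);
}

// Total time spent waiting between attempts across `failedAttempts` consecutive failures.
function totalBackoffMs(failedAttempts: number, baseMs: number, maxBackoffMs: number): number {
  let total = 0;
  for (let attempt = 1; attempt <= failedAttempts; attempt++) {
    total += backoffDelayMs(attempt, baseMs, maxBackoffMs);
  }
  return total;
}
```

With an assumed 100 ms base and 2500 ms cap, five consecutive failures add 200 + 400 + 800 + 1600 + 2500 = 5500 ms of waiting. Comparing such estimates with the logged attempt timings shows whether backoff is a major or minor contributor.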

For H4 (Browser caching)

  • Check browser DevTools Network tab for WebSocket connection timing
  • Compare behavior across browsers (Chrome vs Firefox)

Proposed Fixes

1. Frontend: Show meaningful loading state during gateway init (Quick win)

Instead of showing a broken/empty page while WebSocket retries, show a clear "Initializing workspace..." progress indicator during the cold-start window.
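
One way to drive such an indicator is a small state machine that treats connection failures during cold start as expected rather than as errors. This is only a sketch: the phase names, event names, and messages other than "Initializing workspace..." are hypothetical.

```typescript
type Phase = "allocating" | "waiting_for_gateway" | "connecting" | "ready" | "failed";

type GatewayEvent =
  | { type: "hostname_received" }
  | { type: "gateway_ready" }
  | { type: "ws_connected" }
  | { type: "ws_disconnected" }
  | { type: "gave_up" };

// Pure transition function: given the current phase and an event, return the next phase.
function nextPhase(phase: Phase, event: GatewayEvent): Phase {
  switch (event.type) {
    case "hostname_received":
      return phase === "allocating" ? "waiting_for_gateway" : phase;
    case "gateway_ready":
      return phase === "waiting_for_gateway" ? "connecting" : phase;
    case "ws_connected":
      return "ready";
    case "ws_disconnected":
      // A failed or dropped socket during cold start is expected; stay in the loading phase.
      return phase === "ready" ? "connecting" : phase;
    case "gave_up":
      return "failed";
  }
}

// Text for the progress indicator, or null when the editor can be shown.
function loadingMessage(phase: Phase): string | null {
  switch (phase) {
    case "allocating":
    case "waiting_for_gateway":
      return "Initializing workspace...";
    case "connecting":
      return "Connecting...";
    case "failed":
      return "Could not start the workspace. Please retry.";
    case "ready":
      return null;
  }
}
```

Keeping the transitions in a pure function means the UI layer only renders `loadingMessage(phase)`, and the retry noise from the WebSocket layer never reaches the user as an error state.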

2. Frontend: Gate WebSocket connection on gateway readiness

After receiving gateway_hostname, poll GET /collab/ping (or a new /collab/ready endpoint) before starting the y-websocket connection. This prevents wasted connection attempts and confusing error states.
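
A minimal sketch of the gating logic follows. `waitForGatewayReady` and its options are hypothetical; the probe is injected so the same loop works with /collab/ping or a future /collab/ready, and a rejected probe (DNS or network error) is treated as "not ready yet".

```typescript
type Probe = () => Promise<boolean>;

interface WaitOptions {
  intervalMs: number;
  timeoutMs: number;
  // Injectable for tests; defaults use real timers and the system clock.
  sleep?: (ms: number) => Promise<void>;
  now?: () => number;
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Resolves true as soon as the probe succeeds, or false once the timeout would be exceeded.
async function waitForGatewayReady(probe: Probe, options: WaitOptions): Promise<boolean> {
  const sleep = options.sleep ?? defaultSleep;
  const now = options.now ?? (() => Date.now());
  const deadline = now() + options.timeoutMs;

  while (true) {
    // Network or DNS errors reject the probe; count them as "not ready yet".
    const ready = await probe().catch(() => false);
    if (ready) return true;
    // Stop if the next poll would land after the deadline.
    if (now() + options.intervalMs > deadline) return false;
    await sleep(options.intervalMs);
  }
}
```

In the frontend, the probe would typically wrap an HTTP request to the gateway hostname and return whether the response was successful, and the y-websocket provider would be created (or connected) only after this function resolves to true.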

3. Backend: Return init status from /gateway/start response

Include a flag in the /gateway/start response indicating whether the gateway is freshly allocated (cold start) vs already running (warm). The frontend can adjust its loading UX accordingly.
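
As an illustration, the frontend could branch on such a flag as shown below. The response shape, the `cold_start` field name, and the timeout values are all assumptions made for this sketch; the real /gateway/start response may differ.

```typescript
// Hypothetical response shape; `cold_start` is an assumed field name.
interface GatewayStartResponse {
  gateway_hostname: string;
  cold_start?: boolean;
}

interface LoadingPlan {
  showProgress: boolean;
  message: string | null;
  readinessTimeoutMs: number;
}

// Decide how the editor page should behave while the gateway comes up.
function planLoading(response: GatewayStartResponse): LoadingPlan {
  // A missing flag is treated as warm so older backends keep the current behavior.
  if (response.cold_start === true) {
    // 60 s is a placeholder based on the delay range reported in this issue.
    return { showProgress: true, message: "Initializing workspace...", readinessTimeoutMs: 60000 };
  }
  // 5 s is an arbitrary placeholder for the warm path.
  return { showProgress: false, message: null, readinessTimeoutMs: 5000 };
}
```

Treating an absent flag as a warm start keeps the change backward compatible while the backend is being updated.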

4. Backend: Ensure /collab/start completes before returning gateway to frontend

Currently, POST /gateway/start triggers /collab/start, but the frontend may query GET /orgs/{org_id}/gateway before initialization completes. Ensure the gateway is returned to the frontend, from either endpoint, only after full initialization.
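
One possible shape for this guarantee is to track an explicit initialization state per allocation and expose the hostname only once that state is ready. The sketch below is hypothetical (`GatewayRegistry` and its methods are not taken from Ganymede) and uses an in-memory map purely for illustration.

```typescript
type GatewayState = "initializing" | "ready";

interface GatewayRecord {
  hostname: string;
  state: GatewayState;
}

class GatewayRegistry {
  private readonly byOrg = new Map<string, GatewayRecord>();

  // Called when a container is allocated, before /collab/start has run.
  beginAllocation(orgId: string, hostname: string): void {
    this.byOrg.set(orgId, { hostname, state: "initializing" });
  }

  // Called only after POST /collab/start has returned successfully.
  markInitialized(orgId: string): void {
    const record = this.byOrg.get(orgId);
    if (record) {
      record.state = "ready";
    }
  }

  // What a GET /orgs/{org_id}/gateway handler would expose:
  // the hostname once ready, otherwise undefined.
  lookup(orgId: string): string | undefined {
    const record = this.byOrg.get(orgId);
    return record && record.state === "ready" ? record.hostname : undefined;
  }
}
```

With this split, a frontend that queries the gateway during initialization sees "no gateway yet" and keeps showing its loading state, instead of receiving a hostname that cannot accept WebSocket connections.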

Context

Discovered during a debugging session on the feat/rbac-permissions branch. The CORS configuration for gateway containers was also missing (ALLOWED_ORIGINS was not set); this has been fixed separately by deriving it from the DOMAIN env var in app-gateway/src/main.ts.

Labels

  • bug: Something isn't working
  • enhancement: New feature or request
  • frontend: Frontend application and UI
  • gateway: Gateway container lifecycle and routing