Summary
Turn on per-message-deflate (PMD, RFC 7692) on the cluster's WebSocket server in `ws_server.rs`. PMD compresses each WebSocket message before it hits the wire and decompresses it on the receiving side. With our existing per-broadcast encode pattern (Shape B — encode the broadcast frame once per tick, share the bytes across all subscribers), PMD's CPU cost gets amortized to once per broadcast — not once per subscriber — which makes it cheap relative to the bandwidth saved.
Empirical motivation: the 2026-04-26 `clusters_4` benchmark run topped out at 7,250 players on `c7i.2xlarge` cluster nodes, with the cluster outbound NIC as the binding bottleneck — sustained ~3 Gbps per cluster while broadcast demand at the failing tier was ~9 GB/s (~72 Gbps). Cluster CPU still had 65% headroom at the ceiling. Reducing the bytes the cluster needs to push out is the most direct lift on the ceiling that doesn't require a hardware change or an architectural refactor; PMD is the cheapest such lift available.
Expected impact
- Bandwidth: 30–50% reduction on broadcast frames. Postcard-encoded `DeltaPayload` carries lots of compressible structure: repeated UUIDs across entities, position/velocity floats spatially clustered, header fields that recur across frames. DEFLATE handles this well (see the smoke-test sketch after this list).
- CPU: with the per-broadcast encode cache pattern, compression runs once per `(cluster, tick)` and the compressed bytes are reused for every subscriber. At 4 clusters × 20 Hz that's ~80 compressions/sec total across the deployment, not 80 × N_subscribers/sec. Negligible against the cluster's 65% tick-budget headroom.
- Decompression runs per-subscriber on the client side. For our benchmark swarm driver (which already shares the decoded result across simulated players via the per-frame decode cache from `arcane_swarm/feat/latency-decomposition`), decompression also amortizes via the same cache. For real game clients (one process per player), decompression is per-player — but a real client decompresses one ~70–200 KB frame every 50 ms, well within a trivial CPU budget.
- Ceiling: estimated +30–50% lift on the player-count ceiling for the same NIC bandwidth, given today's bottleneck shape. Subject to measurement.
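To make the bandwidth estimate falsifiable before touching the server, a standalone smoke test can approximate the DEFLATE ratio on data with the redundancy pattern described above. A minimal sketch, assuming `serde` (derive), `postcard` (with its `alloc` feature), and `flate2` as dev-dependencies; `Entity` and `Frame` are synthetic stand-ins, not the real `arcane-wire` types:

```rust
use std::io::Write;

use flate2::{write::DeflateEncoder, Compression};
use serde::Serialize;

#[derive(Serialize)]
struct Entity {
    id: [u8; 16],  // UUID bytes; many share a common prefix in practice
    pos: [f32; 3], // spatially clustered positions
    vel: [f32; 3], // small, similar velocities
}

#[derive(Serialize)]
struct Frame {
    tick: u64,
    entities: Vec<Entity>,
}

fn main() {
    // 2,000 entities clustered around a point, like one busy cluster region.
    let entities: Vec<Entity> = (0..2_000u32)
        .map(|i| {
            let mut id = [0u8; 16];
            id[12..16].copy_from_slice(&i.to_le_bytes()); // shared 12-byte prefix
            Entity {
                id,
                pos: [100.0 + (i % 50) as f32, 0.0, 100.0 + (i / 50) as f32],
                vel: [0.1, 0.0, 0.1],
            }
        })
        .collect();

    let frame = Frame { tick: 42, entities };
    let plain = postcard::to_allocvec(&frame).expect("encode");

    // Same algorithm PMD negotiates; ratio here is an upper-bound-ish hint,
    // since real frames are noisier than this synthetic one.
    let mut enc = DeflateEncoder::new(Vec::new(), Compression::default());
    enc.write_all(&plain).expect("compress");
    let packed = enc.finish().expect("finish");

    println!(
        "plain: {} B, deflate: {} B ({:.0}% saved)",
        plain.len(),
        packed.len(),
        100.0 * (1.0 - packed.len() as f64 / plain.len() as f64)
    );
}
```

Feeding this a captured real `DeltaPayload` instead of the synthetic frame would turn the 30–50% estimate into a measurement.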
Why this is "free" specifically for Arcane
PMD has been around since 2015, and most projects don't bother because the benefit-to-cost ratio is mediocre when you compress per-message-per-subscriber: the CPU cost scales with subscriber count, and the bandwidth saved usually isn't the bottleneck. Arcane's broadcast model inverts both: (a) encode-once-fan-out means compression cost is per-broadcast, not per-subscriber, and (b) NIC bandwidth *is* the empirical bottleneck. The thing that makes it usually not worth it makes it specifically worth it for us.
Implementation sketch
`tokio-tungstenite` (the WS library used by `ws_server.rs` and the swarm clients) supports per-message-deflate via the `deflate` feature flag. Negotiation happens at handshake time: the cluster advertises support, clients accept it, and both sides then run DEFLATE on outgoing messages and INFLATE on incoming ones.
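The negotiation itself is just HTTP headers (RFC 7692), so the handshake test called out in the touch points below doesn't need any client library's extension API. A sketch of a raw-socket probe, assuming a locally running cluster on a placeholder address; the `Sec-WebSocket-Key` is the fixed example value from RFC 6455, which is fine for a test:

```rust
use std::io::{Read, Write};
use std::net::TcpStream;

fn main() -> std::io::Result<()> {
    // Hypothetical local ws_server address; substitute the real bind addr.
    let mut stream = TcpStream::connect("127.0.0.1:9001")?;

    // Minimal WebSocket upgrade request offering permessage-deflate.
    stream.write_all(
        b"GET /ws HTTP/1.1\r\n\
          Host: 127.0.0.1:9001\r\n\
          Upgrade: websocket\r\n\
          Connection: Upgrade\r\n\
          Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==\r\n\
          Sec-WebSocket-Version: 13\r\n\
          Sec-WebSocket-Extensions: permessage-deflate; client_max_window_bits\r\n\
          \r\n",
    )?;

    // One read is enough to capture the small 101 response in practice.
    let mut buf = [0u8; 4096];
    let n = stream.read(&mut buf)?;
    let response = String::from_utf8_lossy(&buf[..n]);

    // RFC 7692: a server that accepts the extension echoes it back.
    assert!(
        response
            .to_ascii_lowercase()
            .contains("sec-websocket-extensions: permessage-deflate"),
        "server did not negotiate permessage-deflate:\n{response}"
    );
    println!("PMD negotiated OK");
    Ok(())
}
```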
Specific touch points:
- `arcane/crates/arcane-infra/Cargo.toml` — enable `tokio-tungstenite` with the `deflate` feature.
- `arcane/crates/arcane-infra/src/ws_server.rs` — configure the server-side `WebSocketConfig` to advertise PMD on the handshake. Test that the negotiated extension shows up in the response headers (see the probe sketch above).
- Swarm client (`arcane_swarm/crates/arcane-swarm/src/bin/arcane_swarm/backends_arcane.rs`) — same feature flag in the client connect path so PMD gets accepted and decompression happens automatically.
- Real-game clients — confirm UE5 and Unity native WebSocket bindings support PMD (browser WebSocket supports it natively). For the initial rollout both sides are controlled by us; broader interop to be validated when the UE5 plugin work catches up.
- Per-broadcast encode cache — verify the existing Shape B encode path still benefits: compression should happen after the per-broadcast encode, and the compressed bytes should be the unit shared across subscriber sends (see the sketch after this list). Otherwise we'd be compressing N times per broadcast and lose the win.
- Configuration — add a `cluster_ws_deflate_enabled` field (default `true`) so we can A/B at benchmark time and so studios can disable it if their custom clients don't speak PMD.
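To pin down the invariant in the encode-cache bullet, here is a library-agnostic sketch of the intended shape, with hypothetical types; `flate2` stands in for whatever DEFLATE path the WS layer actually exposes. Note that sharing one compressed frame across connections is only sound when `no_context_takeover` is negotiated, so frames don't depend on per-connection compressor state; that is the thing to verify against tokio-tungstenite's extension plumbing.

```rust
use std::io::Write;
use std::sync::Arc;

use flate2::{write::DeflateEncoder, Compression};

/// Cached per-(cluster, tick) broadcast payload. Hypothetical type; the real
/// Shape B cache lives in ws_server.rs.
struct BroadcastFrame {
    tick: u64,
    deflated: Arc<[u8]>, // shared, already-compressed bytes
}

fn build_broadcast_frame(tick: u64, encoded: &[u8]) -> BroadcastFrame {
    // `encoded` is the postcard-encoded frame from the existing encode path.
    // Compression happens here, once per broadcast, not in the send loop.
    let mut enc = DeflateEncoder::new(Vec::new(), Compression::fast());
    enc.write_all(encoded).expect("deflate");
    BroadcastFrame {
        tick,
        deflated: enc.finish().expect("finish").into(),
    }
}

fn fan_out(frame: &BroadcastFrame, subscribers: usize) {
    for _ in 0..subscribers {
        // Each send costs a refcount bump, not a recompression. In the real
        // server this handle is what the per-connection write task consumes.
        let payload: Arc<[u8]> = Arc::clone(&frame.deflated);
        let _ = (frame.tick, payload.len()); // stand-in for the actual send
    }
}

fn main() {
    let encoded = vec![0u8; 70 * 1024]; // stand-in for a ~70 KB encoded frame
    let frame = build_broadcast_frame(42, &encoded);
    fan_out(&frame, 1_800); // roughly one cluster's subscribers at the ceiling
}
```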
Out of scope for this issue
- Pluggable transport layer (QUIC, raw UDP). Tracked in arcane#43. PMD is a WebSocket-specific optimization that stands on its own whether or not we eventually move to QUIC.
- Wire-format-level compression (e.g. quantizing position+velocity from f32 to fixed-point). Independent optimization, would compose with PMD additively. Worth a separate issue.
- Delta-only broadcasts (arcane#30). Independent optimization, also composes additively.
Acceptance criteria
- Benchmark re-run on the same `c7i.2xlarge` `clusters_4` fleet with PMD on; benchmark journal entry compares ceiling and bandwidth against the 7,250-player baseline from `20260426_060905`.
- `arcane-wire` — PMD is a transport-layer concern; the encoded `ServerFrame::Delta` bytes inside are unchanged.
- `last_tick_us` regression checked: we still have meaningful tick-budget headroom at the new ceiling, i.e. the compression cost didn't push us into a different bottleneck.
- Config flag (`cluster_ws_deflate_enabled`, default `true`) lets us A/B test and lets studios disable it for clients that can't negotiate PMD.
Quick win next to it (out of this issue, but worth filing alongside)
Quantize `position` and `velocity` from `Vec3<f32>` (12 B each) to fixed-point or `f16` (~6 B each). Drops per-entity state from 56 B to ~44 B (~21% reduction) at the cluster end without changing the wire-schema fields, only their representation. Composes with PMD and is independent of it.
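A sketch of the representation change, assuming 16-bit fixed point with centimeter resolution over roughly ±327 m; the range and scale are illustrative knobs, not decided values:

```rust
/// Pack one position/velocity component into 16-bit fixed point.
/// SCALE = 100 gives 1 cm resolution; i16 then covers about +/-327.67 m.
fn quantize(v: f32) -> i16 {
    const SCALE: f32 = 100.0;
    (v * SCALE).round().clamp(i16::MIN as f32, i16::MAX as f32) as i16
}

fn dequantize(q: i16) -> f32 {
    q as f32 / 100.0
}

fn main() {
    let pos = [123.456_f32, -0.25, 87.013];
    let packed: Vec<i16> = pos.iter().map(|&c| quantize(c)).collect();
    let back: Vec<f32> = packed.iter().map(|&q| dequantize(q)).collect();
    // 3 x f32 (12 B) becomes 3 x i16 (6 B) per vector; position + velocity
    // together drop from 24 B to 12 B of the per-entity state.
    println!("{pos:?} -> {packed:?} -> {back:?}");
}
```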