Summary
Turn on per-message-deflate (PMD, RFC 7692) on the cluster's WebSocket server in `ws_server.rs`. PMD compresses each WebSocket message before it hits the wire and decompresses it on the receiving side. With our existing per-broadcast encode pattern (Shape B — encode the broadcast frame once per tick, share the bytes across all subscribers), PMD's CPU cost gets amortized to once per broadcast — not once per subscriber — which makes it cheap relative to the bandwidth saved.
Empirical motivation: the 2026-04-26 `clusters_4` benchmark run topped out at 7,250 players on `c7i.2xlarge` cluster nodes, with the cluster outbound NIC as the binding bottleneck — sustained ~3 Gbps per cluster while broadcast demand at the failing tier was ~9 GB/s (~72 Gbps). Cluster CPU still had 65% headroom at the ceiling. Reducing the bytes the cluster needs to push out is the most direct lift on the ceiling that doesn't require a hardware change or an architectural refactor; PMD is the cheapest such lift available.
Expected impact
- Bandwidth: 30–50% reduction on broadcast frames. Postcard-encoded `DeltaPayload` carries lots of compressible structure: repeated UUIDs across entities, position/velocity floats spatially clustered, header fields that recur across frames. DEFLATE handles this well (see the smoke-test sketch after this list).
- CPU: with the per-broadcast encode cache pattern, compression runs once per `(cluster, tick)` and the compressed bytes are reused for every subscriber. At 4 clusters × 20 Hz that's ~80 compressions/sec total across the deployment, not 80 × N_subscribers/sec. Negligible against the cluster's 65% tick-budget headroom.
- Decompression runs per-subscriber on the client side. For our benchmark swarm driver (which already shares the decoded result across simulated players via the per-frame decode cache from `arcane_swarm/feat/latency-decomposition`), decompression also amortizes via the same cache. For real game clients (one process per player), decompression is per-player — but a real client decompresses one ~70–200 KB frame every 50 ms, well within a trivial CPU budget.
- Ceiling: estimated +30–50% lift on the player-count ceiling for the same NIC bandwidth, given today's bottleneck shape. Subject to measurement.
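To make the bandwidth estimate falsifiable before touching the server, a standalone smoke test can approximate the DEFLATE ratio on data with the redundancy pattern described above. A minimal sketch, assuming `serde` (derive), `postcard` (with its `alloc` feature), and `flate2` as dev-dependencies; `Entity` and `Frame` are synthetic stand-ins, not the real `arcane-wire` types:

```rust
use std::io::Write;

use flate2::{write::DeflateEncoder, Compression};
use serde::Serialize;

#[derive(Serialize)]
struct Entity {
    id: [u8; 16],  // UUID bytes; many share a common prefix in practice
    pos: [f32; 3], // spatially clustered positions
    vel: [f32; 3], // small, similar velocities
}

#[derive(Serialize)]
struct Frame {
    tick: u64,
    entities: Vec<Entity>,
}

fn main() {
    // 2,000 entities clustered around a point, like one busy cluster region.
    let entities: Vec<Entity> = (0..2_000u32)
        .map(|i| {
            let mut id = [0u8; 16];
            id[12..16].copy_from_slice(&i.to_le_bytes()); // shared 12-byte prefix
            Entity {
                id,
                pos: [100.0 + (i % 50) as f32, 0.0, 100.0 + (i / 50) as f32],
                vel: [0.1, 0.0, 0.1],
            }
        })
        .collect();

    let frame = Frame { tick: 42, entities };
    let plain = postcard::to_allocvec(&frame).expect("encode");

    // Same algorithm PMD negotiates; ratio here is an upper-bound-ish hint,
    // since real frames are noisier than this synthetic one.
    let mut enc = DeflateEncoder::new(Vec::new(), Compression::default());
    enc.write_all(&plain).expect("compress");
    let packed = enc.finish().expect("finish");

    println!(
        "plain: {} B, deflate: {} B ({:.0}% saved)",
        plain.len(),
        packed.len(),
        100.0 * (1.0 - packed.len() as f64 / plain.len() as f64)
    );
}
```

Feeding this a captured real `DeltaPayload` instead of the synthetic frame would turn the 30–50% estimate into a measurement.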
Why this is "free" specifically for Arcane
PMD has been around since 2015, and most projects don't bother because the benefit-to-cost ratio is mediocre when you compress per-message-per-subscriber: the CPU cost scales with subscriber count, and the bandwidth saved usually isn't the bottleneck. Arcane's broadcast model inverts both: (a) encode-once-fan-out means compression cost is per-broadcast, not per-subscriber, and (b) NIC bandwidth *is* the empirical bottleneck. The thing that makes it usually not worth it makes it specifically worth it for us.
Implementation sketch
`tokio-tungstenite` (the WS library used by `ws_server.rs` and the swarm clients) supports per-message-deflate via the `deflate` feature flag. Negotiation happens at handshake time: the cluster advertises support, clients accept it, and both sides then run DEFLATE on outgoing messages and INFLATE on incoming ones.
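The negotiation itself is just HTTP headers (RFC 7692), so the handshake test called out in the touch points below doesn't need any client library's extension API. A sketch of a raw-socket probe, assuming a locally running cluster on a placeholder address; the `Sec-WebSocket-Key` is the fixed example value from RFC 6455, which is fine for a test:

```rust
use std::io::{Read, Write};
use std::net::TcpStream;

fn main() -> std::io::Result<()> {
    // Hypothetical local ws_server address; substitute the real bind addr.
    let mut stream = TcpStream::connect("127.0.0.1:9001")?;

    // Minimal WebSocket upgrade request offering permessage-deflate.
    stream.write_all(
        b"GET /ws HTTP/1.1\r\n\
          Host: 127.0.0.1:9001\r\n\
          Upgrade: websocket\r\n\
          Connection: Upgrade\r\n\
          Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==\r\n\
          Sec-WebSocket-Version: 13\r\n\
          Sec-WebSocket-Extensions: permessage-deflate; client_max_window_bits\r\n\
          \r\n",
    )?;

    // One read is enough to capture the small 101 response in practice.
    let mut buf = [0u8; 4096];
    let n = stream.read(&mut buf)?;
    let response = String::from_utf8_lossy(&buf[..n]);

    // RFC 7692: a server that accepts the extension echoes it back.
    assert!(
        response
            .to_ascii_lowercase()
            .contains("sec-websocket-extensions: permessage-deflate"),
        "server did not negotiate permessage-deflate:\n{response}"
    );
    println!("PMD negotiated OK");
    Ok(())
}
```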
Specific touch points:
- `arcane/crates/arcane-infra/Cargo.toml` — enable `tokio-tungstenite` with the `deflate` feature.
- `arcane/crates/arcane-infra/src/ws_server.rs` — configure the server-side `WebSocketConfig` to advertise PMD on the handshake. Test that the negotiated extension shows up in the response headers (see the probe sketch above).
- Swarm client (`arcane_swarm/crates/arcane-swarm/src/bin/arcane_swarm/backends_arcane.rs`) — same feature flag in the client connect path so PMD gets accepted and decompression happens automatically.
- Real-game clients — confirm UE5 and Unity native WebSocket bindings support PMD (browser WebSocket supports it natively). For the initial rollout both sides are controlled by us; broader interop to be validated when the UE5 plugin work catches up.
- Per-broadcast encode cache — verify the existing Shape B encode path still benefits: compression should happen after the per-broadcast encode, and the compressed bytes should be the unit shared across subscriber sends (see the sketch after this list). Otherwise we'd be compressing N times per broadcast and lose the win.
- Configuration — add a `cluster_ws_deflate_enabled` field (default `true`) so we can A/B at benchmark time and so studios can disable it if their custom clients don't speak PMD.
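To pin down the invariant in the encode-cache bullet, here is a library-agnostic sketch of the intended shape, with hypothetical types; `flate2` stands in for whatever DEFLATE path the WS layer actually exposes. Note that sharing one compressed frame across connections is only sound when `no_context_takeover` is negotiated, so frames don't depend on per-connection compressor state; that is the thing to verify against tokio-tungstenite's extension plumbing.

```rust
use std::io::Write;
use std::sync::Arc;

use flate2::{write::DeflateEncoder, Compression};

/// Cached per-(cluster, tick) broadcast payload. Hypothetical type; the real
/// Shape B cache lives in ws_server.rs.
struct BroadcastFrame {
    tick: u64,
    deflated: Arc<[u8]>, // shared, already-compressed bytes
}

fn build_broadcast_frame(tick: u64, encoded: &[u8]) -> BroadcastFrame {
    // `encoded` is the postcard-encoded frame from the existing encode path.
    // Compression happens here, once per broadcast, not in the send loop.
    let mut enc = DeflateEncoder::new(Vec::new(), Compression::fast());
    enc.write_all(encoded).expect("deflate");
    BroadcastFrame {
        tick,
        deflated: enc.finish().expect("finish").into(),
    }
}

fn fan_out(frame: &BroadcastFrame, subscribers: usize) {
    for _ in 0..subscribers {
        // Each send costs a refcount bump, not a recompression. In the real
        // server this handle is what the per-connection write task consumes.
        let payload: Arc<[u8]> = Arc::clone(&frame.deflated);
        let _ = (frame.tick, payload.len()); // stand-in for the actual send
    }
}

fn main() {
    let encoded = vec![0u8; 70 * 1024]; // stand-in for a ~70 KB encoded frame
    let frame = build_broadcast_frame(42, &encoded);
    fan_out(&frame, 1_800); // roughly one cluster's subscribers at the ceiling
}
```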
Out of scope for this issue
- Pluggable transport layer (QUIC, raw UDP). Tracked in arcane#43. PMD is a WebSocket-specific optimization that stands on its own whether or not we eventually move to QUIC.
- Wire-format-level compression (e.g. quantizing position+velocity from f32 to fixed-point). Independent optimization, would compose with PMD additively. Worth a separate issue.
- Delta-only broadcasts (arcane#30). Independent optimization, also composes additively.
Acceptance criteria
- Benchmark re-run on the same `c7i.2xlarge` `clusters_4` fleet with PMD on; benchmark journal entry compares ceiling and bandwidth against the 7,250-player baseline from `20260426_060905`.
- `arcane-wire` — PMD is a transport-layer concern; the encoded `ServerFrame::Delta` bytes inside are unchanged.
- `last_tick_us` regression checked: we still have meaningful tick-budget headroom at the new ceiling, i.e. the compression cost didn't push us into a different bottleneck.
- Config flag (`cluster_ws_deflate_enabled`, default `true`) lets us A/B test and lets studios disable it for clients that can't negotiate PMD.
Quick win next to it (out of this issue, but worth filing alongside)
Quantize `position` and `velocity` from `Vec3<f32>` (12 B each) to fixed-point or `f16` (~6 B each). Drops per-entity state from 56 B to ~44 B (~21% reduction) at the cluster end without changing the wire-schema fields, only their representation. Composes with PMD and is independent of it.
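A sketch of the representation change, assuming 16-bit fixed point with centimeter resolution over roughly ±327 m; the range and scale are illustrative knobs, not decided values:

```rust
/// Pack one position/velocity component into 16-bit fixed point.
/// SCALE = 100 gives 1 cm resolution; i16 then covers about +/-327.67 m.
fn quantize(v: f32) -> i16 {
    const SCALE: f32 = 100.0;
    (v * SCALE).round().clamp(i16::MIN as f32, i16::MAX as f32) as i16
}

fn dequantize(q: i16) -> f32 {
    q as f32 / 100.0
}

fn main() {
    let pos = [123.456_f32, -0.25, 87.013];
    let packed: Vec<i16> = pos.iter().map(|&c| quantize(c)).collect();
    let back: Vec<f32> = packed.iter().map(|&q| dequantize(q)).collect();
    // 3 x f32 (12 B) becomes 3 x i16 (6 B) per vector; position + velocity
    // together drop from 24 B to 12 B of the per-entity state.
    println!("{pos:?} -> {packed:?} -> {back:?}");
}
```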