Epic: Unreal Cluster Node — implementation plan applying Rapier learnings #124

@martinjms

Type: epic / implementation plan. Concrete how-to-build for the Unreal Cluster Node, the second concrete physics backend named in #8. Applies architectural learnings from the Rapier integration (#117 / #118 / #123) and the unified-entity / terrain model decisions captured in docs/architecture/entity-model.md.

Quick Summary

  • 🟡 Status: ready to design + build. Architectural prerequisites are landed (Rapier proves the multi-backend pattern; entity model and terrain epics are documented).
  • Goal: an Unreal Dedicated Server running Chaos physics as an Arcane cluster — wire-compatible with Rust clusters, drop-in for the same cluster_runner protocol, zero extra cluster-imposed limitations vs. local Unreal/Chaos.
  • Default integration shape (from #8 §"Integration shapes", shape #2): Unreal-native. Chaos stays inside the UE process; UE talks Arcane wire format directly to Redis + WS clients + Manager. No Rust/UE FFI for the hot path.
  • Reuses the Rapier-validated patterns: unified entity model (no separate Structure concept), composition (cluster wraps a user-provided game-logic interface), per-entity hooks for body kind / collider / material / sensor / groups, one-tick contact-event delay, wire-format-driven replication.

Why This Matters

Unreal is the modal AAA multiplayer engine. Studios using Unreal want Chaos-on-Chaos parity between client and server because anything else means reconciliation drift, predicted-vs-confirmed visual jitter, and weeks of engineering pain to compensate. A Rapier-based cluster is great for ambient / mid-tier entities and Rust-engine games; an Unreal cluster is the path to player-tier entities in Unreal-engine games.

Without an Unreal Cluster Node:

  • Unreal studios cannot put player-tier entities (combat-grade physics) on Arcane at all — they default back to single-process Unreal Dedicated Servers and inherit the player-count ceiling Arcane was built to break.
  • The heterogeneous-tier vision in #33 is incomplete — there's no engine-parity tier for Unreal games.
  • "Arcane works for any engine" remains a claim, not a fact.

Architectural decisions (locked in by the Rapier work and the per-engine API discipline)

These flow from entity-model.md (the canonical doc) and the Rapier integration. Restated for the Unreal implementer with the corrections from the 2026-05-03 architecture sessions:

  1. Unified entity model. Players, NPCs, projectiles, structures, dropped items — all are Arcane entities with a SpacetimeDB durable row. The Unreal cluster maps each entity to an AActor. No separate AStaticMeshActor vs APawn distinction at the Arcane layer.
  2. Per-engine API, NOT shared types. ⚠️ Correction from earlier framing — the UE plugin uses UE-native types for everything user-facing. There is no EArcaneBodyKind mirroring Rapier's RapierBodyKind. UE devs configure their AActor's Mobility (Static / Stationary / Movable) and the plugin reads it directly. Game logic is written in idiomatic UE C++; Arcane vocabulary lives in docs, not in parallel UE enums. See entity-model.md §8 and §9.
  3. Composition over inheritance — engine-named base class. Game's UE C++ entity classes extend AArcaneUnrealEntity (which itself extends AActor). The plugin reads UE-native properties (Mobility, attached UPrimitiveComponents, UPhysicalMaterial) to derive what gets onto the wire. The user doesn't override the cluster process itself; they author engine-native actors and override engine-native hooks where customization is needed.
  4. Wire-format compatibility is a hard contract. Same EntityStateEntry shape on Redis (postcard) and WS (binary arcane-wire) as Rust clusters. Cross-engine clusters in the same deployment must speak identical bytes. This is achieved either via a thin C-API over the same arcane-wire Rust serializer (see "What about reusing the Rust networking primitives?" below) or via a careful C++ reimplementation.
  5. One-tick contact-event delay. User logic runs first to set intent; physics produces output; contacts surface in the next tick's user hook. Same as Rapier; same reasoning.
  6. Spawn-time customization is engine-native. UE devs configure their AArcaneUnrealEntity subclass via UE's standard means: Mobility for body kind, attached collision components for shape, UPhysicalMaterial for friction/restitution, collision profile for groups, sensor mode via collision response = Overlap-only. Plugin maps these to physics behavior on the C++ side. No mirror of #120's spawn hooks at the Arcane API level — UE's existing component model is the API.
  7. In-tick imperative ops are entity-keyed. IArcaneUnrealPhysics::ApplyImpulse(EntityId, FVector), Raycast(...) etc. — never expose raw AActor* or Chaos handles to the user code. Preserves the wire-format invariant. Same shape as Rapier's #121 proposes; different language, different types.
  8. Public C++ API stability. UE-equivalent of #[non_exhaustive]: pImpl / forward declarations / interface-stable virtuals so adding fields and methods doesn't break downstream UE projects.
  9. Terrain is loaded by the Arcane runtime, not by user code — via IArcaneUnrealMapProvider. UE plugin defines this interface (parallel to Rapier's RapierMapProvider from #119). Game implements it returning UE-native collision (UStaticMesh* / ULandscapeComponent* / AStaticMeshActor*). Runtime computes chunks-needed from entity positions, calls the provider, integrates with World Partition cell streaming.
  10. Cross-engine game logic is the developer's responsibility. A game using both UE Cluster Node (premium tier) and a Rapier cluster (mid tier) writes two parallel game-logic codebases, one per engine plugin. Cross-engine consistency for game rules (damage formulas, drop tables, currency) lives in SpacetimeDB reducers called from every engine's cluster binary. Migration between tiers happens at cluster-process boundaries; durable state in SpacetimeDB is the lingua franca. See entity-model.md §10.
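The one-tick contact-event delay in decision 5 can be illustrated with a small double buffer. This is a minimal sketch in plain C++ with no UE headers; `FContactEventSketch`, `FContactEventBuffer`, and the `EntityId` alias are hypothetical names introduced here for illustration, not plugin API.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

using EntityId = std::uint64_t;  // hypothetical; the real id type comes from the wire format

// Hypothetical record mirroring the ContactEvent { entity_a, entity_b, started } concept.
struct FContactEventSketch {
    EntityId A;
    EntityId B;
    bool bStarted;
};

// Double buffer: physics callbacks append to Pending while tick N steps;
// user logic only ever reads Visible, which holds the events from tick N-1.
class FContactEventBuffer {
public:
    // Called from overlap / hit callbacks during the physics step.
    void Record(EntityId A, EntityId B, bool bStarted) {
        Pending.push_back({A, B, bStarted});
    }

    // Called once at the start of every tick, before the user hook runs.
    void BeginTick() {
        Visible = std::move(Pending);
        Pending.clear();  // a moved-from vector is valid but unspecified; reset it
    }

    // What the user hook sees for the current tick.
    const std::vector<FContactEventSketch>& EventsForThisTick() const { return Visible; }

private:
    std::vector<FContactEventSketch> Pending;
    std::vector<FContactEventSketch> Visible;
};
```

Under this shape the tick order is: `BeginTick()`, user hook (reads last tick's contacts and sets intent), physics step (callbacks call `Record`). A contact recorded during tick N is therefore first visible in tick N+1 and gone again in tick N+2.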

Coordinate system + units (boundary translation)

UE is Z-up, left-handed, units = cm. Arcane wire is implicitly Y-up, right-handed, units = m (Rapier-flavored). The UE plugin handles the boundary translation:

  • Axis swap Y ↔ Z on every Vec3 (position, velocity) entering / leaving the wire format.
  • Unit scale ×100 on inbound (m → cm) / ×0.01 on outbound (cm → m).
  • Capsule axis: UE capsules are Z-up by default; the wire's capsule convention is Y-up. Convert at the boundary if a wire-format capsule shape is ever exposed (in practice, since UE plugin uses UE-native components, capsule axis is whatever the UE component is).

Document this explicitly in the ADR and in entity-model.md (or a sibling wire-coordinate-conventions.md if length warrants).
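The boundary translation described above could be sketched as a pair of pure functions. Swapping two axes is a reflection, so the single Y ↔ Z swap also accounts for the right-handed to left-handed change. This sketch assumes the wire X axis maps to UE X unchanged, which the text above does not state, and it covers positions and linear velocities only (rotations and angular velocities would need their own handling). `FVec3Sketch` is a hypothetical stand-in for `FVector` and the wire Vec3.

```cpp
// Hypothetical stand-in for FVector (UE side) and the wire Vec3.
struct FVec3Sketch {
    double X, Y, Z;
};

const double kCmPerMeter = 100.0;

// Inbound: wire (Y-up, right-handed, metres) -> UE (Z-up, left-handed, centimetres).
// Y and Z trade places and every component is scaled m -> cm.
inline FVec3Sketch WireToUnreal(const FVec3Sketch& W) {
    return {W.X * kCmPerMeter, W.Z * kCmPerMeter, W.Y * kCmPerMeter};
}

// Outbound: UE -> wire. Same swap, inverse scale (cm -> m).
inline FVec3Sketch UnrealToWire(const FVec3Sketch& U) {
    return {U.X / kCmPerMeter, U.Z / kCmPerMeter, U.Y / kCmPerMeter};
}
```

For example, a wire position of (1, 2, 3) metres with Y up becomes (100, 300, 200) centimetres in UE with Z up, and `UnrealToWire` maps it back. Keeping both directions in one translation unit makes the round-trip property easy to test, which is what slice 9 of the work breakdown asks for.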

Conceptual cross-reference: Rapier vocabulary ↔ UE/Chaos vocabulary

This table is documentation cross-reference, not a type-mirroring spec. The Rapier and UE plugins each use their own engine-native types; the table just shows how the concepts translate so a Rapier-experienced dev can find the equivalent UE concept and vice versa.

| Rapier (Rust) | UE / Chaos analog | Note |
| --- | --- | --- |
| RapierClusterSim (struct, impls ClusterSimulation) | UArcaneClusterSubsystem (UE UGameInstanceSubsystem or custom AInfo actor) | Owns the cluster tick |
| RapierClusterSimulation (trait) | IArcaneClusterSimulation (UE UInterface) | Game-defined per-cluster behavior |
| RapierClusterTickContext<'a> | FArcaneTickContext (USTRUCT, passed by ref to OnTick) | Tick-scoped context |
| RapierColliderShape::Ball/Capsule/Cuboid | USphereComponent / UCapsuleComponent / UBoxComponent (or Chaos collision) | Per-entity collider |
| RapierBodyKind::{Dynamic, KinematicPositionBased, KinematicVelocityBased, Fixed} | EComponentMobility::{Movable, Stationary, Static} + simulation flags | UE Mobility maps closely; some translation needed |
| RapierMaterial { friction, restitution, density } | UPhysicalMaterial | UE has the same concept; assign on collider |
| is_sensor(entry) -> bool | Collision response = Overlap-only (no Block) | UE's sensor analog |
| collision_groups_for(entry) | UE collision channels + presets | UE has 32 channels; map Arcane groups to them |
| apply_impulse(entity_id, impulse) | body->AddImpulse(...) on the entity's actor | Entity-keyed |
| apply_force(entity_id, force) | body->AddForce(...) | |
| set_translation(entity_id, position) | actor->SetActorLocation(...) (with sweep) | |
| raycast(origin, direction, max_dist) | World->LineTraceSingleByChannel(...) | |
| intersections_with_shape(shape, position) | World->OverlapMultiByChannel(...) | |
| create_joint(a, b, joint_spec) | UE's UPhysicsConstraintComponent | |
| ContactEvent { entity_a, entity_b, started } | UPrimitiveComponent::OnComponentBeginOverlap / OnComponentEndOverlap (or OnComponentHit for blocking) | Mapped to entity_id pairs via reverse-map |
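The entity-keyed rows in the cross-reference (apply_impulse and friends) all share one shape: the caller passes an entity id and the plugin resolves the engine object internally. A minimal sketch of that shape in plain C++ follows. `FEntityKeyedPhysicsSketch` and `FBodySketch` are hypothetical stand-ins. In the real plugin the lookup would presumably resolve to the entity's actor and forward to `UPrimitiveComponent::AddImpulse`, rather than integrating velocity by hand as done here.

```cpp
#include <cstdint>
#include <unordered_map>

using EntityId = std::uint64_t;  // hypothetical id type

struct FVec3Sketch {
    double X, Y, Z;
};

// Stand-in for the engine object behind an entity (an actor plus its physics body in UE).
struct FBodySketch {
    double Mass = 1.0;
    FVec3Sketch LinearVelocity{0.0, 0.0, 0.0};
};

// User code only ever holds EntityIds; the body map stays private to the plugin.
class FEntityKeyedPhysicsSketch {
public:
    void Register(EntityId Id, const FBodySketch& Body) { Bodies[Id] = Body; }
    void Unregister(EntityId Id) { Bodies.erase(Id); }

    // Returns false for an unknown entity instead of handing out a null handle.
    bool ApplyImpulse(EntityId Id, const FVec3Sketch& Impulse) {
        auto It = Bodies.find(Id);
        if (It == Bodies.end() || It->second.Mass <= 0.0) {
            return false;
        }
        FBodySketch& Body = It->second;
        // An impulse changes velocity by impulse / mass.
        Body.LinearVelocity.X += Impulse.X / Body.Mass;
        Body.LinearVelocity.Y += Impulse.Y / Body.Mass;
        Body.LinearVelocity.Z += Impulse.Z / Body.Mass;
        return true;
    }

    bool GetLinearVelocity(EntityId Id, FVec3Sketch& Out) const {
        auto It = Bodies.find(Id);
        if (It == Bodies.end()) {
            return false;
        }
        Out = It->second.LinearVelocity;
        return true;
    }

private:
    std::unordered_map<EntityId, FBodySketch> Bodies;
};
```

Because no engine handle crosses the API boundary, a despawned or migrated entity simply makes the call return false, and user code cannot create bodies the cluster does not know about.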

Default integration shape: Unreal-native

From #8's "Integration shapes (1/2/3)", we land on shape #2 — Unreal-native cluster:

  • One process: Unreal Dedicated Server. Chaos runs in-engine.
  • Cluster networking is in UE C++. UE talks to Redis (cpp-redis or hiredis), WS clients (uWebSockets or Boost.Beast or a thin Rust FFI shim), Arcane Manager (UE's FHttpModule).
  • No Rust/UE FFI for the hot path. Wire-format serialization (postcard) is the one place an FFI shim might live, since reimplementing postcard in C++ is non-trivial — but even that can be a thin C-API wrapper if we choose. Decision deferred until implementation.
  • Same wire bytes as Rust clusters. A heterogeneous deployment (some Rust clusters, some Unreal clusters) just works at the protocol level.

Why not shape #1 (Rust cluster + Unreal sidecar) or shape #3 (Hybrid Rust↔Unreal FFI)?

The decision should still be captured in an ADR per #8's acceptance criteria — the ADR can reference this epic as the implementation plan.

What about reusing the Rust networking primitives?

The Rapier work uses arcane-infra::cluster_runner::run_cluster_loop which contains the WS server, Redis pub/sub, neighbor merge, stats HTTP, and persistence path. The Unreal cluster needs the same shape but in UE.

Two paths to consider during implementation (decision in ADR):

  • Reimplement in C++: ~500-1000 lines of UE C++ wrapping cpp-redis, a WS server library, and FHttpModule. Maintenance: two networking codebases, must stay byte-compatible.
  • Thin FFI bridge: expose arcane-infra networking as a C-API; UE calls into it via a single arcane_cluster_native shared library. Maintenance: one codebase; FFI surface is small (a few function pointers for "publish delta," "drain neighbor deltas," "send to WS clients").

Lean toward the FFI bridge because: same byte format guaranteed; bug fixes flow once; matches existing arcane-client-unreal plugin's likely structure.
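To make the "few function pointers" idea concrete, here is a hedged sketch of what such a C ABI table might look like from the UE side. Every name below (`ArcaneNativeApiSketch`, the three function pointers, the loopback mock) is hypothetical; no such C-API exists in arcane-infra as far as this plan states. The in-memory loopback only stands in for the Rust library so the sketch can be exercised without it, and it treats deltas as an unframed byte stream, which a real bridge would not.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

extern "C" {
// Hypothetical ABI table the shared library would hand to UE.
struct ArcaneNativeApiSketch {
    void* Ctx;
    std::int32_t (*PublishDelta)(void* Ctx, const std::uint8_t* Bytes, std::size_t Len);
    std::size_t (*DrainNeighborDeltas)(void* Ctx, std::uint8_t* Out, std::size_t Cap);
    std::int32_t (*SendToWsClients)(void* Ctx, const std::uint8_t* Bytes, std::size_t Len);
};
}

// Loopback mock: published bytes come straight back as "neighbor deltas".
struct FLoopbackState {
    std::vector<std::uint8_t> Queue;
    std::size_t WsBytesSent = 0;
};

extern "C" {
static std::int32_t LoopbackPublish(void* Ctx, const std::uint8_t* Bytes, std::size_t Len) {
    auto* State = static_cast<FLoopbackState*>(Ctx);
    State->Queue.insert(State->Queue.end(), Bytes, Bytes + Len);
    return 0;  // 0 = success in this sketch
}

static std::size_t LoopbackDrain(void* Ctx, std::uint8_t* Out, std::size_t Cap) {
    auto* State = static_cast<FLoopbackState*>(Ctx);
    const std::size_t N = std::min(Cap, State->Queue.size());
    const auto End = State->Queue.begin() + static_cast<std::ptrdiff_t>(N);
    std::copy(State->Queue.begin(), End, Out);
    State->Queue.erase(State->Queue.begin(), End);
    return N;  // number of bytes written to Out
}

static std::int32_t LoopbackSend(void* Ctx, const std::uint8_t*, std::size_t Len) {
    static_cast<FLoopbackState*>(Ctx)->WsBytesSent += Len;
    return 0;
}
}

static ArcaneNativeApiSketch MakeLoopbackApi(FLoopbackState& State) {
    ArcaneNativeApiSketch Api;
    Api.Ctx = &State;
    Api.PublishDelta = &LoopbackPublish;
    Api.DrainNeighborDeltas = &LoopbackDrain;
    Api.SendToWsClients = &LoopbackSend;
    return Api;
}
```

A table of plain function pointers over opaque byte buffers keeps the FFI surface independent of both Rust and UE types, so serialization stays entirely on the Rust side and the UE module can be tested against a mock like this one.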

Implementation work breakdown

Each line below is a candidate sub-issue once this epic is approved. Sized to roughly map to the Rapier sub-issues (#117, #118, #120, #121, #122).

| # | Slice | Roughly equivalent to |
| --- | --- | --- |
| 1 | Plugin scaffold + build. UE Dedicated Server target builds; loads an Arcane-cluster module. Empty cluster runs the tick loop and prints stats. | #117-equivalent: minimum integration |
| 2 | Networking layer. Redis pub/sub, WS server, Manager HTTP. Wire-format serialization (probably FFI shim to arcane-infra initially). | foundational |
| 3 | Entity ↔ Actor mapping. First-sight spawn creates an AActor; despawn destroys it. Default Static Mesh + sphere collider. Sync entity.position/velocity ↔ actor transform / linvel. | #117-style entity binding |
| 4 | IArcaneUnrealClusterSimulation interface + tick dispatch. Game's UE module implements OnTick(FArcaneTickContext&); cluster calls it. UE-native, not a mirror of RapierClusterSimulation. | #118-equivalent for UE |
| 5 | Engine-native spawn-time customization. Game devs configure their AArcaneUnrealEntity subclass with UE-native Mobility, attached UPrimitiveComponent colliders, UPhysicalMaterial, collision profile. Plugin reads these. No BodyKindFor mirror enum at the Arcane API level. | #120-equivalent in function, UE-native in form. |
| 6 | Contact events. UE OnComponentBeginOverlap / OnComponentEndOverlap (and OnComponentHit for blocking) → buffered as FArcaneContactEvent, surfaced one tick later in FArcaneTickContext::ContactEvents. | #118-equivalent |
| 7 | In-tick imperative ops. ApplyImpulse(EntityId, FVector), ApplyForce, SetTranslation, Raycast, IntersectionsWithShape, CreateJoint(EntityId a, EntityId b, …): all entity-keyed, return UE-native types where applicable. | #121-equivalent |
| 8 | IArcaneUnrealMapProvider for terrain. Game implements; returns UE-native collision (UStaticMesh* / ULandscapeComponent*). Runtime computes chunks-needed and integrates with World Partition cell streaming. | #119-equivalent for UE |
| 9 | Coordinate / unit translation at the boundary. Y↔Z swap and ×100 / ×0.01 unit scale on every wire-format entry/exit. Capsule axis convention. Tested explicitly. | new, UE-specific |
| 10 | End-to-end smoke test. Headless UE Dedicated Server runs as cluster; pushes physics-evolved deltas to a real Redis; a small Rust test client connects and verifies wire format. Heterogeneous deployment test: a Rapier cluster and an Unreal cluster in the same deployment exchange neighbor deltas correctly. | parallel of the Rapier binary smoke test |
| 11 | Gap inventory tracker. Living document: what's reachable now from Unreal/Chaos game code via the plugin's API vs locally in Unreal; what's blocked on which architectural epic. | #122-equivalent for UE |
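Slice 3 (first-sight spawn, despawn on disappearance) is essentially a set reconciliation between the entities the cluster owns this tick and the actors that currently exist. A minimal sketch, using hypothetical stand-ins: `FEntityStateSketch` is an assumed subset of EntityStateEntry, and `FActorSketch` stands in for an AActor.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <unordered_set>
#include <vector>

using EntityId = std::uint64_t;  // hypothetical id type

struct FVec3Sketch {
    double X, Y, Z;
};

// Assumed subset of the wire-format EntityStateEntry.
struct FEntityStateSketch {
    EntityId Id;
    FVec3Sketch Position;
};

// Stand-in for an AActor; only its location matters here.
struct FActorSketch {
    FVec3Sketch Location;
};

class FEntityActorMapSketch {
public:
    // Reconciles actors against the entities owned this tick.
    // Returns how many actors were spawned on first sight.
    std::size_t Reconcile(const std::vector<FEntityStateSketch>& Owned) {
        std::unordered_set<EntityId> Seen;
        std::size_t Spawned = 0;
        for (const FEntityStateSketch& Entry : Owned) {
            Seen.insert(Entry.Id);
            // emplace is a no-op for known ids, so physics keeps owning
            // the transform of actors that already exist.
            if (Actors.emplace(Entry.Id, FActorSketch{Entry.Position}).second) {
                ++Spawned;
            }
        }
        // Despawn: any actor whose entity is no longer owned is destroyed.
        for (auto It = Actors.begin(); It != Actors.end();) {
            if (Seen.count(It->first) == 0) {
                It = Actors.erase(It);
            } else {
                ++It;
            }
        }
        return Spawned;
    }

    bool Has(EntityId Id) const { return Actors.count(Id) != 0; }
    std::size_t Num() const { return Actors.size(); }

private:
    std::unordered_map<EntityId, FActorSketch> Actors;
};
```

In the real plugin the spawn and erase steps would create and destroy actors in the UE world; the map itself also serves as the forward half of the entity ↔ actor lookup that the contact-event and in-tick-ops slices rely on.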

Open design questions

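  • UE version pin. UE 5.4 (most stable) vs 5.5 (current) vs 5.6 (newest). Epic does not designate LTS releases, so "most stable" here is a judgment call rather than an official support tier. The Chaos API has shifted between minor versions. Decision needed in the ADR.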
  • Plugin vs source distribution. Marketplace plugin (closed-source-ish friendly) vs source plugin in arcane-client-unreal (rename to arcane-unreal since it'd contain server-side code) vs new repo arcane-unreal-server.
  • Networking implementation: C++ native vs FFI bridge to arcane-infra. ADR should call this out explicitly.
  • Unit-test strategy. UE Automation tests work but physics-driven correctness is hard to assert in-engine. Possibly: integration tests that step the engine programmatically; rely on the smoke-test path for end-to-end confidence; carefully document contracts.
  • Postcard / arcane-wire serialization in C++. Either reimplement (small surface, but maintenance overhead) or FFI to Rust (zero drift, small FFI). Lean FFI.
  • Hot reload during development. UE's hot reload can corrupt physics state mid-session. Probably the dev workflow is "stop server, rebuild, restart" — same as today's UE Dedicated Server work.
  • Determinism / cross-machine reconciliation. Same problem as Rapier (Chaos is deterministic per-process per-build, not across hardware). Reconciliation strategy is a separate cross-cutting concern.
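On the serialization question above, the "small surface" of a C++ reimplementation can be gauged from postcard's primitives. Per the postcard wire specification, integers wider than one byte are written as unsigned LEB128 varints and f32 values as four little-endian bytes; this is worth confirming against whichever postcard version arcane-wire pins. The sketch below covers only those two primitives and says nothing about the actual field layout of EntityStateEntry. The function names are hypothetical.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Unsigned LEB128: 7 payload bits per byte, high bit set on every byte but the last.
inline void WriteVarintU64(std::uint64_t Value, std::vector<std::uint8_t>& Out) {
    while (Value >= 0x80) {
        Out.push_back(static_cast<std::uint8_t>((Value & 0x7F) | 0x80));
        Value >>= 7;
    }
    Out.push_back(static_cast<std::uint8_t>(Value));
}

// f32 as its IEEE-754 bit pattern, least significant byte first.
inline void WriteF32LE(float Value, std::vector<std::uint8_t>& Out) {
    std::uint32_t Bits = 0;
    std::memcpy(&Bits, &Value, sizeof Bits);  // well-defined type pun
    for (int I = 0; I < 4; ++I) {
        Out.push_back(static_cast<std::uint8_t>((Bits >> (8 * I)) & 0xFF));
    }
}
```

For instance, 300 encodes as the two bytes 0xAC 0x02 and 1.0f as 0x00 0x00 0x80 0x3F. The primitives are simple; the drift risk in a reimplementation comes from struct field order, enum discriminants, and option encoding, which is the argument for the FFI lean.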

What's NOT in this epic

  • Cross-cluster physics (kinematic neighbor proxies). Same multi-cluster problem as the Rapier work; deferred. When that epic lands, Unreal proxies and Rapier proxies should look the same on the wire.
  • Unreal-native client improvements. Game-client work is arcane-client-unreal's scope, separate from cluster-server work. They share serialization / wire format but solve different problems.
  • Map authoring tooling. Standard UE level / World Partition workflow.
  • Asset pipeline. Same as standard UE Dedicated Server builds.
  • Determinism story across Rust and Unreal clusters. Genuinely hard; separate epic when needed.

Acceptance criteria

  • ADR landed in arcane/docs/architecture/adr/ capturing: integration shape (Unreal-native), UE version pin, networking-implementation decision (C++ native vs FFI), tick/substep policy, build system layout.
  • Plugin scaffold + Dedicated Server target builds as a valid UE Dedicated Server.
  • Empty cluster runs: binary starts, connects to Redis, accepts WS clients, runs the tick loop without errors. Equivalent of the Rapier smoke test in #123.
  • Entity ↔ Actor mapping with default sphere collider works end-to-end.
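  • IArcaneUnrealClusterSimulation tick dispatch and engine-native spawn-time customization cover the same functionality as the Rapier API surface (#118 / #120-equivalent in function, UE-native in form; no mirrored hook types, per architectural decisions 2 and 6).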
  • In-tick imperative ops match #121-equivalent surface.
  • Contact events surface with one-tick delay, mapped from UE's overlap/hit events.
  • Wire-format byte-compatibility with Rust clusters demonstrated by: a Rust cluster and an Unreal cluster in the same deployment exchanging neighbor deltas correctly.
  • End-to-end smoke test in CI (or documented manual run) — headless UE cluster + Redis + Rust test client.
  • Gap inventory issue filed mirroring #122's pattern, listing what's reachable from Unreal cluster code vs locally.
  • Architecture doc updates if the implementation reveals anything not yet captured (e.g., specific Chaos integration points that need calling out).

Rapier learnings worth re-stating for the Unreal implementer

These are the things the Rapier integration got wrong first and corrected — worth getting right from the start in Unreal:

  1. Don't expose engine handles to user code. Early Rapier sketches passed &mut RigidBodySet to user code. Footgun: off-spine bodies, cross-cluster joint surprises. The right shape is entity-keyed methods on a tick-context handle. Same for UE: never pass raw AActor* or Chaos handles to game code; expose entity-keyed helper methods on IArcaneUnrealPhysics instead.
  2. Don't mirror types across engines. Earlier framing of this epic implied UE should mirror RapierBodyKind as EArcaneBodyKind. Wrong. UE plugin uses UE-native Mobility directly; the conceptual mapping lives in docs (entity-model.md §8), not in parallel UE enums. Same for collider shapes, materials, collision groups. Per-engine types, shared concept.
  3. Test every documented contract. The Rapier work has 38 unit tests; every doc sentence has a test that breaks if the doc breaks. Aim for the equivalent — even if Unreal's testing story is harder, every contract in the API needs a way to verify it (smoke test, integration test, or documented manual check).
  4. API stability discipline from day 1. #[non_exhaustive] is the Rust equivalent of UE's pImpl / forward declarations / interface-stable virtuals. Bake in the discipline before any external UE project touches the API.
  5. End-to-end smoke test as part of "done." Unit tests prove the wrapper composes correctly; they don't prove the binary runs against Redis with a real WS client. The Rapier work's 1000-tick smoke test against running Redis was the integration confidence that no unit test could provide.
  6. Skill-driven review pass. When the implementation lands, run the simplify and security-review skills the same way the Rapier work did. The simplify pass on Rapier surfaced ~8 real findings (Vec3 helpers, HashSet for removals, env-parser dedup, Mutex-poison handling, etc.) that wouldn't have shown up in normal review.
  7. Pre-push verification matrix must include formatter check. PR #123 failed CI on first push because cargo fmt --check flagged long-line formatting that the local build/test/clippy run didn't catch. UE equivalent: include UE's standard formatting check (.uproject / .uplugin consistency, header guards, IWYU if enabled) in the pre-push checks alongside compile + Automation tests.
  8. Strip release-stage labels ("v1", "v2", "MVP") from doc comments and code — describe by capability, not by stage. Memory feedback_no_release_stage_framing.md enforces this.
  9. The empty-cluster invariant matters. Don't run physics work for a cluster with zero entities. UE's tick can be skipped or no-op'd in this case. Same logic as Rapier's "skip step when accumulator empty + no entities" — Rapier's sleep mechanism + Fixed-body solver-skip preserved this; UE's tick-rate management has the equivalent.
  10. Entity migration triggers despawn-respawn semantics on the new cluster. Test this case explicitly: Rapier's state_round_trips_through_despawn_respawn and contact_events_do_not_carry_across_handoff tests pin this. The Unreal equivalent: when UE Dedicated Server N takes ownership of an entity from Server M, M destroys its local representation (an AActor if M runs Unreal, a Rapier body if M runs Rapier, since deployments can be heterogeneous) and N constructs a fresh AActor from the wire-format EntityStateEntry. Cross-engine migration is the same code path.
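For learning 4, the pImpl discipline can be sketched in standard C++ as follows. The class name `FArcaneEntityRegistrySketch` is hypothetical, and a UE module would likely use `TUniquePtr` in place of `std::unique_ptr`. The point is only that the public class holds a single pointer to a forward-declared type, so fields can be added to the implementation without changing the layout downstream projects compiled against.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <unordered_set>

using EntityId = std::uint64_t;  // hypothetical id type

// ---- Public header: what downstream UE projects compile against ----
class FArcaneEntityRegistrySketch {
public:
    FArcaneEntityRegistrySketch();
    ~FArcaneEntityRegistrySketch();  // defined below, where FImpl is complete
    FArcaneEntityRegistrySketch(const FArcaneEntityRegistrySketch&) = delete;
    FArcaneEntityRegistrySketch& operator=(const FArcaneEntityRegistrySketch&) = delete;

    void Track(EntityId Id);
    std::size_t NumTracked() const;

private:
    struct FImpl;                 // forward declaration only
    std::unique_ptr<FImpl> Impl;  // the class layout is one pointer, always
};

// ---- Private implementation: would live in the .cpp ----
struct FArcaneEntityRegistrySketch::FImpl {
    std::unordered_set<EntityId> Tracked;
    // New fields can be added here without touching the public header.
};

FArcaneEntityRegistrySketch::FArcaneEntityRegistrySketch() : Impl(new FImpl()) {}
FArcaneEntityRegistrySketch::~FArcaneEntityRegistrySketch() = default;

void FArcaneEntityRegistrySketch::Track(EntityId Id) { Impl->Tracked.insert(Id); }

std::size_t FArcaneEntityRegistrySketch::NumTracked() const { return Impl->Tracked.size(); }
```

Declaring the destructor in the header and defining it next to `FImpl` is what lets `std::unique_ptr` hold an incomplete type; omitting that step is the usual compile error when first adopting the pattern.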

Reference

  • Parent EPIC: #8 — Cluster physics backends — Unreal (Chaos) first, multi-engine path. This Unreal-cluster-node epic is the concrete implementation track for that umbrella.
  • Sibling implementation track: #117 (Rapier minimum integration), #118 (Rapier contact events + colliders), #123 (Rapier PR), #120 (Rapier spawn hooks), #121 (Rapier in-tick ops), #122 (Rapier gap inventory).
  • Strategic context: #33 — Engine-specific node types — heterogeneous physics tiers. Unreal Cluster Node is the "engine-parity tier" in that vision.
  • Architectural prerequisites (already landed):
  • Cross-cutting (not blocking):
    • #119 — Terrain epic. Implementation will use UE World Partition for the cluster's chunks-needed loading.
    • Spatial-binding clustering epic (not yet filed) — Fixed-body entities migrate by chunk ownership, not by PGP affinity.
  • Existing UE-side code: arcane-client-unreal — currently a UE5 client plugin; may or may not be the right home for cluster-server code (decision in ADR).
