ipc: use shared memory for large events #972
Draft
matthew-levan wants to merge 46 commits into ml/64
Shared Memory Plea Protocol
Large pokes (file system commits, etc.) sent from urth to mars previously
travelled over the Unix pipe using the standard newt/jam/cue path. For
payloads above ~256 MiB this caused severe memory pressure and, for payloads
approaching 2 GiB, a process segfault.
This document describes the replacement: a POSIX shared memory fast-path that
bypasses the pipe for large events, copies the raw loom noun structure directly
(no jam/cue), and keeps peak memory well within reach of a 16 GiB machine.
Problem
The standard path for sending an event from urth to mars is:
Jam the noun into a ~2 GiB C-heap buffer (dat_y).
Send dat_y over the pipe (5-byte-header newt framing).
For a 2 GiB file commit this requires:
~2 GiB C-heap for the jammed bytes
~2 GiB C-heap for the cue dictionary (transient) + ~2 GiB loom for the decoded noun + ~2 GiB loom for the re-jammed LMDB event = ~6 GiB peak
Design
Protocol
A new %plea message type is added to the urth↔mars IPC protocol (alongside the existing %poke, %peek, %live, etc.).
The normal %poke response from mars back to urth is unchanged; the plea writ is converted to a poke writ in-place before %done is sent so that mars's eventual [%poke ...] reply matches through the standard writ-queue path.
Threshold
The plea path is taken when the serialized noun size exceeds _UNIX_PLEA_THRESHOLD (256 MiB), currently triggered from _unix_update_mount in pkg/vere/io/unix.c.
Shared memory ownership
Mars creates the shm region (shm_open + ftruncate), sends the name to urth, then waits for %done.
Urth mmaps the region, writes the serialized noun, then munmaps and sends %done. Urth never owns the shm region.
Mars, on receiving %done, mmaps the shm read-only, deserializes the noun, then munmaps and shm_unlinks before continuing.
Noun serialization: raw loom copy (no jam/cue)
Instead of jam/cue, the shm buffer holds a compact binary encoding of the raw loom noun structure, implemented in pkg/noun/allocate.c:
u3a_noun_shm_size(u3_noun som) → c3_d: DFS traversal counting bytes needed. Handles DAG sharing via a ur_dict64_t (loom offset → sentinel). Returns total byte length including the 16-byte header.
u3a_noun_to_shm(u3_noun som, c3_y* shm_y, c3_d cap_d) → c3_d: Iterative post-order DFS. Writes each unique indirect object (atom or cell) exactly once in child-before-parent order. Returns bytes written.
u3a_noun_from_shm(const c3_y* shm_y, c3_d len_d) → u3_weak: Single-dict two-phase deserializer (see below). Returns the root noun allocated on the current road, or u3_none on error.
SHM buffer format
Noun values in shm-offset space use the top two bits as a tag:
00xxxxxxx…: direct atom (fits in 62 bits); stored as-is
10xxxxxxx…: indirect atom; low 62 bits = byte offset into data section
11xxxxxxx…: cell; low 62 bits = byte offset into data section
Each atom entry in the data section:
Each cell entry in the data section:
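The two-bit tag scheme for noun values can be sketched with a few helpers. This is only an illustration: the macro and function names are hypothetical, the bit assignments are the only part taken from the description above, and the atom and cell entry layouts themselves are not modeled here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; only the tag bit assignments come from the text. */
typedef uint64_t shm_ref;

#define SHM_TAG_MASK  (UINT64_C(3) << 62)  /* top two bits                  */
#define SHM_TAG_ATOM  (UINT64_C(2) << 62)  /* 10: indirect atom             */
#define SHM_TAG_CELL  (UINT64_C(3) << 62)  /* 11: cell                      */
#define SHM_OFF_MASK  (~SHM_TAG_MASK)      /* low 62 bits: offset or value  */

/* 00: the value is a direct atom and is stored as-is. */
static inline int shm_is_direct(shm_ref ref) {
  return (ref & SHM_TAG_MASK) == 0;
}

static inline int shm_is_atom(shm_ref ref) {
  return (ref & SHM_TAG_MASK) == SHM_TAG_ATOM;
}

static inline int shm_is_cell(shm_ref ref) {
  return (ref & SHM_TAG_MASK) == SHM_TAG_CELL;
}

/* Byte offset into the data section (meaningful for 10 and 11 tags). */
static inline uint64_t shm_offset(shm_ref ref) {
  return ref & SHM_OFF_MASK;
}

/* Build a tagged reference; the offset must fit in 62 bits. */
static inline shm_ref shm_make_atom(uint64_t off) {
  assert((off & SHM_TAG_MASK) == 0);
  return SHM_TAG_ATOM | off;
}

static inline shm_ref shm_make_cell(uint64_t off) {
  assert((off & SHM_TAG_MASK) == 0);
  return SHM_TAG_CELL | off;
}
```

A decoder would test shm_is_direct first and only consult shm_offset for the two tagged cases.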
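u3a_noun_from_shm: single-dict two-phase approach
A single ur_dict64_t serves both phases, halving peak C-heap vs a two-dict approach:
Phase 1: count how many times each shm offset appears as hed, tel, or root. Store dict[shm_off] = refcount.
Phase 2: for each entry, read use_d = dict[shm_off], allocate the loom noun with use_w = use_d, then overwrite dict[shm_off] = loom_noun.
This is safe because data is written in post-order: when phase 2 encounters a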
use_w = use_d, then overwritedict[shm_off] = loom_noun.This is safe because data is written in post-order: when phase 2 encounters a
cell, both children have already been processed and their dict entries already
hold the resolved loom nouns.
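The slot-reuse idea can be illustrated with a toy decoder. This is a sketch under several assumptions, not the code in pkg/noun/allocate.c: entries are structs in an array rather than bytes in a buffer, a plain array indexed by entry number stands in for ur_dict64_t, every leaf is an atom entry (no direct atoms), nodes are malloc'd instead of allocated on the loom, and all names (toy_entry, toy_noun, toy_decode) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* One serialized entry; entries appear in post-order (children first). */
typedef struct {
  int      is_cell;
  uint64_t a;   /* atom: value; cell: index of the head entry */
  uint64_t b;   /* cell: index of the tail entry; unused for atoms */
} toy_entry;

/* Stand-in for a loom noun with a reference count. */
typedef struct toy_noun {
  uint64_t         use;       /* set once, from the phase-1 count */
  int              is_cell;
  uint64_t         val;       /* atom payload */
  struct toy_noun *hed, *tel; /* resolved children (cells only) */
} toy_noun;

/* Returns the root node, or NULL on malformed input. */
static toy_noun* toy_decode(const toy_entry* ent, size_t len, size_t root)
{
  if (root >= len) return NULL;

  /* Single table: holds refcounts in phase 1, node pointers in phase 2. */
  uint64_t* dict = calloc(len, sizeof *dict);
  if (!dict) return NULL;

  /* Phase 1: count references to each entry from heads, tails, and root. */
  for (size_t i = 0; i < len; i++) {
    if (ent[i].is_cell) {
      if (ent[i].a >= i || ent[i].b >= i) {  /* child must precede parent */
        free(dict);
        return NULL;
      }
      dict[ent[i].a]++;
      dict[ent[i].b]++;
    }
  }
  dict[root]++;

  /* Phase 2: allocate each node with its final count, then overwrite the
     slot with the node pointer. Children were visited earlier, so their
     slots already hold pointers. */
  for (size_t i = 0; i < len; i++) {
    toy_noun* n = malloc(sizeof *n);
    assert(n != NULL);  /* a real decoder would unwind instead */
    n->use     = dict[i];
    n->is_cell = ent[i].is_cell;
    n->val     = ent[i].is_cell ? 0 : ent[i].a;
    n->hed     = ent[i].is_cell ? (toy_noun*)(uintptr_t)dict[ent[i].a] : NULL;
    n->tel     = ent[i].is_cell ? (toy_noun*)(uintptr_t)dict[ent[i].b] : NULL;
    dict[i]    = (uint64_t)(uintptr_t)n;
  }

  toy_noun* out = (toy_noun*)(uintptr_t)dict[root];
  free(dict);
  return out;
}
```

Because each slot is read as a count immediately before it is overwritten with a pointer, the two meanings never overlap for the same entry, which is what lets one table do the work of two.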
Dict pre-sizing
Large nouns would cause many costly resize generations. Both u3a_noun_to_shm and u3a_noun_from_shm pre-size their ur_dict64_t via _shm_dict_init, which picks the smallest Fibonacci pair such that the initial bucket count can hold the estimated node count (dat_d / _SHM_CELL_SIZE) without resizing. The Fibonacci table in pkg/ur/defs.h was extended from ur_fib34 through ur_fib36 to cover the required range.
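The pair selection can be sketched as a simple walk up the Fibonacci sequence. This is an illustration, not the body of _shm_dict_init: the names fib_pair and fib_pair_for, the per-bucket capacity parameter, and the 89/144 starting pair are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Two consecutive Fibonacci numbers: previous and current bucket count. */
typedef struct {
  uint64_t prev;
  uint64_t size;
} fib_pair;

/* Smallest consecutive pair whose `size` buckets can hold `est_nodes`
   entries, assuming each bucket holds up to `per_bucket` entries.
   Both parameters are hypothetical inputs for this sketch. */
static fib_pair fib_pair_for(uint64_t est_nodes, uint64_t per_bucket)
{
  assert(per_bucket > 0);

  /* Buckets needed, rounded up without risking overflow. */
  uint64_t need = est_nodes / per_bucket + (est_nodes % per_bucket != 0);

  fib_pair p = { 89, 144 };  /* assumed small starting pair */

  while (p.size < need) {
    uint64_t next = p.prev + p.size;
    if (next < p.size) break;  /* uint64_t overflow: keep the largest pair */
    p.prev = p.size;
    p.size = next;
  }
  return p;
}
```

With an estimate in hand before the first insert, the table starts at its final size and skips every intermediate grow-and-rehash step.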
Files Changed
pkg/c3/motes.h: c3__plea mote
pkg/ur/defs.h: ur_fib29–ur_fib36
pkg/vere/vere.h: u3_writ_plea enum value; pla_u struct in u3_writ union; u3_lord_plea() declaration
pkg/vere/mars.h: u3_mars_plea_e state; pla_u struct in u3_mars
pkg/vere/lord.c: _lord_plea_plea() handler; u3_lord_plea() public API; %plea dispatch in writ machinery
pkg/vere/mars.c: %plea and %done cases in _mars_work(); state guard in u3_mars_kick()
pkg/vere/io/unix.c: _unix_plea_ctx, _unix_plea_fill, plea branch in _unix_update_mount()
pkg/noun/allocate.h: u3a_noun_shm_size, u3a_noun_to_shm, u3a_noun_from_shm
pkg/noun/allocate.c: _shm_dict_init helper
Test Results: 2 GiB File Commit
Tested on Apple M-series (ARM64, macOS 26.3), 16 GiB RAM, --urth-loom 34 (16 GiB virtual loom). The commit consisted of a single ~2 GiB binary file written via the Clay Unix mount. vmmap snapshots were taken immediately after the commit completed (both processes idle, LMDB write in progress).
Urth (31135)
… old_to_shm dict from u3a_noun_to_shm. Before pre-sizing this was 33 regions and was still live (4.0 GiB) when sampled mid-serialize.
MALLOC_LARGE is the jammed event buffer (u3_feat::hun_y in disk.c) held pending async LMDB write; this is unavoidable given the jam-based event log.
VM_ALLOCATE is the urth loom's dirty pages (~2 GiB noun + working set), all of which are MAP_ANON | MAP_PRIVATE.
Mars (31136)
… u3a_noun_from_shm, pre-sized, few resize generations)
… u3qe_jam (starts at fib11/fib12, grows through many generations for a 2 GiB noun; this is the dominant contributor and is independent of the plea protocol)
… buffer (dat_y) freed after the async LMDB write completed.
… DefaultMallocZone confirms no malloc leak from the plea/decode path.
Peak breakdown
Comparison: old pipe path vs plea protocol
Known Limitations / Future Work
MALLOC_LARGE (empty) is dominated by u3qe_jam's internal dict (used when writing the decoded noun to the event log). Pre-sizing that dict from the known noun size would reduce mars peak by ~1–2 GiB.
A lower threshold (e.g. 64 MiB) would engage the plea path more aggressively, but the per-call overhead (shm creation, two IPC round-trips) is small.
The implementation uses standard POSIX interfaces (shm_open, mmap, munmap, shm_unlink) and should be portable, but has not yet been profiled on Linux.
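For readers porting or profiling on another platform, the POSIX calls named above fit together roughly as follows under the ownership rules described earlier. This is a minimal single-process sketch, not the vere code: the function names region_create, region_fill, and region_drain are hypothetical, the region name is passed directly instead of travelling over IPC, and a plain memcpy stands in for noun serialization.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Owner role (mars in the text): create and size the region. */
static int region_create(const char* nam_c, size_t len)
{
  assert(len > 0);
  int fd = shm_open(nam_c, O_CREAT | O_EXCL | O_RDWR, 0600);
  if (fd < 0) return -1;
  if (ftruncate(fd, (off_t)len) != 0) {
    close(fd);
    shm_unlink(nam_c);
    return -1;
  }
  return close(fd);
}

/* Writer role (urth in the text): map, fill, unmap. Never unlinks. */
static int region_fill(const char* nam_c, const void* src, size_t len)
{
  int fd = shm_open(nam_c, O_RDWR, 0);
  if (fd < 0) return -1;
  void* map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  close(fd);  /* the mapping stays valid after the descriptor is closed */
  if (map == MAP_FAILED) return -1;
  memcpy(map, src, len);
  return munmap(map, len);
}

/* Owner role again: map read-only, consume, unmap, then unlink. */
static int region_drain(const char* nam_c, void* dst, size_t len)
{
  int ret = -1;
  int fd  = shm_open(nam_c, O_RDONLY, 0);
  if (fd >= 0) {
    void* map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (map != MAP_FAILED) {
      memcpy(dst, map, len);
      ret = munmap(map, len);
    }
  }
  shm_unlink(nam_c);  /* the owner always removes the name */
  return ret;
}
```

On older glibc versions linking may require -lrt, and macOS limits shm names to a short length, so a portable caller would keep names brief.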