
Add formal API-level memory consistency model #570

Open
joalsop wants to merge 7 commits into openshmem-org:main from joalsop:joalsop/formal_mcm

Conversation


@joalsop joalsop commented Jan 26, 2026

Add a section formally defining the memory ordering guaranteed by the OpenSHMEM API. This formalization section describes how a formal PL-level memory model such as C++'s can be extended to define the API-level memory semantics provided by OpenSHMEM. This adds clarity and reduces ambiguity about how OpenSHMEM interacts with the memory system, which is increasingly important as the API is implemented on accelerators with software-managed coherence.

Summary of changes

Proposal Checklist

  • Link to issue(s)
  • Changelog entry
  • Reviewed for changes to front matter
  • Reviewed for changes to back matter

fix some formatting and grammar errors

\textbf{Axioms}

\begin{itemize}
@joalsop (Author), Feb 5, 2026:

These axioms are drawn from the axiomatic formalization of the C++ model as described by Batty et al. in "Overhauling SC Atomics in C11 and OpenCL", 2016 (https://johnwickerson.github.io/papers/openclmm.pdf) - not sure if citation is warranted, and if so, how best to cite.


Remote delivery order relates accesses separated by an \openshmem fence operation. However, this relation does not establish full connectivity; it only forms a connection between observable accesses to the same PE from operations \textit{hb} ordered before and after the fence, and between \textit{hb} ordered accesses before a fence and observable accesses from operations \textit{hb} ordered after a fence. Specifically, $(A, B) \in rdo$ only if there is an \openshmem fence operation $Fop$ and any of the following conditions are true:
\begin{itemize}
\item $A$ is a memory access that is \textit{hb} ordered before $Fop$, $B$ is an observable access issued by an operation that is \textit{hb} ordered after $Fop$, and the context associated with $Fop$ does NOT have the $OPENSHMEM\_CTX\_NOSTORE$ option enabled.
@joalsop (Author):

Note: As currently defined, a fence is required to establish "release" ordering semantics for a synchronizing remote write (e.g., signal_set), established via rdo. In contrast, "acquire" ordering semantics are implicit for synchronizing local reads (e.g., wait_until), established via lco.
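A sketch of this pattern (informal notation as used elsewhere in this thread; `data` and `val` are hypothetical symmetric objects), if I read the draft relations correctly:

0: put(pe=1, data); fence(); signal_set(pe=1, val, 1)
1: wait_until(val, 1); LD(data)

Here the fence establishes rdo between the put's remote store and the signal_set store (the "release" side), while wait_until provides the "acquire" side implicitly via lco, so PE 1's LD(data) should be guaranteed to observe the value written by PE 0's put. Without the fence on PE 0, no such guarantee would exist.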


\textbf{qb-ordered operations}: blocking or nonblocking put, nonblocking get, blocking or nonblocking atomic memory operation, blocking or nonblocking put\_signal.

\subsubsubsection{API Synchronizes-With (asw)}
@joalsop (Author):

Note: As currently defined, asw is only established between observable operation accesses. Thus, we can't use PL-level local atomics to establish happens-before ordering between PEs (i.e., a signal_set() cannot synchronize with an atomicLD(memory_order=acquire)).


Remote delivery order relates accesses separated by an \openshmem fence operation. However, this relation does not establish full connectivity; it only forms a connection between observable accesses to the same PE from operations \textit{hb} ordered before and after the fence, and between \textit{hb} ordered accesses before a fence and observable accesses from operations \textit{hb} ordered after a fence. Specifically, $(A, B) \in rdo$ only if there is an \openshmem fence operation $Fop$ and any of the following conditions are true:
\begin{itemize}
\item $A$ is a memory access that is \textit{hb} ordered before $Fop$, $B$ is an observable access issued by an operation that is \textit{hb} ordered after $Fop$, and the context associated with $Fop$ does NOT have the $OPENSHMEM\_CTX\_NOSTORE$ option enabled.
@joalsop (Author):

Not specified in the spec, but in order to enable sensible release-style semantics for synchronizing openshmem remote stores, fences should order all prior memory accesses (not just stores)

Contributor:

I don't think that is an expected behavior; in this case one should use barrier_all (if I understand the case you discuss correctly), e.g.:
0: LD(A); fence(); signal_set(pe=1, val, 1)
1: wait_until(val, 1); put(pe=0, A, newval)
I don't think this implies LD(A) is protected from reading newval, but unsure.

\item $A$ is an observable access issued by a fence-ordered operation (defined below) that is \textit{hb} ordered before $Fop$, $B$ is an observable access issued by an operation that is \textit{hb} ordered after $Fop$, and $A$ and $B$ target the same PE.
\end{itemize}

\textbf{fence-ordered operations}: blocking or nonblocking put, blocking or nonblocking atomic memory operation, blocking or nonblocking put\_signal.
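
As a non-normative illustration of the definitions above (buffer names and sizes are hypothetical):

\begin{verbatim}
PE 0:  shmem_put(dst_a, src_a, n, 1);   /* fence-ordered operation  */
       shmem_fence();                   /* Fop                      */
       shmem_put(dst_b, src_b, n, 1);   /* hb ordered after Fop     */
\end{verbatim}

The observable stores of the first put are \textit{rdo} ordered before those of the second put because both target PE 1; a put issued after the fence to a different PE would not be \textit{rdo} related to the first put's accesses.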
@joalsop (Author):

For symmetry, it might be worth defining fences to establish rdo between the remote accesses of all operations, including nonblocking gets. I currently don't see the benefit of preventing rdo ordering for the remote accesses of a nonblocking get (please chime in if you're aware of a pattern or implementation that would benefit). Therefore I suggest we update this to apply to all remote ops.

\item $api\_hb := (hb \cup ilv \cup lco \cup rdo \cup rco \cup asw)^+$
\end{itemize}
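
As a non-normative illustration, a single $api\_hb$ path may compose several of the relations above (names are hypothetical):

\begin{verbatim}
PE 0:  x = 1;                               /* W(x) */
       shmem_put(y, &v, 1, 1);
       shmem_fence();
       shmem_atomic_set(flag, 1, 1);
PE 1:  shmem_wait_until(flag, SHMEM_CMP_EQ, 1);
       r = *y;                              /* R(y) */
\end{verbatim}

Here \textit{hb} orders $W(x)$ before the put on PE 0, \textit{rdo} (via the fence) orders the put's remote store before the atomic set's store, \textit{asw} relates the atomic set's store to the wait\_until read, and \textit{hb} orders that read before $R(y)$ on PE 1; composing these in $(hb \cup \ldots)^+$ places $W(x)$ $api\_hb$ before $R(y)$.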

In addition, an \textbf{API data race} is defined to relate any two accesses $A$ and $B$ where all of the following are true:
@joalsop (Author), Feb 5, 2026:

TODO: define data races between synchronizing PL-level atomics and synchronizing shmem operation observable accesses (shmem can only synchronize with shmem atomics, PL can only synchronize with PL atomics)

edit: the above is completed

\end{itemize}
Many operations have no observable accesses (e.g., memory ordering routines, whose semantics are defined by the API-level relations described in Section~\ref{subsubsec:api_relations}).

\input{content/api_observable}
@joalsop (Author), Feb 5, 2026:

Currently, observable accesses are detailed in a large table in this section, but this might not be the best option. Possible changes:

  • Keep it as a table in this section, but figure out how to format it so it looks nicer
  • Include the observable accesses within the later sections describing each individual operation
  • Include as a table in an appendix

define data race to include racy API observable accesses
and normal PL-level accesses (including atomics).
If these race, behavior is undefined.
Under the PL-level C++ memory model, events consist of reads (\textit{R}), writes (\textit{W}), atomic read-modify-writes (\textit{RMW}, which perform both a read and write to memory), and fences (\textit{F}). Events that access memory may be atomic or non-atomic. Atomic events are associated with a memory order which can be acquire, release, acquire-release, or relaxed (we leave out the consume and sequentially consistent memory orders here for simplicity, but this does not impact how the PL-level semantics are extended for the API-level model). The following relations are defined over the events in a candidate execution, and will be used to derive API-level relations in Section~\ref{subsubsec:api_relations}:

\begin{itemize}
\item \textbf{Sequenced before (\textit{sb}):} Relates memory events from the same thread based on the order they are evaluated (also called program order in some contexts)
@joalsop (Author):

Would it be more readable to expand these wherever used? same goes for newly introduced relation abbreviations. Can still set these terms apart from less formal prose using italics.

fix a few instances of errors identified by reviewers

Second, since the \openshmem API does not define the underlying PL-level implementation, API operations do not resolve to a function consisting of PL-level events. Thus, these operations must be treated as a new type of primitive in the API-level memory model: the operation event. An operation event specifies its type, and each type defines the set of minimum observable accesses triggered by the operation (described next). Operation events participate in \textit{sb} and derived relations, similar to normal thread access events. Although operation events may be associated with separate memory access events (see Section~\ref{subsubsec:api_observable}), the operation events themselves do not participate directly in access-based relations like \textit{mo}, \textit{rf}, and \textit{sw}.

\subsubsection{Per-Operation Observable Accesses}\label{subsubsec:api_observable}
@joalsop (Author):

is there a better term for this? possibly "internal accesses"?

\item $acyclic(api\_hb)$: \textit{hb} cannot have any cycles.
\item $irreflexive(rf;api\_hb)$: \textit{hb} must be consistent with the \textit{rf} relation.
\item $irreflexive((rf^{-1})^{?};mo;rf^?;api\_hb)$: \textit{hb} must be consistent with the \textit{mo} order of writes, as well as any reads that read from writes in \textit{mo} order (often called the coherence requirement).
\item $acyclic(api\_hb)$: \textit{api\_hb} cannot have any cycles.
@markbrown314 commented Feb 17, 2026:

I cannot build the specification without the escape for the underscore in {api_hb}

{api_hb} -> {api\_hb}

@joalsop (Author):

thanks - clearly I neglected to test that last fix commit - will apply that shortly

\item $irreflexive(rf;api\_hb)$: \textit{hb} must be consistent with the \textit{rf} relation.
\item $irreflexive((rf^{-1})^{?};mo;rf^?;api\_hb)$: \textit{hb} must be consistent with the \textit{mo} order of writes, as well as any reads that read from writes in \textit{mo} order (often called the coherence requirement).
\item $acyclic(api\_hb)$: \textit{api\_hb} cannot have any cycles.
\item $irreflexive(rf;api\_hb)$: \textit{api\_hb} must be consistent with the \textit{rf} relation.
@markbrown314 commented Feb 17, 2026:

{api_hb} -> {api\_hb}

\item $irreflexive((rf^{-1})^{?};mo;rf^?;api\_hb)$: \textit{hb} must be consistent with the \textit{mo} order of writes, as well as any reads that read from writes in \textit{mo} order (often called the coherence requirement).
\item $acyclic(api\_hb)$: \textit{api\_hb} cannot have any cycles.
\item $irreflexive(rf;api\_hb)$: \textit{api\_hb} must be consistent with the \textit{rf} relation.
\item $irreflexive((rf^{-1})^{?};mo;rf^?;api\_hb)$: \textit{api\_hb} must be consistent with the \textit{mo} order of writes, as well as any reads that read from writes in \textit{mo} order (often called the coherence requirement).
@markbrown314 commented Feb 17, 2026:

{api_hb} -> {api\_hb}

@markbrown314 left a comment:

Added a comment about escapes for api_hb.

\item \textbf{Sequenced before (\textit{sb}):} Relates memory events from the same thread based on the order they are evaluated (also called program order in some contexts)
\item \textbf{Modification order (\textit{mo}):} A total ordering between all writes to the same address.
\item \textbf{Reads-from (\textit{rf}):} Relates a write \textit{W} and any read \textit{R} to the same address such that \textit{R} returns the value written by \textit{W}.
\item \textbf{Synchronizes-with (\textit{sw}):} Same as \textit{rf}, but also requires that \textit{W} and \textit{R} are atomic, \textit{W} has release or acquire-release memory order, and \textit{R} has acquire or acquire-release memory order (this relation is also called synchronization order in some contexts).
Contributor:

I know this term is used in C++, but it overloads the "synchronize" term in shmem; maybe a different term would be preferred, e.g., acquire-release pair?

Contributor:

In addition, there is no definition of the acquire-release memory orders; a short reminder may be useful.

@joalsop (Author):

I know this term is used in C++, but it overloads the "synchronize" term in shmem; maybe a different term would be preferred, e.g., acquire-release pair?

I see that "synchronize" is used in the current spec for defining bulk synchronize patterns (e.g., barriers), but I'm hesitant to redefine the existing teminology in C++ when describing the baseline memory model - I'm worried that will add confusion. I did not find the specific term "synchronize[s] with" anywhere in the existing spec - maybe we can explicitly note that this term refers to the C++ definition when used in this section and set it out with italics?


\subsubsection{PEs and Operation Events}\label{subsubsec:api_pes_operations}

When defining behavior at the \openshmem API level, we must first consider two key differences in program execution relative to a PL-level model. First, \openshmem applications span multiple PEs, each of which has its own local memory space. Therefore, the PL-level definition of memory address is extended to include both the PE-local address and the PE ID. Normal memory accesses can only access addresses on the PE local to the thread that issued them, which is constant for the life of a thread. \openshmem operations may trigger accesses to the local PE's memory and/or the memory of remote PEs. Dynamic memory accesses in an execution are only considered to access the same address if both the address and PE ID attributes match (impacting the definitions of \textit{mo} and \textit{rf}).
Contributor:

Normal memory accesses can only access addresses on the PE local to the thread that issued it, which is constant for the life of a thread. 

The thread may access all local memory (there is no notion of thread-local memory in shmem afaik), but coherency of accesses must be enforced by the PL-level memory model.

@joalsop (Author):

here I was just trying to precisely specify "local memory" - but maybe overly verbose - if it's precise enough to say "local memory", I can make that change

\subsubsubsection{Local Completion Order (lco)}
Local completion order relates local observable accesses from a blocking \openshmem operation and subsequent \textit{hb} ordered accesses and operations from the initiating PE. This ensures, for example, that a blocking put completes its accesses to the source buffer before a subsequent store from the initiating thread overwrites the data. Specifically, $(A, B) \in lco$ only if all of the following are true:
\begin{itemize}
\item $A$ is an observable access issued from a blocking operation event (i.e., an operation that does not have nbi in its name).
Contributor:

some operations are implicitly nonblocking (e.g., nonfetching atomics), so the statement about not having nbi in the name is imprecise (but helpful for comprehension).

@joalsop (Author):

non-fetching atomics don't access local memory, so lco does not apply anyway - are there any operations that are non-blocking, access local memory, and don't have nbi in the name?



Although \openshmem may be implemented in multiple programming languages, this section describes how \openshmem extends the C++ memory consistency model. C++ is chosen as an example because of its generality, and because it forms the basis for common GPU programming languages (CUDA, HIP). We describe how the guarantees described here can be adapted for GPU memory models in Section~\ref{subsubsec:api_scoped}.

\subsubsection{Baseline PL-level Axiomatic Memory Model (C++)}\label{subsubsec:api_pl_baseline}
@joalsop (Author), Feb 18, 2026:

Can explicitly cite C++ standard here - similar to in 9.6 page 55 in spec version 1.6. Should maybe more explicitly say this is not meant to be a substitute for the full specification.

We could also reference/base the memory model on C11 instead of C++ since the rest of the spec relies on this - however, C++ is useful because GPU languages are based on it. Curious what others think.

This fused operation and the associated observable accesses (if any) are considered to be executed collectively by all nodes in the synchronizing set of PEs, allowing us to establish ordering relations across these PEs (defined in Section~\ref{subsubsec:api_relations}).
This does not mean the observable accesses must occur atomically, and it does not constrain how the operation is implemented (i.e., any thread or PE in the team may perform each observable access).

\subsubsection{API-Level Relations}\label{subsubsec:api_relations}
@joalsop (Author):

It's been noted that currently the api-level relations are grouped based on the operation that generates them, but this leads to some non-intuitive implications (e.g., local completion order currently establishes ordering only for blocking ops, but local completion can also be established with a quiet). We could reorganize relations based on the type of ordering established (e.g., include quiet-established local completion as part of lco) or alternatively think of a different naming scheme...


\subsubsection{Per-Operation Observable Accesses}\label{subsubsec:api_observable}

For the purposes of the API memory model, each \openshmem operation type defines the minimum set of observable memory accesses performed by the operation. These accesses are considered to be issued by the associated operation event. These accesses must occur as part of the operation; however, there is no guarantee that they will be executed by the initiating thread, or that they will be the only accesses performed by the API operation implementation. Observable accesses have the same semantics as normal accesses and take part in the same relations, with one exception: there is no \textit{sb} relationship between observable accesses and the initiating thread of an API operation. Accesses performed by an implementation that are not included in the observable set must not modify data that may be observed from application code external to \openshmem operation implementations.
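
For illustration, an observable-access listing for a put, in the style of the table in this section (a hypothetical sketch; the normative accesses are those given in the table):

\begin{verbatim}
shmem_put(dest, source, nelems, pe):
    for i in 0 .. nelems-1:
        LD R1 <- source[i], my_pe
        ST R1 -> dest[i], pe
\end{verbatim}

Each LD is a local observable access on the initiating PE and each ST is a remote observable access on the target PE; as stated above, no \textit{sb} relation is implied between these accesses and the initiating thread.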
@gonzalobg commented Feb 19, 2026:

These accesses are considered to be "issued" by the associated [something is missing].

-text edits for clarity, consistency with the rest of the spec and
the c++ spec (e.g., address->memory location, brief prose description
of acquire/release semantics...)
-fix unescaped underscore bug

The C++ memory model is defined axiomatically in the C++ specification. Although many prior works aim to mathematize or improve this formalism, here we simply try to provide sufficient understanding to build upon it with the \openshmem API model.

Under the PL-level C++ memory model, events consist of reads (\textit{R}), writes (\textit{W}), atomic read-modify-writes (\textit{RMW}, which perform both a read and write to memory), and fences (\textit{F}). Events that access memory may be atomic or non-atomic. Atomic events are associated with a memory order which can be acquire, release, acquire-release, or relaxed (we leave out the consume and sequentially consistent memory orders here for simplicity, but this does not impact how the PL-level semantics are extended for the API-level model). The following relations are defined over the events in a candidate execution, and will be used to derive API-level relations in Section~\ref{subsubsec:api_relations}:

Comment:

What does PL-level mean?

@joalsop (Author):

the programming language-level C11 memory model

\begin{itemize}
\item $acyclic(hb)$: \textit{hb} cannot have any cycles.
\item $irreflexive(rf;hb)$: \textit{hb} must be consistent with the \textit{rf} relation.
\item $irreflexive((rf^{-1})^{?};mo;rf^?;hb)$: \textit{hb} must be consistent with the \textit{mo} order of writes, as well as any reads that read from writes in \textit{mo} order (often called the coherence requirement).

Comment:

It would likely be useful to know who is reading or writing. Are these relations always referring to different PEs or are they referring to different threads in the same PE or what? In the GPU case, are we covering relations between CPU threads and GPU threads? CPU-CPU only? GPU-GPU only?

@joalsop (Author):

as described here, this applies to any threads that are using the C11 memory model. However, a later section describes how it is applicable to scoped memory models like HIP, CUDA, HSA, etc., which may have CPU and/or GPU threads. These memory models are based on C11, but add scopes. The distinction of which devices are synchronizing impacts which scopes may be used for legal synchronization, but does not impact the semantics of the API-level model (any scoped synchronization will occur transparently within the implementation of the API call).


To precisely define what behavior the \openshmem interface guarantees for both users and implementers, we formally define an API-level memory model. An API-level model builds upon a programming language-level (PL-level) memory model, and it defines how API operations impact memory. This is useful for determining the ordering and visibility requirements for \openshmem memory accesses when implemented on any device, but it is of particular importance for devices like GPUs that have software managed caches requiring expensive bulk flush or invalidate actions to enforce memory ordering guarantees. Without a precisely specified memory model, API implementers and users may disagree on what behavior is guaranteed by the API, and this can lead to poor performance (too many bulk cache actions) or buggy code (too few bulk cache actions).

Although \openshmem may be implemented in multiple programming languages, this section describes how \openshmem extends the C++ memory consistency model. C++ is chosen as an example because of its generality, and because it forms the basis for common GPU programming languages (CUDA, HIP). We describe how the guarantees described here can be adapted for GPU memory models in Section~\ref{subsubsec:api_scoped}.

Comment:

Maybe either mention all the GPU languages (SYCL, Vulkan, OpenCL) or none of them, or just say "for example"

ST R1 -> fetch, my_pe
\end{verbatim}
\tabularnewline\cline{2-2}
& \begin{verbatim}shmem_atomic_swap_nbi(fetch, dest, value,

Comment:

Does swap_nbi really have a cond argument?

@joalsop (Author):

nope - thanks for catching that

\item The operation event that issues $A$ is \textit{hb} ordered before $B$, or is \textit{hb} ordered before an operation event that issues $B$.
\end{itemize}

\subsubsubsection{Remote Delivery Order (rdo)}

Comment:

rdo doesn't apply to put_signal? Or is that defined elsewhere. I naively think put, fence, put is equivalent to put_signal

@joalsop (Author):

currently the ordering between the put and the signal in put_signal is defined as an sb relationship between every put access and the final signal store (see the conditions for sb in the observable accesses section) - so it will be slightly different from put; fence(); signal_set in that it only establishes ordering between the accesses of the put_signal operation, and doesn't establish any ordering between all prior and successive accesses like a normal fence (e.g., if you have put, put, put_signal, only the final put will have guaranteed ordering with the signal op). However, this was one area where the existing spec was not clear - I'm open to defining it differently.
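
A sketch of the contrast described above (informal notation as used elsewhere in this thread; A, B, C, and sig are hypothetical):

0: put(pe=1, A); put(pe=1, B); put_signal(pe=1, C, sig)
   (only C's accesses within put_signal are sb ordered before the sig store;
    A and B have no guaranteed ordering w.r.t. sig)

0: put(pe=1, A); put(pe=1, B); fence(); put_signal(pe=1, C, sig)
   (the fence additionally establishes rdo from A's and B's accesses to the
    accesses issued after the fence)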

\item $A$ is an observable access issued by a fence-ordered operation (defined below) that is \textit{hb} ordered before $Fop$, $B$ is an observable access issued by an operation that is \textit{hb} ordered after $Fop$, and $A$ and $B$ target the same PE.
\end{itemize}

\textbf{fence-ordered operations}: blocking or nonblocking put, blocking or nonblocking atomic memory operation, blocking or nonblocking put\_signal.

Comment:

In GPU implementations, there are API calls made on the GPU and API calls made on the host, and they are not completely ordered w.r.t. each other. Often, ordering is established only at kernel boundaries. I think it is technically feasible to have a GPU-initiated FENCE affect host behavior in current implementations, but I am not sure that a host FENCE can affect GPU behavior.
There are proposals to make host-initiated and GPU-initiated be different PEs, or sub-PEs or something, but this is not resolved.

@joalsop (Author):

The intent here is to allow a fence on a GPU to order operations on the CPU if they have established hb order between the threads in some way, and they use the same context. this may be valuable at some point - but if a programmer doesn't want that, then using separate contexts or separate PEs as you say would be a way to get around it I think.


\subsubsubsection{API Synchronizes-With (asw)}\label{subsubsubsec:api_relations_asw}

API synchronizes-with is the API-level analog of the PL-level \textit{sw} relation. It relates observable accesses from legally synchronizing \openshmem operations that form a \textit{rf} relationship. Specifically, $(A, B) \in asw$ only if $(A, B) \in rf$ and any of the following are true:

Comment:

This is the only occurrence of "rf$" I can find. What should it be? I also see one asw$ so this is some markup thing I don't understand?

@joalsop (Author):

this refers to the reads-from relation in the underlying PL-level model - the $ is just part of the math formatting in LaTeX

\item $B$ is an observable write or read-modify-write access issued by a synchronizing \openshmem operation.
\end{itemize}

In the above, "synchronizing \openshmem operations" are defined as any operation from the following categories: atomic memory operations, signaling operations, point-to-point synchronization routings, or distributed locking routines.

Comment:

Do you mean to exclude shmem_sync here? or quiet + sync? or barrier?

@joalsop (Author):

Yes - these do not have observable accesses - so they can't form a synchronizing relationship as defined here

\item $acyclic(api\_hb)$: \textit{api\_hb} cannot have any cycles.
\item $irreflexive(rf;api\_hb)$: \textit{api\_hb} must be consistent with the \textit{rf} relation.
\item $irreflexive((rf^{-1})^{?};mo;rf^?;api\_hb)$: \textit{api\_hb} must be consistent with the \textit{mo} order of writes, as well as any reads that read from writes in \textit{mo} order (often called the coherence requirement).
\item $irreflexive((rf;(sb^{amo\_op})^?) \cup (mo;mo;(sb^{amo\_op})^?;rf^{-1}) \cup (mo;rf;(sb^{amo\_op})^?))$, where $sb^{amo\_op}$ relates \textit{sb}-ordered observable accesses issued by the same \openshmem atomic memory operation: The read of an atomic read-modify-write access must always read the last value (in the modification order) written before the write of the read-modify-write access. Similarly, the observable read of an \openshmem atomic memory operation must always read the last value (in the modification order) written before the observable write of the atomic memory operation.

According to the spec, shmem atomics are only atomic w.r.t. other shmem atomics. This statement is far stricter, and doesn't match up with implementations using NIC atomics.

Author

Currently we avoid interactions/guarantees of atomicity between shmem atomics and "normal" PL-level atomics by defining mixed concurrent accesses as a data race in the API-level model. That is, if shmem atomics and normal atomics race, that forms a data race. Does that address this concern? This is an area I'm less familiar with, so I'm open to alternative solutions.

My understanding is that SHMEM users really should not mix atomic and non atomic references to the same addresses.
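Since the axioms above are all stated as acyclicity or irreflexivity constraints over composed relations, they can be checked mechanically on a small candidate execution. The following sketch shows the first two axioms as executable checks; the event names and example relations are illustrative assumptions, not drawn from the spec text.

```python
def compose(r, s):
    """Relational composition r;s over relations given as sets of pairs."""
    return {(a, c) for (a, b1) in r for (b2, c) in s if b1 == b2}

def irreflexive(r):
    """True iff no element is related to itself."""
    return all(a != b for (a, b) in r)

def transitive_closure(r):
    """Smallest transitive relation containing r (fixpoint iteration)."""
    closure = set(r)
    while True:
        bigger = closure | compose(closure, closure)
        if bigger == closure:
            return closure
        closure = bigger

def acyclic(r):
    # A relation is acyclic iff its transitive closure is irreflexive.
    return irreflexive(transitive_closure(r))
```

For example, with a hypothetical execution where `api_hb = {("W1", "F"), ("F", "W2")}` and `rf = {("W2", "R1")}`, both `acyclic(api_hb)` and `irreflexive(compose(rf, api_hb))` hold; adding an edge `("R1", "W2")` to `api_hb` would violate the second axiom, since R1 would happen before the write it reads from.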

return R1
\end{verbatim}
\tabularnewline\cline{2-2}
& \begin{verbatim}shmem_atomic_swap(dest, cond, value, pe):

Doesn't have cond

Use C11 as the PL-level basis for the formal API-level model.
This is consistent with the rest of the document, and differs
from the C++ model in no relevant ways.
Minor fixes to observable accesses, grammar, and
formatting for consistency.

To precisely define what behavior the \openshmem interface guarantees for both users and implementers, we formally define an API-level memory model. An API-level model builds upon a programming language-level (PL-level) memory model, and it defines how API operations impact memory. This is useful for determining the ordering and visibility requirements for \openshmem memory accesses when implemented on any device, but it is of particular importance for devices like GPUs that have software-managed caches requiring expensive bulk flush or invalidate actions to enforce memory ordering guarantees. Without a precisely specified memory model, API implementers and users may disagree on what behavior is guaranteed by the API, which can lead to poor performance (too many bulk cache actions) or buggy code (too few bulk cache actions).

This section describes how \openshmem extends the C11 memory consistency model. C11 is chosen as an example because of its generality and ubiquity in the CPU domain, and because it forms the basis for common GPU programming languages (for example, CUDA and HIP). Section~\ref{subsubsec:api_pl_baseline} summarizes the baseline C11 memory model, Sections~\ref{subsubsec:api_model_overview}--\ref{subsubsec:api_integrate} describe how this PL-level model is extended to provide API-level semantics for \openshmem, and Section~\ref{subsubsec:api_scoped} describes how the guarantees of the API-level model can be adapted for GPU memory models.
Collaborator

This section -> This Section ?


\subsubsection{API-Level Memory Model Overview}\label{subsubsec:api_model_overview}

The \openshmem API-level memory model extends PL-level memory models like C11 to define how \openshmem operations interact with memory. While PL-level models operate on fine-grain loads, stores, and fences, the API-level memory model must additionally operate on coarse-grain API operations like get, put, barrier, etc. that can execute asynchronously and involve multiple PEs.
Collaborator

Aren't there macros for get / put ? also PEs -> \acp{PE}
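One way to see the coarse-grain/fine-grain distinction in the overview paragraph above is to model each API operation as the set of fine-grain observable accesses it issues. The sketch below does this for a blocking put; the decomposition (a local read of the source buffer plus a remote write of the destination buffer) and all field names are illustrative assumptions, not spec text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ObservableAccess:
    pe: int      # PE whose memory is accessed
    addr: str    # symbolic address (illustrative)
    kind: str    # "read" or "write"
    op: str      # issuing API operation

def put_accesses(src_pe, dest_pe, src, dest):
    """Model a blocking put as its observable accesses: a read of the
    source buffer on the issuing PE and a write of the destination
    buffer on the remote PE.  (Assumed decomposition for illustration;
    the model's relations then range over accesses like these rather
    than over whole API operations.)"""
    return [
        ObservableAccess(src_pe, src, "read", "shmem_put"),
        ObservableAccess(dest_pe, dest, "write", "shmem_put"),
    ]
```

Under this view, relations such as \textit{rf} and \textit{api\_hb} are defined over these fine-grain accesses, which is what lets the API-level model reuse the machinery of a PL-level model built for loads and stores.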
