
Add formal API-level memory consistency model #570

Open
joalsop wants to merge 7 commits into openshmem-org:main from joalsop:joalsop/formal_mcm

Conversation


@joalsop joalsop commented Jan 26, 2026

Add a section formally defining the memory ordering guaranteed by the OpenSHMEM API. This formalization section describes how a formal PL-level memory model such as C++'s can be extended to define the API-level memory semantics provided by OpenSHMEM. This adds clarity and reduces ambiguity about how OpenSHMEM interacts with the memory system, which is increasingly important as the API is implemented on accelerators with software-managed coherence.

Summary of changes

Proposal Checklist

  • Link to issue(s)
  • Changelog entry
  • Reviewed for changes to front matter
  • Reviewed for changes to back matter

fix some formatting and grammar errors

\textbf{Axioms}

\begin{itemize}
@joalsop (Author), Feb 5, 2026:

These axioms are drawn from the axiomatic formalization of the C++ model as described by Batty et al. in "Overhauling SC Atomics in C11 and OpenCL", 2016 (https://johnwickerson.github.io/papers/openclmm.pdf) - not sure if citation is warranted, and if so, how best to cite.


Remote delivery order relates accesses separated by an \openshmem fence operation. However, this relation does not establish full connectivity; it only forms a connection between observable accesses to the same PE from operations \textit{hb} ordered before and after the fence, and between \textit{hb} ordered accesses before a fence and observable accesses from operations \textit{hb} ordered after a fence. Specifically, $(A, B) \in rdo$ only if there is an \openshmem fence operation $Fop$ and any of the following conditions are true:
\begin{itemize}
\item $A$ is a memory access that is \textit{hb} ordered before $Fop$, $B$ is an observable access issued by an operation that is \textit{hb} ordered after $Fop$, and the context associated with $Fop$ does NOT have the $OPENSHMEM\_CTX\_NOSTORE$ option enabled.
@joalsop (Author):

Note: As currently defined, a fence is required to establish "release" ordering semantics for a synchronizing remote write (e.g., signal_set), established via rdo. In contrast, "acquire" ordering semantics are implicit for synchronizing local reads (e.g., wait_until), established via lco.
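A sketch of this pattern (informal notation as used elsewhere in this thread; `data` and `val` are hypothetical symmetric objects), if I read the draft relations correctly:

0: put(pe=1, data); fence(); signal_set(pe=1, val, 1)
1: wait_until(val, 1); LD(data)

Here the fence establishes rdo between the put's remote store and the signal_set store (the "release" side), while wait_until provides the "acquire" side implicitly via lco, so PE 1's LD(data) should be guaranteed to observe the value written by PE 0's put. Without the fence on PE 0, no such guarantee would exist.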


\textbf{qb-ordered operations}: blocking or nonblocking put, nonblocking get, blocking or nonblocking atomic memory operation, blocking or nonblocking put\_signal.

\subsubsubsection{API Synchronizes-With (asw)}
@joalsop (Author):

Note: As currently defined, asw is only established between observable operation accesses. Thus, we can't use PL-level local atomics to establish happens-before ordering between PEs (i.e., a signal_set() cannot synchronize with an atomicLD(memory_order=acquire)).


Remote delivery order relates accesses separated by an \openshmem fence operation. However, this relation does not establish full connectivity; it only forms a connection between observable accesses to the same PE from operations \textit{hb} ordered before and after the fence, and between \textit{hb} ordered accesses before a fence and observable accesses from operations \textit{hb} ordered after a fence. Specifically, $(A, B) \in rdo$ only if there is an \openshmem fence operation $Fop$ and any of the following conditions are true:
\begin{itemize}
\item $A$ is a memory access that is \textit{hb} ordered before $Fop$, $B$ is an observable access issued by an operation that is \textit{hb} ordered after $Fop$, and the context associated with $Fop$ does NOT have the $OPENSHMEM\_CTX\_NOSTORE$ option enabled.
@joalsop (Author):

Not specified in the spec, but in order to enable sensible release-style semantics for synchronizing openshmem remote stores, fences should order all prior memory accesses (not just stores)

Contributor:

I don't think that is an expected behavior; in this case one should use barrier_all (if I understand the case you discuss correctly), e.g.:
0: LD(A); fence(); signal_set(pe=1, val, 1)
1: wait_until(val, 1); put(pe=0, A, newval)
I don't think this implies LD(A) is protected from reading newval, but unsure.

\item $A$ is an observable access issued by a fence-ordered operation (defined below) that is \textit{hb} ordered before $Fop$, $B$ is an observable access issued by an operation that is \textit{hb} ordered after $Fop$, and $A$ and $B$ target the same PE.
\end{itemize}

\textbf{fence-ordered operations}: blocking or nonblocking put, blocking or nonblocking atomic memory operation, blocking or nonblocking put\_signal.
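
As a non-normative illustration of the definitions above (buffer names and sizes are hypothetical):

\begin{verbatim}
PE 0:  shmem_put(dst_a, src_a, n, 1);   /* fence-ordered operation  */
       shmem_fence();                   /* Fop                      */
       shmem_put(dst_b, src_b, n, 1);   /* hb ordered after Fop     */
\end{verbatim}

The observable stores of the first put are \textit{rdo} ordered before those of the second put because both target PE 1; a put issued after the fence to a different PE would not be \textit{rdo} related to the first put's accesses.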
@joalsop (Author):

For symmetry, it might be worth defining fences to establish rdo between the remote accesses of all operations, including nonblocking gets. I currently don't see the benefit of preventing rdo ordering for the remote accesses of a nonblocking get (please chime in if you're aware of a pattern or implementation that would benefit). Therefore I suggest we update this to apply to all remote ops.

\item $api\_hb := (hb \cup ilv \cup lco \cup rdo \cup rco \cup asw)^+$
\end{itemize}
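
As a non-normative illustration, a single $api\_hb$ path may compose several of the relations above (names are hypothetical):

\begin{verbatim}
PE 0:  x = 1;                               /* W(x) */
       shmem_put(y, &v, 1, 1);
       shmem_fence();
       shmem_atomic_set(flag, 1, 1);
PE 1:  shmem_wait_until(flag, SHMEM_CMP_EQ, 1);
       r = *y;                              /* R(y) */
\end{verbatim}

Here \textit{hb} orders $W(x)$ before the put on PE 0, \textit{rdo} (via the fence) orders the put's remote store before the atomic set's store, \textit{asw} relates the atomic set's store to the wait\_until read, and \textit{hb} orders that read before $R(y)$ on PE 1; composing these in $(hb \cup \ldots)^+$ places $W(x)$ $api\_hb$ before $R(y)$.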

In addition, an \textbf{API data race} is defined to relate any two accesses $A$ and $B$ where all of the following are true:
@joalsop (Author), Feb 5, 2026:

TODO: define data races between synchronizing PL-level atomics and synchronizing shmem operation observable accesses (shmem can only synchronize with shmem atomics, PL can only synchronize with PL atomics)

edit: the above is completed

\end{itemize}
Many operations have no observable accesses (e.g., memory ordering routines, whose semantics are defined by the API-level relations described in Section~\ref{subsubsec:api_relations}).

\input{content/api_observable}
@joalsop (Author), Feb 5, 2026:

Currently, observable accesses are detailed in a large table in this section, but this might not be the best option. Possible changes:

  • Keep it as a table in this section, but figure out how to format it so it looks nicer
  • Include the observable accesses within the later sections describing each individual operation
  • Include as a table in an appendix

define data race to include racy API observable accesses
and normal PL-level accesses (including atomics).
If these race, behavior is undefined.
Under the PL-level C++ memory model, events consist of reads (\textit{R}), writes (\textit{W}), atomic read-modify-writes (\textit{RMW}, which perform both a read and write to memory), and fences (\textit{F}). Events that access memory may be atomic or non-atomic. Atomic events are associated with a memory order which can be acquire, release, acquire-release, or relaxed (we leave out the consume and sequentially consistent memory orders here for simplicity, but this does not impact how the PL-level semantics are extended for the API-level model). The following relations are defined over the events in a candidate execution, and will be used to derive API-level relations in Section~\ref{subsubsec:api_relations}:

\begin{itemize}
\item \textbf{Sequenced before (\textit{sb}):} Relates memory events from the same thread based on the order they are evaluated (also called program order in some contexts)
@joalsop (Author):

Would it be more readable to expand these wherever used? same goes for newly introduced relation abbreviations. Can still set these terms apart from less formal prose using italics.

fix a few instances of errors identified by reviewers

Second, since the \openshmem API does not define the underlying PL-level implementation, API operations do not resolve to a function consisting of PL-level events. Thus, these operations must be treated as a new type of primitive in the API-level memory model: the operation event. An operation event specifies its type, and each type defines the set of minimum observable accesses triggered by the operation (described next). Operation events participate in \textit{sb} and derived relations, similar to normal thread access events. Although operation events may be associated with separate memory access events (see Section~\ref{subsubsec:api_observable}), the operation events themselves do not participate directly in access-based relations like \textit{mo}, \textit{rf}, and \textit{sw}.

\subsubsection{Per-Operation Observable Accesses}\label{subsubsec:api_observable}
@joalsop (Author):

is there a better term for this? possibly "internal accesses"?

\item $acyclic(api\_hb)$: \textit{hb} cannot have any cycles.
\item $irreflexive(rf;api\_hb)$: \textit{hb} must be consistent with the \textit{rf} relation.
\item $irreflexive((rf^{-1})^{?};mo;rf^?;api\_hb)$: \textit{hb} must be consistent with the \textit{mo} order of writes, as well as any reads that read from writes in \textit{mo} order (often called the coherence requirement).
\item $acyclic(api\_hb)$: \textit{api\_hb} cannot have any cycles.
@markbrown314 commented Feb 17, 2026:

I cannot build the specification without the escape for the underscore in {api_hb}

{api_hb} -> {api\_hb}

@joalsop (Author):

thanks - clearly I neglected to test that last fix commit - will apply that shortly

\item $irreflexive(rf;api\_hb)$: \textit{hb} must be consistent with the \textit{rf} relation.
\item $irreflexive((rf^{-1})^{?};mo;rf^?;api\_hb)$: \textit{hb} must be consistent with the \textit{mo} order of writes, as well as any reads that read from writes in \textit{mo} order (often called the coherence requirement).
\item $acyclic(api\_hb)$: \textit{api\_hb} cannot have any cycles.
\item $irreflexive(rf;api\_hb)$: \textit{api\_hb} must be consistent with the \textit{rf} relation.
@markbrown314 commented Feb 17, 2026:

{api_hb} -> {api\_hb}

\item $irreflexive((rf^{-1})^{?};mo;rf^?;api\_hb)$: \textit{hb} must be consistent with the \textit{mo} order of writes, as well as any reads that read from writes in \textit{mo} order (often called the coherence requirement).
\item $acyclic(api\_hb)$: \textit{api\_hb} cannot have any cycles.
\item $irreflexive(rf;api\_hb)$: \textit{api\_hb} must be consistent with the \textit{rf} relation.
\item $irreflexive((rf^{-1})^{?};mo;rf^?;api\_hb)$: \textit{api\_hb} must be consistent with the \textit{mo} order of writes, as well as any reads that read from writes in \textit{mo} order (often called the coherence requirement).
@markbrown314 commented Feb 17, 2026:

{api_hb} -> {api\_hb}

@markbrown314 left a comment:

Added a comment about escapes for api_hb.

\item \textbf{Sequenced before (\textit{sb}):} Relates memory events from the same thread based on the order they are evaluated (also called program order in some contexts)
\item \textbf{Modification order (\textit{mo}):} A total ordering between all writes to the same address.
\item \textbf{Reads-from (\textit{rf}):} Relates a write \textit{W} and any read \textit{R} to the same address such that \textit{R} returns the value written by \textit{W}.
\item \textbf{Synchronizes-with (\textit{sw}):} Same as \textit{rf}, but also requires that \textit{W} and \textit{R} are atomic, \textit{W} has release or acquire-release memory order, and \textit{R} has acquire or acquire-release memory order (this relation is also called synchronization order in some contexts).
Contributor:

I know this term is used in C++, but it overloads the "synchronize" term in shmem; maybe a different term would be preferred, e.g., acquire-release pair?

Contributor:

In addition, there is no definition of the acquire-release memory orders; a short reminder may be useful.

@joalsop (Author):

I know this term is used in C++, but it overloads the "synchronize" term in shmem; maybe a different term would be preferred, e.g., acquire-release pair?

I see that "synchronize" is used in the current spec for defining bulk synchronize patterns (e.g., barriers), but I'm hesitant to redefine the existing teminology in C++ when describing the baseline memory model - I'm worried that will add confusion. I did not find the specific term "synchronize[s] with" anywhere in the existing spec - maybe we can explicitly note that this term refers to the C++ definition when used in this section and set it out with italics?


\subsubsection{PEs and Operation Events}\label{subsubsec:api_pes_operations}

When defining behavior at the \openshmem API level, we must first consider two key differences in program execution relative to a PL-level model. First, \openshmem applications span multiple PEs, each of which has its own local memory space. Therefore, the PL-level definition of memory address is extended to include both the PE-local address and the PE ID. Normal memory accesses can only access addresses on the PE local to the thread that issued them, which is constant for the life of a thread. \openshmem operations may trigger accesses to the local PE's memory and/or the memory of remote PEs. Dynamic memory accesses in an execution are only considered to access the same address if both the address and PE ID attributes match (impacting the definitions of \textit{mo} and \textit{rf}).
Contributor:

Normal memory accesses can only access addresses on the PE local to the thread that issued it, which is constant for the life of a thread. 

The thread may access all local memory (there is no notion of thread-local memory in shmem afaik), but coherency of accesses must be enforced by the PL-level memory model.

@joalsop (Author):

here I was just trying to precisely specify "local memory" - but maybe overly verbose - if it's precise enough to say "local memory", I can make that change

\subsubsubsection{Local Completion Order (lco)}
Local completion order relates local observable accesses from a blocking \openshmem operation and subsequent \textit{hb} ordered accesses and operations from the initiating PE. This ensures, for example, that a blocking put completes its accesses to the source buffer before a subsequent store from the initiating thread overwrites the data. Specifically, $(A, B) \in lco$ only if all of the following are true:
\begin{itemize}
\item $A$ is an observable access issued from a blocking operation event (i.e., an operation that does not have nbi in its name).
Contributor:

some operations are implicitly nonblocking (e.g., nonfetching atomics), so the statement about not having nbi in the name is imprecise (but helpful for comprehension).

@joalsop (Author):

non-fetching atomics don't access local memory, so lco does not apply anyway - are there any operations that are non-blocking, access local memory, and don't have nbi in the name?



Although \openshmem may be implemented in multiple programming languages, this section describes how \openshmem extends the C++ memory consistency model. C++ is chosen as an example because of its generality, and because it forms the basis for common GPU programming languages (CUDA, HIP). We describe how the guarantees described here can be adapted for GPU memory models in Section~\ref{subsubsec:api_scoped}.

\subsubsection{Baseline PL-level Axiomatic Memory Model (C++)}\label{subsubsec:api_pl_baseline}
@joalsop (Author), Feb 18, 2026:

Can explicitly cite C++ standard here - similar to in 9.6 page 55 in spec version 1.6. Should maybe more explicitly say this is not meant to be a substitute for the full specification.

We could also reference/base the memory model on C11 instead of C++ since the rest of the spec relies on this - however, C++ is useful because GPU languages are based on it. Curious what others think.

This fused operation and the associated observable accesses (if any) are considered to be executed collectively by all nodes in the synchronizing set of PEs, allowing us to establish ordering relations across these PEs (defined in Section~\ref{subsubsec:api_relations}).
This does not mean the observable accesses must occur atomically, and it does not constrain how the operation is implemented (i.e., any thread or PE in the team may perform each observable access).

\subsubsection{API-Level Relations}\label{subsubsec:api_relations}
@joalsop (Author):

It's been noted that currently the api-level relations are grouped based on the operation that generates them, but this leads to some non-intuitive implications (e.g., local completion order currently establishes ordering only for blocking ops, but local completion can also be established with a quiet). We could reorganize relations based on the type of ordering established (e.g., include quiet-established local completion as part of lco) or alternatively think of a different naming scheme...


\subsubsection{Per-Operation Observable Accesses}\label{subsubsec:api_observable}

For the purposes of the API memory model, each \openshmem operation type defines the minimum set of observable memory accesses performed by the operation. These accesses are considered to be issued by the associated operation event. These accesses must occur as part of the operation; however, there is no guarantee that they will be executed by the initiating thread, or that they will be the only accesses performed by the API operation implementation. Observable accesses have the same semantics as normal accesses and take part in the same relations, with one exception: there is no \textit{sb} relationship between observable accesses and the initiating thread of an API operation. Accesses performed by an implementation that are not included in the observable set must not modify data that may be observed from application code external to \openshmem operation implementations.
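
For illustration, an observable-access listing for a put, in the style of the table in this section (a hypothetical sketch; the normative accesses are those given in the table):

\begin{verbatim}
shmem_put(dest, source, nelems, pe):
    for i in 0 .. nelems-1:
        LD R1 <- source[i], my_pe
        ST R1 -> dest[i], pe
\end{verbatim}

Each LD is a local observable access on the initiating PE and each ST is a remote observable access on the target PE; as stated above, no \textit{sb} relation is implied between these accesses and the initiating thread.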
@gonzalobg commented Feb 19, 2026:

These accesses are considered to be "issued" by the associated [something is missing].

-text edits for clarity, consistency with the rest of the spec and
the c++ spec (e.g., address->memory location, brief prose description
of acquire/release semantics...)
-fix unescaped underscore bug

The C++ memory model is defined axiomatically in the C++ specification. Although many prior works aim to mathematize or improve this formalism, here we simply try to provide sufficient understanding to build upon it with the \openshmem API model.

Under the PL-level C++ memory model, events consist of reads (\textit{R}), writes (\textit{W}), atomic read-modify-writes (\textit{RMW}, which perform both a read and write to memory), and fences (\textit{F}). Events that access memory may be atomic or non-atomic. Atomic events are associated with a memory order which can be acquire, release, acquire-release, or relaxed (we leave out the consume and sequentially consistent memory orders here for simplicity, but this does not impact how the PL-level semantics are extended for the API-level model). The following relations are defined over the events in a candidate execution, and will be used to derive API-level relations in Section~\ref{subsubsec:api_relations}:

Comment:

What does PL-level mean?

@joalsop (Author):

the programming language-level C11 memory model

\begin{itemize}
\item $acyclic(hb)$: \textit{hb} cannot have any cycles.
\item $irreflexive(rf;hb)$: \textit{hb} must be consistent with the \textit{rf} relation.
\item $irreflexive((rf^{-1})^{?};mo;rf^?;hb)$: \textit{hb} must be consistent with the \textit{mo} order of writes, as well as any reads that read from writes in \textit{mo} order (often called the coherence requirement).

Comment:

It would likely be useful to know who is reading or writing. Are these relations always referring to different PEs or are they referring to different threads in the same PE or what? In the GPU case, are we covering relations between CPU threads and GPU threads? CPU-CPU only? GPU-GPU only?

@joalsop (Author):

as described here, this applies to any threads that are using the C11 memory model. However, a later section describes how it is applicable to scoped memory models like HIP, CUDA, HSA, etc., which may have CPU and/or GPU threads. These memory models are based on C11, but add scopes. The distinction of which devices are synchronizing impacts which scopes may be used for legal synchronization, but does not impact the semantics of the API-level model (any scoped synchronization will occur transparently within the implementation of the API call).


To precisely define what behavior the \openshmem interface guarantees for both users and implementers, we formally define an API-level memory model. An API-level model builds upon a programming language-level (PL-level) memory model, and it defines how API operations impact memory. This is useful for determining the ordering and visibility requirements for \openshmem memory accesses when implemented on any device, but it is of particular importance for devices like GPUs that have software managed caches requiring expensive bulk flush or invalidate actions to enforce memory ordering guarantees. Without a precisely specified memory model, API implementers and users may disagree on what behavior is guaranteed by the API, and this can lead to poor performance (too many bulk cache actions) or buggy code (too few bulk cache actions).

Although \openshmem may be implemented in multiple programming languages, this section describes how \openshmem extends the C++ memory consistency model. C++ is chosen as an example because of its generality, and because it forms the basis for common GPU programming languages (CUDA, HIP). We describe how the guarantees described here can be adapted for GPU memory models in Section~\ref{subsubsec:api_scoped}.

Comment:

Maybe either mention all the GPU languages (SYCL, Vulkan, OpenCL) or none of them, or just say "for example"

ST R1 -> fetch, my_pe
\end{verbatim}
\tabularnewline\cline{2-2}
& \begin{verbatim}shmem_atomic_swap_nbi(fetch, dest, value,

Comment:

Does swap_nbi really have a cond argument?

@joalsop (Author):

nope - thanks for catching that

\item The operation event that issues $A$ is \textit{hb} ordered before $B$, or is \textit{hb} ordered before an operation event that issues $B$.
\end{itemize}

\subsubsubsection{Remote Delivery Order (rdo)}

Comment:

rdo doesn't apply to put_signal? Or is that defined elsewhere. I naively think put, fence, put is equivalent to put_signal

@joalsop (Author):

currently the ordering between the put and the signal in put_signal is defined as an sb relationship between every put access and the final signal store (see the conditions for sb in the observable accesses section) - so it will be slightly different from put; fence(); signal_set in that it only establishes ordering between the accesses of the put_signal operation, and doesn't establish any ordering between all prior and successive accesses like a normal fence (e.g., if you have put, put, put_signal, only the final put will have guaranteed ordering with the signal op). However, this was one area where the existing spec was not clear - I'm open to defining it differently.
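
A sketch of the contrast described above (informal notation as used elsewhere in this thread; A, B, C, and sig are hypothetical):

0: put(pe=1, A); put(pe=1, B); put_signal(pe=1, C, sig)
   (only C's accesses within put_signal are sb ordered before the sig store;
    A and B have no guaranteed ordering w.r.t. sig)

0: put(pe=1, A); put(pe=1, B); fence(); put_signal(pe=1, C, sig)
   (the fence additionally establishes rdo from A's and B's accesses to the
    accesses issued after the fence)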

\item $A$ is an observable access issued by a fence-ordered operation (defined below) that is \textit{hb} ordered before $Fop$, $B$ is an observable access issued by an operation that is \textit{hb} ordered after $Fop$, and $A$ and $B$ target the same PE.
\end{itemize}

\textbf{fence-ordered operations}: blocking or nonblocking put, blocking or nonblocking atomic memory operation, blocking or nonblocking put\_signal.

Comment:

In GPU implementations, there are API calls made on the GPU and API calls made on the host, and they are not completely ordered w.r.t. each other. Often, ordering is established only at kernel boundaries. I think it is technically feasible to have a GPU-initiated FENCE affect host behavior in current implementations, but I am not sure that a host FENCE can affect GPU behavior.
There are proposals to make host-initiated and GPU-initiated be different PEs, or sub-PEs or something, but this is not resolved.

@joalsop (Author):

The intent here is to allow a fence on a GPU to order operations on the CPU if they have established hb order between the threads in some way, and they use the same context. this may be valuable at some point - but if a programmer doesn't want that, then using separate contexts or separate PEs as you say would be a way to get around it I think.


\subsubsubsection{API Synchronizes-With (asw)}\label{subsubsubsec:api_relations_asw}

API synchronizes-with is the API-level analog of the PL-level \textit{sw} relation. It relates observable accesses from legally synchronizing \openshmem operations that form a \textit{rf} relationship. Specifically, $(A, B) \in asw$ only if $(A, B) \in rf$ and any of the following are true:

Comment:

This is the only occurrence of "rf$" I can find. What should it be? I also see one asw$ so this is some markup thing I don't understand?

@joalsop (Author):

this refers to the reads-from relation in the underlying PL-level model - the $ is just part of the math formatting in LaTeX

\item $B$ is an observable write or read-modify-write access issued by a synchronizing \openshmem operation.
\end{itemize}

In the above, "synchronizing \openshmem operations" are defined as any operation from the following categories: atomic memory operations, signaling operations, point-to-point synchronization routings, or distributed locking routines.

Comment:

Do you mean to exclude shmem_sync here? or quiet + sync? or barrier?

@joalsop (Author):

Yes - these do not have observable accesses - so they can't form a synchronizing relationship as defined here

\item $acyclic(api\_hb)$: \textit{api\_hb} cannot have any cycles.
\item $irreflexive(rf;api\_hb)$: \textit{api\_hb} must be consistent with the \textit{rf} relation.
\item $irreflexive((rf^{-1})^{?};mo;rf^?;api\_hb)$: \textit{api\_hb} must be consistent with the \textit{mo} order of writes, as well as any reads that read from writes in \textit{mo} order (often called the coherence requirement).
\item $irreflexive((rf;(sb^{amo\_op})^?) \cup (mo;mo;(sb^{amo\_op})^?;rf^{-1}) \cup (mo;rf;(sb^{amo\_op})^?))$, where $sb^{amo\_op}$ relates \textit{sb}-ordered observable accesses issued by the same \openshmem atomic memory operation: The read of an atomic read-modify-write access must always read the last value (in the modification order) written before the write of the read-modify-write access. Similarly, the observable read of an \openshmem atomic memory operation must always read the last value (in the modification order) written before the observable write of the atomic memory operation.

According to the spec, shmem atomics are only atomic w.r.t. other shmem atomics. This statement is far stricter, and doesn't match up with implementations using NIC atomics.

Author

Currently we avoid interactions/guarantees of atomicity between shmem atomics and "normal" PL-level atomics by defining mixed concurrent accesses as a data race in the API-level model. That is, if shmem atomics and normal atomics race, that forms a data race. Does that address this concern? This is an area I'm less familiar with, so I'm open to alternative solutions.

My understanding is that SHMEM users really should not mix atomic and non atomic references to the same addresses.
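Since the axioms above are all stated as acyclicity or irreflexivity constraints over composed relations, they can be checked mechanically on a small candidate execution. The following sketch shows the first two axioms as executable checks; the event names and example relations are illustrative assumptions, not drawn from the spec text.

```python
def compose(r, s):
    """Relational composition r;s over relations given as sets of pairs."""
    return {(a, c) for (a, b1) in r for (b2, c) in s if b1 == b2}

def irreflexive(r):
    """True iff no element is related to itself."""
    return all(a != b for (a, b) in r)

def transitive_closure(r):
    """Smallest transitive relation containing r (fixpoint iteration)."""
    closure = set(r)
    while True:
        bigger = closure | compose(closure, closure)
        if bigger == closure:
            return closure
        closure = bigger

def acyclic(r):
    # A relation is acyclic iff its transitive closure is irreflexive.
    return irreflexive(transitive_closure(r))
```

For example, with a hypothetical execution where `api_hb = {("W1", "F"), ("F", "W2")}` and `rf = {("W2", "R1")}`, both `acyclic(api_hb)` and `irreflexive(compose(rf, api_hb))` hold; adding an edge `("R1", "W2")` to `api_hb` would violate the second axiom, since R1 would happen before the write it reads from.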

return R1
\end{verbatim}
\tabularnewline\cline{2-2}
& \begin{verbatim}shmem_atomic_swap(dest, cond, value, pe):

Doesn't have cond

Use C11 as the PL-level basis for the formal API-level model.
This is consistent with the rest of the document, and differs
from the C++ model in no relevant ways.
Minor fixes to observable accesses, grammar, and
formatting for consistency.

To precisely define what behavior the \openshmem interface guarantees for both users and implementers, we formally define an API-level memory model. An API-level model builds upon a programming language-level (PL-level) memory model, and it defines how API operations impact memory. This is useful for determining the ordering and visibility requirements for \openshmem memory accesses when implemented on any device, but it is of particular importance for devices like GPUs that have software-managed caches requiring expensive bulk flush or invalidate actions to enforce memory ordering guarantees. Without a precisely specified memory model, API implementers and users may disagree on what behavior is guaranteed by the API, which can lead to poor performance (too many bulk cache actions) or buggy code (too few bulk cache actions).

This section describes how \openshmem extends the C11 memory consistency model. C11 is chosen as an example because of its generality and ubiquity in the CPU domain, and because it forms the basis for common GPU programming languages (for example, CUDA and HIP). Section~\ref{subsubsec:api_pl_baseline} summarizes the baseline C11 memory model, Sections~\ref{subsubsec:api_model_overview}--\ref{subsubsec:api_integrate} describe how this PL-level model is extended to provide API-level semantics for \openshmem, and Section~\ref{subsubsec:api_scoped} describes how the guarantees of the API-level model can be adapted for GPU memory models.
Collaborator

This section -> This Section ?


\subsubsection{API-Level Memory Model Overview}\label{subsubsec:api_model_overview}

The \openshmem API-level memory model extends PL-level memory models like C11 to define how \openshmem operations interact with memory. While PL-level models operate on fine-grain loads, stores, and fences, the API-level memory model must additionally operate on coarse-grain API operations like get, put, barrier, etc. that can execute asynchronously and involve multiple PEs.
Collaborator

Aren't there macros for get / put ? also PEs -> \acp{PE}
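One way to see the coarse-grain/fine-grain distinction in the overview paragraph above is to model each API operation as the set of fine-grain observable accesses it issues. The sketch below does this for a blocking put; the decomposition (a local read of the source buffer plus a remote write of the destination buffer) and all field names are illustrative assumptions, not spec text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ObservableAccess:
    pe: int      # PE whose memory is accessed
    addr: str    # symbolic address (illustrative)
    kind: str    # "read" or "write"
    op: str      # issuing API operation

def put_accesses(src_pe, dest_pe, src, dest):
    """Model a blocking put as its observable accesses: a read of the
    source buffer on the issuing PE and a write of the destination
    buffer on the remote PE.  (Assumed decomposition for illustration;
    the model's relations then range over accesses like these rather
    than over whole API operations.)"""
    return [
        ObservableAccess(src_pe, src, "read", "shmem_put"),
        ObservableAccess(dest_pe, dest, "write", "shmem_put"),
    ]
```

Under this view, relations such as \textit{rf} and \textit{api\_hb} are defined over these fine-grain accesses, which is what lets the API-level model reuse the machinery of a PL-level model built for loads and stores.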
