From b0665611398c200c40a9a7227109707ff843806c Mon Sep 17 00:00:00 2001 From: Rich Hornung Date: Tue, 23 Dec 2025 11:42:58 -0800 Subject: [PATCH 01/16] First cut at some RAJA Perf content --- docs/13_rajaperf/rajaperf.rst | 27 ++++++++++++++++++++++++++- 1 file changed, 26 insertions(+), 1 deletion(-) diff --git a/docs/13_rajaperf/rajaperf.rst b/docs/13_rajaperf/rajaperf.rst index 4480b1f..1e300eb 100644 --- a/docs/13_rajaperf/rajaperf.rst +++ b/docs/13_rajaperf/rajaperf.rst @@ -2,12 +2,37 @@ RAJA Performance Suite ********************** -https://github.com/LLNL/RAJAPerf +The RAJA Performance Suite is a companion project to the +`RAJA project `_, which is an open-source library +of C++ abstractions that enable single-source portable application code. The +RAJA Performance Suite contains loop-based computational kernels representative +of those found in production HPC applications. Each kernel appears in RAJA and +non-RAJA variants to enable comparison of performance between them. + +The RAJA Performance Suite is available at https://github.com/LLNL/RAJAPerf Purpose ======= +The RAJA Performance Suite is designed to analyze performance of loop-based +computational kernels found in HPC applications, specifically those implemented +using `RAJA `_. Each kernel in the Suite appears +in multiple RAJA and *non-RAJA* variants using common parallel programming +models, such as OpenMP, CUDA, HIP, and SYCL. + +The kernels in the RAJA Performance Suite originate from other HPC benchmark +suites as well as peroduction applications. Kernels are chosen and/or +developed for performance analysis of RAJA on various types of loop structures +(e.g., simple for-loops, perfectly and non-perfectly nested for-loops) and +operations (e.g., reductions, atomics, scans, sorts). In particular, many +kernels are designed to reproduce compiler optimization and other issues +observed in real applications that use RAJA. The RAJA team works with compiler +and hardware vendors to resolve the issues. + +The RAJA Performance Suite benchmark exercises a small subset of kernels in the +Suite that are chosen because they represent important computational patterns +in relevant applications. Characteristics =============== From 08e60a1660a57a61064a50615507ed92d57dcb46 Mon Sep 17 00:00:00 2001 From: Rich Hornung Date: Tue, 23 Dec 2025 16:38:52 -0800 Subject: [PATCH 02/16] Fleshing out more description of the Suite and what it does. --- docs/13_rajaperf/rajaperf.rst | 83 ++++++++++++++++++++++++----------- 1 file changed, 57 insertions(+), 26 deletions(-) diff --git a/docs/13_rajaperf/rajaperf.rst b/docs/13_rajaperf/rajaperf.rst index 84fb8f3..14e2d2d 100644 --- a/docs/13_rajaperf/rajaperf.rst +++ b/docs/13_rajaperf/rajaperf.rst @@ -2,50 +2,81 @@ RAJA Performance Suite ********************** -The RAJA Performance Suite is a companion project to the -`RAJA project `_, which is an open-source library -of C++ abstractions that enable single-source portable application code. The -RAJA Performance Suite contains loop-based computational kernels representative -of those found in production HPC applications. Each kernel appears in RAJA and -non-RAJA variants to enable comparison of performance between them. +RAJA Performance Suite source code is near-final at this point. The problems to run are yet to be finalized. -The RAJA Performance Suite is available at https://github.com/LLNL/RAJAPerf +The RAJA Performance Suite contains a collection of computational kernels that +represent important computational patterns found in HPC applications. 
It is a +companion project to RAJA, which is a library of software abstractions used to +write portable, single-source application code in C++. The Suite provides a +means to assess and analyze RAJA performance and, in particular, to compare +kernel implementations that use RAJA and those that do not use RAJA. -RAJA Performance Suite source code is near-final at this point. The problems to run are yet to be finalized. +Source code and documentation for RAJA and RAJA Performance Suite is +available at: + + * `RAJA Performance Suite GitHub project `_ + + * `RAJA GitHub project `_ + +.. important:: The RAJA Performance Suite benchmark is limited to a subset of + kernels in the RAJA Performance Suite, as described below. Purpose ======= -The RAJA Performance Suite is designed to analyze performance of loop-based -computational kernels found in HPC applications, specifically those implemented -using `RAJA `_. Each kernel in the Suite appears -in multiple RAJA and *non-RAJA* variants using common parallel programming -models, such as OpenMP, CUDA, HIP, and SYCL. - -The kernels in the RAJA Performance Suite originate from other HPC benchmark -suites as well as peroduction applications. Kernels are chosen and/or -developed for performance analysis of RAJA on various types of loop structures -(e.g., simple for-loops, perfectly and non-perfectly nested for-loops) and -operations (e.g., reductions, atomics, scans, sorts). In particular, many -kernels are designed to reproduce compiler optimization and other issues -observed in real applications that use RAJA. The RAJA team works with compiler -and hardware vendors to resolve the issues. - -The RAJA Performance Suite benchmark exercises a small subset of kernels in the -Suite that are chosen because they represent important computational patterns -in relevant applications. +The RAJA Performance Suite is used to analyze performance of loop-based +computational kernels representative of those found in HPC applications and +which are implemented using `RAJA `_. Each kernel +in the Suite appears in RAJA and *non-RAJA* variants that employ standard or +vendor-defined parallel programming models, such as OpenMP, CUDA, HIP, and +SYCL. RAJA and non-RAJA variants enable comparison of performance and +compiler-generated code that uses RAJA and that which does not. + +The kernels in the RAJA Performance Suite originate from open-source HPC +benchmark suites and restricted-access production applications. Kernels +represent various types of loop structures, such as simple for-loops, +perfectly and non-perfectly nested for-loops, and important parallel operations +including reductions, atomics, scans, and sorts. Often, kernels in the Suite +are developed to serve as reproducers of performance and compiler optimization +issues observed in production applications that use RAJA. + +When used, RAJA is the *X* in *MPI + X* parallel application paradigm, where +MPI is used for distributed memory, multi-node parallelism and X (RAJA in this +case) supports fine-grained parallelism within an MPI rank. The RAJA Performance +Suite supports MPI so that performance of a kernel in the Suite aligns with how +the kernel would perform in a real application. For example, observed memory +bandwidth may be different when running on a many core system using OpenMP +multithreading to exercise all cores than when each core is mapped to an MPI +rank. 
Similarly, on a system where a GPU can be partitioned into multiple +compute devices, performance can be different when running only a single +partition than when exercising the entire GPU with each partition assigned to +a different MPI rank. + Characteristics =============== +The RAJA Performance Suite repository contains all of its software dependencies +as submodules whose versions are pinned to the Suite version. Thus, +recursively cloning the Suite repo and its submodules is all that is needed to +configure, build, and run the Suite. + +The Suite is designed so that its key parameters and options are defined via +command-line options. The intent is that one would write scripts to execute +a series of Suite runs to generate data for a performance experiment. + Problems -------- +List and describe subset of kernels in the benchmark.... + Figure of Merit --------------- + + Source code modifications ========================= From f7c43b096fd32a3fe9178c0071e18c86d7ce6a94 Mon Sep 17 00:00:00 2001 From: Rich Hornung Date: Thu, 8 Jan 2026 07:56:43 -0800 Subject: [PATCH 03/16] Add list of kernels and brief descriptions --- docs/13_rajaperf/rajaperf.rst | 104 ++++++++++++++++++++++------------ 1 file changed, 69 insertions(+), 35 deletions(-) diff --git a/docs/13_rajaperf/rajaperf.rst b/docs/13_rajaperf/rajaperf.rst index 14e2d2d..bdaf366 100644 --- a/docs/13_rajaperf/rajaperf.rst +++ b/docs/13_rajaperf/rajaperf.rst @@ -4,14 +4,14 @@ RAJA Performance Suite RAJA Performance Suite source code is near-final at this point. The problems to run are yet to be finalized. -The RAJA Performance Suite contains a collection of computational kernels that +The RAJA Performance Suite contains a variety of numerical kernels that represent important computational patterns found in HPC applications. It is a -companion project to RAJA, which is a library of software abstractions used to -write portable, single-source application code in C++. The Suite provides a -means to assess and analyze RAJA performance and, in particular, to compare -kernel implementations that use RAJA and those that do not use RAJA. +companion project to RAJA, which is a library of software abstractions enabling +developers of C++ applications to write portable, single-source code. The Suite +provides mechanisms to analyze RAJA performance and, in particular, to compare +performance of kernel implementations that use RAJA and those that do not. -Source code and documentation for RAJA and RAJA Performance Suite is +Source code and documentation for RAJA and the RAJA Performance Suite is available at: * `RAJA Performance Suite GitHub project `_ @@ -19,57 +19,91 @@ available at: * `RAJA GitHub project `_ .. important:: The RAJA Performance Suite benchmark is limited to a subset of - kernels in the RAJA Performance Suite, as described below. + kernels in the RAJA Performance Suite as described below. Purpose ======= -The RAJA Performance Suite is used to analyze performance of loop-based -computational kernels representative of those found in HPC applications and -which are implemented using `RAJA `_. Each kernel -in the Suite appears in RAJA and *non-RAJA* variants that employ standard or -vendor-defined parallel programming models, such as OpenMP, CUDA, HIP, and -SYCL. RAJA and non-RAJA variants enable comparison of performance and +The main purpose of the RAJA Performance Suite is to analyze performance of +loop-based computational kernels representative of those found in HPC +applications and which are implemented using `RAJA `_. 
+Each kernel in the Suite appears in RAJA and *non-RAJA* variants that exercise +common parallel programming models, such as OpenMP, CUDA, HIP, and SYCL. +RAJA and non-RAJA variants enable comparison of performance and compiler-generated code that uses RAJA and that which does not. The kernels in the RAJA Performance Suite originate from open-source HPC benchmark suites and restricted-access production applications. Kernels -represent various types of loop structures, such as simple for-loops, -perfectly and non-perfectly nested for-loops, and important parallel operations -including reductions, atomics, scans, and sorts. Often, kernels in the Suite -are developed to serve as reproducers of performance and compiler optimization -issues observed in production applications that use RAJA. - -When used, RAJA is the *X* in *MPI + X* parallel application paradigm, where -MPI is used for distributed memory, multi-node parallelism and X (RAJA in this +employ various loop structures and parallel operations such as reductions, +atomics, scans, and sorts. Often, kernels in the Suite are developed to +provide vendors with simplified reproducers of performance and compiler +optimization issues observed in production applications that use RAJA. + +RAJA is the *X* in *MPI + X* parallel application paradigm, where MPI is used +for coarse-grained, distributed memory parallelism and X (RAJA in this case) supports fine-grained parallelism within an MPI rank. The RAJA Performance -Suite supports MPI so that performance of a kernel in the Suite aligns with how -the kernel would perform in a real application. For example, observed memory -bandwidth may be different when running on a many core system using OpenMP -multithreading to exercise all cores than when each core is mapped to an MPI -rank. Similarly, on a system where a GPU can be partitioned into multiple -compute devices, performance can be different when running only a single -partition than when exercising the entire GPU with each partition assigned to -a different MPI rank. +Suite supports MPI so that execution of kernels in the Suite aligns with the +way individual kernels are exercised in production HPC applications. For +example, we may want to compare performance of a kernel running on a many core +system using OpenMP multithreading to exercise all cores and the case where +each core is mapped to an MPI rank and code within each rank is executed +sequentially. Similarly, on a system where a GPU can be partitioned into +multiple compute devices, we may want to compare performance of different +GPU partitionings where each partition is assigned to a different MPI rank. Characteristics =============== The RAJA Performance Suite repository contains all of its software dependencies -as submodules whose versions are pinned to the Suite version. Thus, -recursively cloning the Suite repo and its submodules is all that is needed to -configure, build, and run the Suite. +in Git submodules; thus dependency versions are pinned to each version of +the Suite. Building the Suite requires a C++17 compliant compiler and an +MPI library installation, if MPI is used. The Suite is designed so that its key parameters and options are defined via -command-line options. The intent is that one would write scripts to execute -a series of Suite runs to generate data for a performance experiment. +command-line options. The intent is that one can build the code and use scripts +to execute a series of Suite runs to generate data for desired performance +experiments. 
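As a rough illustration of what the RAJA and non-RAJA variants described above look like -- this is a generic sketch written for this description, not code taken from the Suite, and it assumes only RAJA's ``forall`` interface with a sequential execution policy -- a simple DAXPY-style update might appear in a baseline variant and a RAJA variant along these lines:

.. code-block:: cpp

   #include "RAJA/RAJA.hpp"

   // Baseline (non-RAJA) variant: a plain C-style loop.
   void daxpy_base(double* y, const double* x, double a, int N)
   {
     for (int i = 0; i < N; ++i) {
       y[i] += a * x[i];
     }
   }

   // RAJA variant: the same loop body is passed as a lambda to RAJA::forall.
   // Swapping the execution policy (e.g., an OpenMP or GPU policy) retargets
   // the kernel without rewriting the loop body.
   void daxpy_raja(double* y, const double* x, double a, int N)
   {
     RAJA::forall<RAJA::seq_exec>(RAJA::RangeSegment(0, N),
       [=] (RAJA::Index_type i) {
         y[i] += a * x[i];
       });
   }

Comparing run times and compiler-generated code for variant pairs like these, across execution policies and problem sizes, is the kind of experiment the Suite's command-line options are intended to script.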
Problems -------- -List and describe subset of kernels in the benchmark.... +The RAJA Performance Suite benchmark is limited to a subset of kernels. + +.. note:: There is a reference description for each kernel located in the + header file for the kernel object ``kernel-name.hpp``. The + reference is a C-style sequential implementation of the kernel in + a comment section near the top of the header file. + + * *Apps* group (directory src/apps) + + #. **CONVECTION3DPA** action of a 3D finite element convection operator (matrix) via partial assembly + #. **DEL_DOT_VEC_2D** divergence of a vector field on a set of points on a mesh, where the mesh points are traversed using an indirection array + #. **DIFFUSION3DPA** action of a 3D finite element diffusion operator (matrix) via partial assembly + #. **EDGE3D** stiffness matrix assembly for a 3D MHD calculation + #. **ENERGY** internal energy calculation for an explicit hydrodynamics calculation; illustrates conditional logic used to apply various cutoffs + #. **FEMSWEEP** linear sweep used in a finite element implementation of radiation transport + #. **INTSC_HEXHEX** intersection between two 24-sided hexahedra, including volume and moment calculations + #. **INTSC_HEXRECT** intersection between a 24-sided hexahedron and a rectangular solid, including volume and moment calculations + #. **LTIMES** one step of the source-iteration technique for solving the steady-state linear Boltzmann equation -- multi-dimensional matrix product + #. **MASS3DEA** assembly of a 3D finite element mass matrix + #. **MASSVEC3DPA** action of a 3D finite element mass matrix via partial assembly on a block vector + #. **MATVEC_3D_STENCIL** matrix-vector product based on a 3D mesh stencil + #. **NODAL_ACCUMULATION_3D** on a 3D structured hexahedral mesh, sum a constribution from each hex vertex (nodal value) to its centroid (zonal value) -- 8-way atomic contention + #. **VOL3D** on a 3D structured hexahedral mesh (faces are not necessarily planes), compute volume of each zone (hex) + + * *Basic* group (directory src/basic) + + #. **INDEXLIST_3LOOP** construction of list of indices based on some boolean test to enumerate iterates for a subsequent kernel execution -- exercises vendor scan implementations + #. **MULTI_REDUCE** multiple reductions in a kernel, where number of reductions is set at run time + #. **REDUCE_STRUCT** multiple reductions in a kernel, where number of reductions (6) is known at compile time + + * *Comm* group (directory src/comm) + + #. **HALO_EXCHANGE_FUSED** packing and unpacking MPI message buffers for point-to-point distributed memory communication -- represents halo data exchange for mesh-based codes + + Figure of Merit --------------- From 3792cf6f0b6ae57f3f761484a1c81b10c2a75e4a Mon Sep 17 00:00:00 2001 From: Rich Hornung Date: Thu, 8 Jan 2026 15:46:24 -0800 Subject: [PATCH 04/16] Attempt to describe relevant aspects of each kernel --- docs/13_rajaperf/rajaperf.rst | 113 ++++++++++++++++++---------------- 1 file changed, 59 insertions(+), 54 deletions(-) diff --git a/docs/13_rajaperf/rajaperf.rst b/docs/13_rajaperf/rajaperf.rst index bdaf366..83cf3c5 100644 --- a/docs/13_rajaperf/rajaperf.rst +++ b/docs/13_rajaperf/rajaperf.rst @@ -6,10 +6,10 @@ RAJA Performance Suite source code is near-final at this point. The problems to The RAJA Performance Suite contains a variety of numerical kernels that represent important computational patterns found in HPC applications. 
It is a -companion project to RAJA, which is a library of software abstractions enabling +companion project to RAJA, which is a library of software abstractions used by developers of C++ applications to write portable, single-source code. The Suite -provides mechanisms to analyze RAJA performance and, in particular, to compare -performance of kernel implementations that use RAJA and those that do not. +enables RAJA performance analysis experiments and performance comparisons +between kernel implementations that use RAJA and those that do not. Source code and documentation for RAJA and the RAJA Performance Suite is available at: @@ -28,80 +28,85 @@ Purpose The main purpose of the RAJA Performance Suite is to analyze performance of loop-based computational kernels representative of those found in HPC applications and which are implemented using `RAJA `_. -Each kernel in the Suite appears in RAJA and *non-RAJA* variants that exercise +Each kernel in the Suite appears in RAJA and non-RAJA variants that exercise common parallel programming models, such as OpenMP, CUDA, HIP, and SYCL. -RAJA and non-RAJA variants enable comparison of performance and -compiler-generated code that uses RAJA and that which does not. - -The kernels in the RAJA Performance Suite originate from open-source HPC -benchmark suites and restricted-access production applications. Kernels -employ various loop structures and parallel operations such as reductions, -atomics, scans, and sorts. Often, kernels in the Suite are developed to -provide vendors with simplified reproducers of performance and compiler -optimization issues observed in production applications that use RAJA. - -RAJA is the *X* in *MPI + X* parallel application paradigm, where MPI is used -for coarse-grained, distributed memory parallelism and X (RAJA in this -case) supports fine-grained parallelism within an MPI rank. The RAJA Performance -Suite supports MPI so that execution of kernels in the Suite aligns with the -way individual kernels are exercised in production HPC applications. For -example, we may want to compare performance of a kernel running on a many core -system using OpenMP multithreading to exercise all cores and the case where -each core is mapped to an MPI rank and code within each rank is executed -sequentially. Similarly, on a system where a GPU can be partitioned into -multiple compute devices, we may want to compare performance of different -GPU partitionings where each partition is assigned to a different MPI rank. +RAJA and non-RAJA variants enable performance comparisons between +implementations that use RAJA and those that do not. Such comparisons are +helpful to improve RAJA implementation and to identify potential impacts +C++ abstractions have on compilers' ability to optimize. + +The kernels in the RAJA Performance Suite originate from different sources +ranging from open-source HPC benchmark suites to restricted-access production +applications. Kernels exercise various loop structures as well as parallel +operations such as reductions, atomics, scans, and sorts. Often, kernels in +the Suite are developed to work with vendors to resolve performance issues +observed in production applications that use RAJA. + +RAJA is a potential *X* in the commonly used *MPI + X* parallel application +paradigm, where MPI is used for coarse-grained, distributed memory parallelism +and X (RAJA in this case) enables fine-grained parallelism in each MPI rank. 
+The RAJA Performance Suite supports MPI so that execution of kernels in the +Suite aligns with the way numerical kernels are exercised in an MPI + X HPC +applications. For example, one may want to compare performance of a kernel +running on a many core system using OpenMP multithreading to exercise all +cores and the case where each core is mapped to an MPI rank and code within +each rank is executed sequentially. Similarly, on a system where a GPU can be +partitioned into multiple compute devices, one may want to compare performance +of different GPU partitionings where each partition is assigned to a different +MPI rank. Characteristics =============== -The RAJA Performance Suite repository contains all of its software dependencies -in Git submodules; thus dependency versions are pinned to each version of -the Suite. Building the Suite requires a C++17 compliant compiler and an -MPI library installation, if MPI is used. +The `RAJA Performance Suite GitHub project `_ +contains the code for all the Suite kernels and all of essential software +dependencies in Git submodules. Thus, dependency versions are pinned to each +version of the Suite. Building the Suite requires an installation of CMake for +configuring a build, a C++17 compliant compiler to build the code, and an MPI +library installation, if MPI is to be used. -The Suite is designed so that its key parameters and options are defined via -command-line options. The intent is that one can build the code and use scripts -to execute a series of Suite runs to generate data for desired performance -experiments. +The Suite can be run in a myriad of ways by specifying parameters and options +in its command-line interface. The intent is that one can build the code and +use scripts to execute a series of Suite runs to generate data for each desired +performance experiment. Problems -------- -The RAJA Performance Suite benchmark is limited to a subset of kernels. +The RAJA Performance Suite benchmark is limited to a subset of kernels +listed below. -.. note:: There is a reference description for each kernel located in the - header file for the kernel object ``kernel-name.hpp``. The +.. note:: Each kernel contains a reference description which is located in the + header file for the kernel object ``.hpp``. The reference is a C-style sequential implementation of the kernel in a comment section near the top of the header file. * *Apps* group (directory src/apps) - #. **CONVECTION3DPA** action of a 3D finite element convection operator (matrix) via partial assembly - #. **DEL_DOT_VEC_2D** divergence of a vector field on a set of points on a mesh, where the mesh points are traversed using an indirection array - #. **DIFFUSION3DPA** action of a 3D finite element diffusion operator (matrix) via partial assembly - #. **EDGE3D** stiffness matrix assembly for a 3D MHD calculation - #. **ENERGY** internal energy calculation for an explicit hydrodynamics calculation; illustrates conditional logic used to apply various cutoffs - #. **FEMSWEEP** linear sweep used in a finite element implementation of radiation transport - #. **INTSC_HEXHEX** intersection between two 24-sided hexahedra, including volume and moment calculations - #. **INTSC_HEXRECT** intersection between a 24-sided hexahedron and a rectangular solid, including volume and moment calculations - #. **LTIMES** one step of the source-iteration technique for solving the steady-state linear Boltzmann equation -- multi-dimensional matrix product - #. 
**MASS3DEA** assembly of a 3D finite element mass matrix - #. **MASSVEC3DPA** action of a 3D finite element mass matrix via partial assembly on a block vector - #. **MATVEC_3D_STENCIL** matrix-vector product based on a 3D mesh stencil - #. **NODAL_ACCUMULATION_3D** on a 3D structured hexahedral mesh, sum a constribution from each hex vertex (nodal value) to its centroid (zonal value) -- 8-way atomic contention - #. **VOL3D** on a 3D structured hexahedral mesh (faces are not necessarily planes), compute volume of each zone (hex) + #. **CONVECTION3DPA** action of a 3D finite element convection operator via partial assembly *(nested loops, GPU shared memory, RAJA::launch API)* + #. **DEL_DOT_VEC_2D** divergence of a vector field at a set of points on a mesh *(single loop, data access via indirection array, RAJA::forall API)* + #. **DIFFUSION3DPA** action of a 3D finite element diffusion operator via partial assembly *(nested loops, GPU shared memory, RAJA::launch API)* + #. **EDGE3D** stiffness matrix assembly for a 3D MHD calculation *(single loop with included function call, RAJA::forall API)* + #. **ENERGY** internal energy calculation from an explicit hydrodynamics algorithm; *(multiple single-loop operations in sequence, conditional logic for correctness checks and cutoffs, RAJA::forall API)* + #. **FEMSWEEP** finite element implementation of linear sweep algorithm used in radiation transport *(nested loops, RAJA::launch API)* + #. **INTSC_HEXHEX** intersection between two 24-sided hexahedra, including volume and moment calculations *(multiple single-loop operations in sequence, RAJA::forall API)* + #. **INTSC_HEXRECT** intersection between a 24-sided hexahedron and a rectangular solid, including volume and moment calculations *(single loop, RAJA::forall API)* + #. **LTIMES** one step of the source-iteration technique for solving the steady-state linear Boltzmann equation, multi-dimensional matrix product *(nested loops, RAJA::kernel API)* + #. **MASS3DEA** assembly of a 3D finite element mass matrix *(nested loops, GPU shared memory, RAJA::launch API)* + #. **MASSVEC3DPA** action of a 3D finite element mass matrix via partial assembly on a block vector *(nested loops, GPU shared memory, RAJA::launch API)* + #. **MATVEC_3D_STENCIL** matrix-vector product based on a 3D mesh stencil *(single loop, data access via indirection array, RAJA::forall API)* + #. **NODAL_ACCUMULATION_3D** on a 3D structured hexahedral mesh, sum a constribution from each hex vertex (nodal value) to its centroid (zonal value) *(single loop, data access via indirection array, 8-way atomic contention, RAJA::forall API)* + #. **VOL3D** on a 3D structured hexahedral mesh (faces are not necessarily planes), compute volume of each zone (hex) *(single loop, RAJA::forall API)* * *Basic* group (directory src/basic) - #. **INDEXLIST_3LOOP** construction of list of indices based on some boolean test to enumerate iterates for a subsequent kernel execution -- exercises vendor scan implementations - #. **MULTI_REDUCE** multiple reductions in a kernel, where number of reductions is set at run time - #. **REDUCE_STRUCT** multiple reductions in a kernel, where number of reductions (6) is known at compile time + #. **MULTI_REDUCE** multiple reductions in a kernel, where number of reductions is set at run time *(single loop, irregular atomic contention, RAJA::forall API)* + #. 
**REDUCE_STRUCT** multiple reductions in a kernel, where number of reductions (6) is known at compile time *(single loop, multiple reductions, RAJA::forall API)* * *Comm* group (directory src/comm) - #. **HALO_EXCHANGE_FUSED** packing and unpacking MPI message buffers for point-to-point distributed memory communication -- represents halo data exchange for mesh-based codes + #. **HALO_EXCHANGE_FUSED** packing and unpacking MPI message buffers for point-to-point distributed memory halo data exchange for mesh-based codes *(overhead of launching many small kernels, GPU variants use RAJA::Workgroup concepts to execute multiple kernels with one launch)* From 6a080c66925d1c92203eb297b861d251bf513559 Mon Sep 17 00:00:00 2001 From: Rich Hornung Date: Fri, 9 Jan 2026 15:17:20 -0800 Subject: [PATCH 05/16] More cleanup --- docs/13_rajaperf/rajaperf.rst | 48 +++++++++++++++++------------------ 1 file changed, 24 insertions(+), 24 deletions(-) diff --git a/docs/13_rajaperf/rajaperf.rst b/docs/13_rajaperf/rajaperf.rst index 83cf3c5..9555759 100644 --- a/docs/13_rajaperf/rajaperf.rst +++ b/docs/13_rajaperf/rajaperf.rst @@ -7,9 +7,9 @@ RAJA Performance Suite source code is near-final at this point. The problems to The RAJA Performance Suite contains a variety of numerical kernels that represent important computational patterns found in HPC applications. It is a companion project to RAJA, which is a library of software abstractions used by -developers of C++ applications to write portable, single-source code. The Suite -enables RAJA performance analysis experiments and performance comparisons -between kernel implementations that use RAJA and those that do not. +developers of C++ applications to write portable, single-source code. The RAJA +Performance Suite enables performance experiments and comparisons for kernel +variants that use RAJA and those that do not. Source code and documentation for RAJA and the RAJA Performance Suite is available at: @@ -28,32 +28,32 @@ Purpose The main purpose of the RAJA Performance Suite is to analyze performance of loop-based computational kernels representative of those found in HPC applications and which are implemented using `RAJA `_. +The kernels in the Suite originate from different sources ranging from +open-source HPC benchmarks to restricted-access production applications. +Kernels exercise various loop structures as well as parallel operations such +as reductions, atomics, scans, and sorts. + Each kernel in the Suite appears in RAJA and non-RAJA variants that exercise -common parallel programming models, such as OpenMP, CUDA, HIP, and SYCL. -RAJA and non-RAJA variants enable performance comparisons between -implementations that use RAJA and those that do not. Such comparisons are -helpful to improve RAJA implementation and to identify potential impacts -C++ abstractions have on compilers' ability to optimize. - -The kernels in the RAJA Performance Suite originate from different sources -ranging from open-source HPC benchmark suites to restricted-access production -applications. Kernels exercise various loop structures as well as parallel -operations such as reductions, atomics, scans, and sorts. Often, kernels in -the Suite are developed to work with vendors to resolve performance issues +common programming models, such as OpenMP, CUDA, HIP, and SYCL. Performance +comparisons between RAJA and non-RAJA variants are helpful to improve RAJA +implementation and to identify impacts C++ abstractions have on compilers' +ability to optimize. 
Often, kernels in the Suite serve as collaboration tools +enabling the RAJA team to work with vendors to resolve performance issues observed in production applications that use RAJA. RAJA is a potential *X* in the commonly used *MPI + X* parallel application paradigm, where MPI is used for coarse-grained, distributed memory parallelism -and X (RAJA in this case) enables fine-grained parallelism in each MPI rank. -The RAJA Performance Suite supports MPI so that execution of kernels in the -Suite aligns with the way numerical kernels are exercised in an MPI + X HPC -applications. For example, one may want to compare performance of a kernel -running on a many core system using OpenMP multithreading to exercise all -cores and the case where each core is mapped to an MPI rank and code within -each rank is executed sequentially. Similarly, on a system where a GPU can be -partitioned into multiple compute devices, one may want to compare performance -of different GPU partitionings where each partition is assigned to a different -MPI rank. +and X (e.g., RAJA) supports fine-grained parallelism within an MPI rank. +The RAJA Performance Suite can be configured with MPI so that execution of +kernels in the Suite is representative of the ways in which numerical kernels +are exercised in an MPI + X HPC applications. When the RAJA Performance Suite +is run using multiple MPI ranks, the same kernel code is executed on each rank. +Synchronization and communication across ranks involves only sending execution +timing information to rank zero. + +.. important:: For RAJA Performance Suite benchmark execution, MPI must be used + to ensure that all resources on a compute node are being + exercised and avoid misrepresenting node performance. Characteristics From d75784014c4598a138c50c0d1e69d1d8d3c849e1 Mon Sep 17 00:00:00 2001 From: Rich Hornung Date: Thu, 15 Jan 2026 10:07:29 -0800 Subject: [PATCH 06/16] Attempt to prioritize kernels in terms of importance --- docs/13_rajaperf/rajaperf.rst | 30 ++++++++++++++++++++---------- 1 file changed, 20 insertions(+), 10 deletions(-) diff --git a/docs/13_rajaperf/rajaperf.rst b/docs/13_rajaperf/rajaperf.rst index 9555759..2eb8b22 100644 --- a/docs/13_rajaperf/rajaperf.rst +++ b/docs/13_rajaperf/rajaperf.rst @@ -2,7 +2,7 @@ RAJA Performance Suite ********************** -RAJA Performance Suite source code is near-final at this point. The problems to run are yet to be finalized. +RAJA Performance Suite source code is near-final at this point. The problems to run are not yet finalized. The RAJA Performance Suite contains a variety of numerical kernels that represent important computational patterns found in HPC applications. It is a @@ -82,29 +82,39 @@ listed below. reference is a C-style sequential implementation of the kernel in a comment section near the top of the header file. - * *Apps* group (directory src/apps) + * *Apps* group (directory src/apps) -- Tier 1 - #. **CONVECTION3DPA** action of a 3D finite element convection operator via partial assembly *(nested loops, GPU shared memory, RAJA::launch API)* - #. **DEL_DOT_VEC_2D** divergence of a vector field at a set of points on a mesh *(single loop, data access via indirection array, RAJA::forall API)* #. **DIFFUSION3DPA** action of a 3D finite element diffusion operator via partial assembly *(nested loops, GPU shared memory, RAJA::launch API)* #. **EDGE3D** stiffness matrix assembly for a 3D MHD calculation *(single loop with included function call, RAJA::forall API)* #. 
**ENERGY** internal energy calculation from an explicit hydrodynamics algorithm; *(multiple single-loop operations in sequence, conditional logic for correctness checks and cutoffs, RAJA::forall API)* #. **FEMSWEEP** finite element implementation of linear sweep algorithm used in radiation transport *(nested loops, RAJA::launch API)* #. **INTSC_HEXHEX** intersection between two 24-sided hexahedra, including volume and moment calculations *(multiple single-loop operations in sequence, RAJA::forall API)* #. **INTSC_HEXRECT** intersection between a 24-sided hexahedron and a rectangular solid, including volume and moment calculations *(single loop, RAJA::forall API)* - #. **LTIMES** one step of the source-iteration technique for solving the steady-state linear Boltzmann equation, multi-dimensional matrix product *(nested loops, RAJA::kernel API)* - #. **MASS3DEA** assembly of a 3D finite element mass matrix *(nested loops, GPU shared memory, RAJA::launch API)* + #. **MASS3DEA** element assembly of a 3D finite element mass matrix *(nested loops, GPU shared memory, RAJA::launch API)* + #. **MASS3DPA_ATOMIC** action of a 3D finite element mass matrix via partial assembly *(nested loops, GPU shared memory, RAJA::launch API)* #. **MASSVEC3DPA** action of a 3D finite element mass matrix via partial assembly on a block vector *(nested loops, GPU shared memory, RAJA::launch API)* #. **MATVEC_3D_STENCIL** matrix-vector product based on a 3D mesh stencil *(single loop, data access via indirection array, RAJA::forall API)* #. **NODAL_ACCUMULATION_3D** on a 3D structured hexahedral mesh, sum a constribution from each hex vertex (nodal value) to its centroid (zonal value) *(single loop, data access via indirection array, 8-way atomic contention, RAJA::forall API)* - #. **VOL3D** on a 3D structured hexahedral mesh (faces are not necessarily planes), compute volume of each zone (hex) *(single loop, RAJA::forall API)* + #. **VOL3D** on a 3D structured hexahedral mesh (faces are not necessarily planes), compute volume of each zone (hex) *(single loop, data access via indirection array, RAJA::forall API)* + + * *Apps* group (directory src/apps) -- Tier 2 - * *Basic* group (directory src/basic) + #. **CONVECTION3DPA** action of a 3D finite element convection operator via partial assembly *(nested loops, GPU shared memory, RAJA::launch API)* + #. **DEL_DOT_VEC_2D** divergence of a vector field at a set of points on a mesh *(single loop, data access via indirection array, RAJA::forall API)* + #. **MASS3DPA** action of a 3D finite element mass matrix via partial assembly *(nested loops, GPU shared memory, RAJA::launch API)* + #. **MASSVEC3DPA_ATOMIC** action of a 3D finite element mass matrix via partial assembly on a block vector *(nested loops, GPU shared memory, RAJA::launch API)* - #. **MULTI_REDUCE** multiple reductions in a kernel, where number of reductions is set at run time *(single loop, irregular atomic contention, RAJA::forall API)* + + * *Basic* group (directory src/basic) -- Tier 1 + + #. **MULTI_REDUCE** multiple reductions in a kernel, where number of reductions is set at run time *(single loop, irregular atomic contention, RAJA::forall API)* #. **REDUCE_STRUCT** multiple reductions in a kernel, where number of reductions (6) is known at compile time *(single loop, multiple reductions, RAJA::forall API)* - * *Comm* group (directory src/comm) + * *Basic* group (directory src/basic) -- Tier 2 + + #. 
**INDEXLIST_3LOOP** construction of set of indices used in other kernel executions *(single loops, vendor scan implementations, RAJA::forall API)* + + * *Comm* group (directory src/comm) -- Tier 2 #. **HALO_EXCHANGE_FUSED** packing and unpacking MPI message buffers for point-to-point distributed memory halo data exchange for mesh-based codes *(overhead of launching many small kernels, GPU variants use RAJA::Workgroup concepts to execute multiple kernels with one launch)* From cee7fd9ae9dc5ce56887404f1fc23ff570e93027 Mon Sep 17 00:00:00 2001 From: Rich Hornung Date: Thu, 15 Jan 2026 10:09:12 -0800 Subject: [PATCH 07/16] Bold tier levels --- docs/13_rajaperf/rajaperf.rst | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/13_rajaperf/rajaperf.rst b/docs/13_rajaperf/rajaperf.rst index 2eb8b22..fbbf958 100644 --- a/docs/13_rajaperf/rajaperf.rst +++ b/docs/13_rajaperf/rajaperf.rst @@ -82,7 +82,7 @@ listed below. reference is a C-style sequential implementation of the kernel in a comment section near the top of the header file. - * *Apps* group (directory src/apps) -- Tier 1 + * *Apps* group (directory src/apps) -- **Tier 1** #. **DIFFUSION3DPA** action of a 3D finite element diffusion operator via partial assembly *(nested loops, GPU shared memory, RAJA::launch API)* #. **EDGE3D** stiffness matrix assembly for a 3D MHD calculation *(single loop with included function call, RAJA::forall API)* @@ -97,7 +97,7 @@ listed below. #. **NODAL_ACCUMULATION_3D** on a 3D structured hexahedral mesh, sum a constribution from each hex vertex (nodal value) to its centroid (zonal value) *(single loop, data access via indirection array, 8-way atomic contention, RAJA::forall API)* #. **VOL3D** on a 3D structured hexahedral mesh (faces are not necessarily planes), compute volume of each zone (hex) *(single loop, data access via indirection array, RAJA::forall API)* - * *Apps* group (directory src/apps) -- Tier 2 + * *Apps* group (directory src/apps) -- **Tier 2** #. **CONVECTION3DPA** action of a 3D finite element convection operator via partial assembly *(nested loops, GPU shared memory, RAJA::launch API)* #. **DEL_DOT_VEC_2D** divergence of a vector field at a set of points on a mesh *(single loop, data access via indirection array, RAJA::forall API)* @@ -105,16 +105,16 @@ listed below. #. **MASSVEC3DPA_ATOMIC** action of a 3D finite element mass matrix via partial assembly on a block vector *(nested loops, GPU shared memory, RAJA::launch API)* - * *Basic* group (directory src/basic) -- Tier 1 + * *Basic* group (directory src/basic) -- **Tier 1** #. **MULTI_REDUCE** multiple reductions in a kernel, where number of reductions is set at run time *(single loop, irregular atomic contention, RAJA::forall API)* #. **REDUCE_STRUCT** multiple reductions in a kernel, where number of reductions (6) is known at compile time *(single loop, multiple reductions, RAJA::forall API)* - * *Basic* group (directory src/basic) -- Tier 2 + * *Basic* group (directory src/basic) -- **Tier 2** #. **INDEXLIST_3LOOP** construction of set of indices used in other kernel executions *(single loops, vendor scan implementations, RAJA::forall API)* - * *Comm* group (directory src/comm) -- Tier 2 + * *Comm* group (directory src/comm) -- **Tier 2** #. 
**HALO_EXCHANGE_FUSED** packing and unpacking MPI message buffers for point-to-point distributed memory halo data exchange for mesh-based codes *(overhead of launching many small kernels, GPU variants use RAJA::Workgroup concepts to execute multiple kernels with one launch)* From b6ddf0a488d0448be84dbc546167c5980d9f4a31 Mon Sep 17 00:00:00 2001 From: Rich Hornung Date: Thu, 15 Jan 2026 11:01:21 -0800 Subject: [PATCH 08/16] Cleanup kernel lists and descriptions --- docs/13_rajaperf/rajaperf.rst | 29 ++++++++++++++++------------- 1 file changed, 16 insertions(+), 13 deletions(-) diff --git a/docs/13_rajaperf/rajaperf.rst b/docs/13_rajaperf/rajaperf.rst index fbbf958..a1c6ac6 100644 --- a/docs/13_rajaperf/rajaperf.rst +++ b/docs/13_rajaperf/rajaperf.rst @@ -82,39 +82,42 @@ listed below. reference is a C-style sequential implementation of the kernel in a comment section near the top of the header file. - * *Apps* group (directory src/apps) -- **Tier 1** +Tier 1 kernels +^^^^^^^^^^^^^^^ + + * *Apps* group (directory src/apps) #. **DIFFUSION3DPA** action of a 3D finite element diffusion operator via partial assembly *(nested loops, GPU shared memory, RAJA::launch API)* #. **EDGE3D** stiffness matrix assembly for a 3D MHD calculation *(single loop with included function call, RAJA::forall API)* #. **ENERGY** internal energy calculation from an explicit hydrodynamics algorithm; *(multiple single-loop operations in sequence, conditional logic for correctness checks and cutoffs, RAJA::forall API)* #. **FEMSWEEP** finite element implementation of linear sweep algorithm used in radiation transport *(nested loops, RAJA::launch API)* #. **INTSC_HEXHEX** intersection between two 24-sided hexahedra, including volume and moment calculations *(multiple single-loop operations in sequence, RAJA::forall API)* - #. **INTSC_HEXRECT** intersection between a 24-sided hexahedron and a rectangular solid, including volume and moment calculations *(single loop, RAJA::forall API)* #. **MASS3DEA** element assembly of a 3D finite element mass matrix *(nested loops, GPU shared memory, RAJA::launch API)* - #. **MASS3DPA_ATOMIC** action of a 3D finite element mass matrix via partial assembly *(nested loops, GPU shared memory, RAJA::launch API)* + #. **MASS3DPA_ATOMIC** action of a 3D finite element mass matrix on elements with shared DOFs via partial assembly *(nested loops, GPU shared memory, RAJA::launch API)* #. **MASSVEC3DPA** action of a 3D finite element mass matrix via partial assembly on a block vector *(nested loops, GPU shared memory, RAJA::launch API)* #. **MATVEC_3D_STENCIL** matrix-vector product based on a 3D mesh stencil *(single loop, data access via indirection array, RAJA::forall API)* #. **NODAL_ACCUMULATION_3D** on a 3D structured hexahedral mesh, sum a constribution from each hex vertex (nodal value) to its centroid (zonal value) *(single loop, data access via indirection array, 8-way atomic contention, RAJA::forall API)* #. **VOL3D** on a 3D structured hexahedral mesh (faces are not necessarily planes), compute volume of each zone (hex) *(single loop, data access via indirection array, RAJA::forall API)* - * *Apps* group (directory src/apps) -- **Tier 2** + +Tier 2 kernels +^^^^^^^^^^^^^^^ + + * *Apps* group (directory src/apps) #. **CONVECTION3DPA** action of a 3D finite element convection operator via partial assembly *(nested loops, GPU shared memory, RAJA::launch API)* #. 
**DEL_DOT_VEC_2D** divergence of a vector field at a set of points on a mesh *(single loop, data access via indirection array, RAJA::forall API)* - #. **MASS3DPA** action of a 3D finite element mass matrix via partial assembly *(nested loops, GPU shared memory, RAJA::launch API)* - #. **MASSVEC3DPA_ATOMIC** action of a 3D finite element mass matrix via partial assembly on a block vector *(nested loops, GPU shared memory, RAJA::launch API)* - + #. **INTSC_HEXRECT** intersection between a 24-sided hexahedron and a rectangular solid, including volume and moment calculations *(single loop, RAJA::forall API)* + #. **LTIMES** one step of the source-iteration technique for solving the steady-state linear Boltzmann equation, multi-dimensional matrix product *(nested loops, RAJA::kernel API)* + #. **MASS3DPA** action of a 3D finite element mass matrix on disconnected elements via partial assembly *(nested loops, GPU shared memory, RAJA::launch API)* - * *Basic* group (directory src/basic) -- **Tier 1** + * *Basic* group (directory src/basic) #. **MULTI_REDUCE** multiple reductions in a kernel, where number of reductions is set at run time *(single loop, irregular atomic contention, RAJA::forall API)* #. **REDUCE_STRUCT** multiple reductions in a kernel, where number of reductions (6) is known at compile time *(single loop, multiple reductions, RAJA::forall API)* - - * *Basic* group (directory src/basic) -- **Tier 2** - #. **INDEXLIST_3LOOP** construction of set of indices used in other kernel executions *(single loops, vendor scan implementations, RAJA::forall API)* - * *Comm* group (directory src/comm) -- **Tier 2** + * *Comm* group (directory src/comm) #. **HALO_EXCHANGE_FUSED** packing and unpacking MPI message buffers for point-to-point distributed memory halo data exchange for mesh-based codes *(overhead of launching many small kernels, GPU variants use RAJA::Workgroup concepts to execute multiple kernels with one launch)* @@ -132,7 +135,7 @@ Source code modifications Please see :ref:`GlobalRunRules` for general guidance on allowed modifications. For the RAJA Performance Suite, we define the following restrictions on source code modifications: -* RAJA Performance Suite uses RAJA as the portability library, available at https://github.com/LLNL/RAJA . While source code changes to RAJA can be proposed, RAJA in RAJA Performance Suite may not be removed or replaced with any other library. +* RAJA Performance Suite uses RAJA as the portability library, available at https://github.com/LLNL/RAJA. While source code changes to RAJA can be proposed, RAJA in RAJA Performance Suite may not be removed or replaced with any other library. Building From 53a46cde084058610d0f7f5b9c98cc71d477b8b0 Mon Sep 17 00:00:00 2001 From: Rich Hornung Date: Thu, 15 Jan 2026 11:03:11 -0800 Subject: [PATCH 09/16] Change kernel priority --- docs/13_rajaperf/rajaperf.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/13_rajaperf/rajaperf.rst b/docs/13_rajaperf/rajaperf.rst index a1c6ac6..33ea061 100644 --- a/docs/13_rajaperf/rajaperf.rst +++ b/docs/13_rajaperf/rajaperf.rst @@ -91,7 +91,7 @@ Tier 1 kernels #. **EDGE3D** stiffness matrix assembly for a 3D MHD calculation *(single loop with included function call, RAJA::forall API)* #. **ENERGY** internal energy calculation from an explicit hydrodynamics algorithm; *(multiple single-loop operations in sequence, conditional logic for correctness checks and cutoffs, RAJA::forall API)* #. 
**FEMSWEEP** finite element implementation of linear sweep algorithm used in radiation transport *(nested loops, RAJA::launch API)* - #. **INTSC_HEXHEX** intersection between two 24-sided hexahedra, including volume and moment calculations *(multiple single-loop operations in sequence, RAJA::forall API)* + #. **INTSC_HEXRECT** intersection between a 24-sided hexahedron and a rectangular solid, including volume and moment calculations *(single loop, RAJA::forall API)* #. **MASS3DEA** element assembly of a 3D finite element mass matrix *(nested loops, GPU shared memory, RAJA::launch API)* #. **MASS3DPA_ATOMIC** action of a 3D finite element mass matrix on elements with shared DOFs via partial assembly *(nested loops, GPU shared memory, RAJA::launch API)* #. **MASSVEC3DPA** action of a 3D finite element mass matrix via partial assembly on a block vector *(nested loops, GPU shared memory, RAJA::launch API)* @@ -107,7 +107,7 @@ Tier 2 kernels #. **CONVECTION3DPA** action of a 3D finite element convection operator via partial assembly *(nested loops, GPU shared memory, RAJA::launch API)* #. **DEL_DOT_VEC_2D** divergence of a vector field at a set of points on a mesh *(single loop, data access via indirection array, RAJA::forall API)* - #. **INTSC_HEXRECT** intersection between a 24-sided hexahedron and a rectangular solid, including volume and moment calculations *(single loop, RAJA::forall API)* + #. **INTSC_HEXHEX** intersection between two 24-sided hexahedra, including volume and moment calculations *(multiple single-loop operations in sequence, RAJA::forall API)* #. **LTIMES** one step of the source-iteration technique for solving the steady-state linear Boltzmann equation, multi-dimensional matrix product *(nested loops, RAJA::kernel API)* #. **MASS3DPA** action of a 3D finite element mass matrix on disconnected elements via partial assembly *(nested loops, GPU shared memory, RAJA::launch API)* From 374a875a25f82ce26ea598d92daae9b825455130 Mon Sep 17 00:00:00 2001 From: Rich Hornung Date: Thu, 15 Jan 2026 15:01:02 -0800 Subject: [PATCH 10/16] More rework kernel section --- docs/13_rajaperf/rajaperf.rst | 24 +++++++++++++----------- 1 file changed, 13 insertions(+), 11 deletions(-) diff --git a/docs/13_rajaperf/rajaperf.rst b/docs/13_rajaperf/rajaperf.rst index 33ea061..79800c0 100644 --- a/docs/13_rajaperf/rajaperf.rst +++ b/docs/13_rajaperf/rajaperf.rst @@ -74,42 +74,44 @@ performance experiment. Problems -------- -The RAJA Performance Suite benchmark is limited to a subset of kernels -listed below. +The RAJA Performance Suite benchmark is limited to a subset of kernels in the +full Suite to focus on some of the more important computational patterns found +in LLNL applications. The subset of kernels is listed below, which contains +a brief description of each kernel and its most prominant features. -.. note:: Each kernel contains a reference description which is located in the +.. note:: Each kernel contains a complete reference description located in the header file for the kernel object ``.hpp``. The reference is a C-style sequential implementation of the kernel in a comment section near the top of the header file. -Tier 1 kernels -^^^^^^^^^^^^^^^ +Priority 1 kernels +^^^^^^^^^^^^^^^^^^^ * *Apps* group (directory src/apps) - #. **DIFFUSION3DPA** action of a 3D finite element diffusion operator via partial assembly *(nested loops, GPU shared memory, RAJA::launch API)* + #. 
**DIFFUSION3DPA** element-wise action of a 3D finite element volume diffusion operator via partial assembly *(nested loops, GPU shared memory, RAJA::launch API)* #. **EDGE3D** stiffness matrix assembly for a 3D MHD calculation *(single loop with included function call, RAJA::forall API)* #. **ENERGY** internal energy calculation from an explicit hydrodynamics algorithm; *(multiple single-loop operations in sequence, conditional logic for correctness checks and cutoffs, RAJA::forall API)* #. **FEMSWEEP** finite element implementation of linear sweep algorithm used in radiation transport *(nested loops, RAJA::launch API)* #. **INTSC_HEXRECT** intersection between a 24-sided hexahedron and a rectangular solid, including volume and moment calculations *(single loop, RAJA::forall API)* #. **MASS3DEA** element assembly of a 3D finite element mass matrix *(nested loops, GPU shared memory, RAJA::launch API)* #. **MASS3DPA_ATOMIC** action of a 3D finite element mass matrix on elements with shared DOFs via partial assembly *(nested loops, GPU shared memory, RAJA::launch API)* - #. **MASSVEC3DPA** action of a 3D finite element mass matrix via partial assembly on a block vector *(nested loops, GPU shared memory, RAJA::launch API)* + #. **MASSVEC3DPA** element-wise action of a 3D finite element mass matrix via partial assembly on a block vector *(nested loops, GPU shared memory, RAJA::launch API)* #. **MATVEC_3D_STENCIL** matrix-vector product based on a 3D mesh stencil *(single loop, data access via indirection array, RAJA::forall API)* #. **NODAL_ACCUMULATION_3D** on a 3D structured hexahedral mesh, sum a constribution from each hex vertex (nodal value) to its centroid (zonal value) *(single loop, data access via indirection array, 8-way atomic contention, RAJA::forall API)* #. **VOL3D** on a 3D structured hexahedral mesh (faces are not necessarily planes), compute volume of each zone (hex) *(single loop, data access via indirection array, RAJA::forall API)* -Tier 2 kernels -^^^^^^^^^^^^^^^ +Priority 2 kernels +^^^^^^^^^^^^^^^^^^^ * *Apps* group (directory src/apps) - #. **CONVECTION3DPA** action of a 3D finite element convection operator via partial assembly *(nested loops, GPU shared memory, RAJA::launch API)* + #. **CONVECTION3DPA** element-wise action of a 3D finite element volume convection operator via partial assembly *(nested loops, GPU shared memory, RAJA::launch API)* #. **DEL_DOT_VEC_2D** divergence of a vector field at a set of points on a mesh *(single loop, data access via indirection array, RAJA::forall API)* #. **INTSC_HEXHEX** intersection between two 24-sided hexahedra, including volume and moment calculations *(multiple single-loop operations in sequence, RAJA::forall API)* #. **LTIMES** one step of the source-iteration technique for solving the steady-state linear Boltzmann equation, multi-dimensional matrix product *(nested loops, RAJA::kernel API)* - #. **MASS3DPA** action of a 3D finite element mass matrix on disconnected elements via partial assembly *(nested loops, GPU shared memory, RAJA::launch API)* + #. **MASS3DPA** element-wise action of a 3D finite element mass matrix via partial assembly *(nested loops, GPU shared memory, RAJA::launch API)* * *Basic* group (directory src/basic) From 20b87b30598b3b5b51e4920e7029b307834e4f2d Mon Sep 17 00:00:00 2001 From: Rich Hornung Date: Thu, 15 Jan 2026 15:29:22 -0800 Subject: [PATCH 11/16] Add minimal content to strong/weak scaling sections. 
---
 docs/13_rajaperf/rajaperf.rst | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/docs/13_rajaperf/rajaperf.rst b/docs/13_rajaperf/rajaperf.rst
index 79800c0..96e6c19 100644
--- a/docs/13_rajaperf/rajaperf.rst
+++ b/docs/13_rajaperf/rajaperf.rst
@@ -163,12 +163,16 @@ Memory Usage
 Strong Scaling on El Capitan
 ============================
 
-Please see :ref:`ElCapitanSystemDescription` for El Capitan system description.
+The RAJA Performance Suite is primarily a single-node and compiler assessment
+tool. Thus, strong scaling is not part of the benchmark.
 
 
 Weak Scaling on El Capitan
 ==========================
 
+The RAJA Performance Suite is primarily a single-node and compiler assessment
+tool. Thus, weak scaling is not part of the benchmark.
+
 
 References
 ==========

From b47e400e51d8a2b45b6b5bdff280dcd89692bfa1 Mon Sep 17 00:00:00 2001
From: Rich Hornung 
Date: Fri, 16 Jan 2026 11:50:15 -0800
Subject: [PATCH 12/16] Improve kernel descriptions

---
 docs/13_rajaperf/rajaperf.rst | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/docs/13_rajaperf/rajaperf.rst b/docs/13_rajaperf/rajaperf.rst
index 96e6c19..62cca82 100644
--- a/docs/13_rajaperf/rajaperf.rst
+++ b/docs/13_rajaperf/rajaperf.rst
@@ -89,14 +89,14 @@ Priority 1 kernels
 
  * *Apps* group (directory src/apps)
 
-    #. **DIFFUSION3DPA** element-wise action of a 3D finite element volume diffusion operator via partial assembly *(nested loops, GPU shared memory, RAJA::launch API)*
+    #. **DIFFUSION3DPA** element-wise action of a 3D finite element volume diffusion operator via partial assembly and sum factorization *(nested loops, GPU shared memory, RAJA::launch API)*
    #. **EDGE3D** stiffness matrix assembly for a 3D MHD calculation *(single loop with included function call, RAJA::forall API)*
    #. **ENERGY** internal energy calculation from an explicit hydrodynamics algorithm *(multiple single-loop operations in sequence, conditional logic for correctness checks and cutoffs, RAJA::forall API)*
    #. **FEMSWEEP** finite element implementation of linear sweep algorithm used in radiation transport *(nested loops, RAJA::launch API)*
    #. **INTSC_HEXRECT** intersection between a 24-sided hexahedron and a rectangular solid, including volume and moment calculations *(single loop, RAJA::forall API)*
    #. **MASS3DEA** element assembly of a 3D finite element mass matrix *(nested loops, GPU shared memory, RAJA::launch API)*
-    #. **MASS3DPA_ATOMIC** action of a 3D finite element mass matrix on elements with shared DOFs via partial assembly *(nested loops, GPU shared memory, RAJA::launch API)*
-    #. **MASSVEC3DPA** element-wise action of a 3D finite element mass matrix via partial assembly on a block vector *(nested loops, GPU shared memory, RAJA::launch API)*
+    #. **MASS3DPA_ATOMIC** action of a 3D finite element mass matrix on elements with shared DOFs via partial assembly and sum factorization *(nested loops, GPU shared memory, RAJA::launch API)*
+    #. **MASSVEC3DPA** element-wise action of a 3D finite element mass matrix via partial assembly and sum factorization on a block vector *(nested loops, GPU shared memory, RAJA::launch API)*
    #. **MATVEC_3D_STENCIL** matrix-vector product based on a 3D mesh stencil *(single loop, data access via indirection array, RAJA::forall API)*
    #. **NODAL_ACCUMULATION_3D** on a 3D structured hexahedral mesh, sum a contribution from each hex vertex (nodal value) to its centroid (zonal value) *(single loop, data access via indirection array, 8-way atomic contention, RAJA::forall API)*
    #. 
**VOL3D** on a 3D structured hexahedral mesh (faces are not necessarily planes), compute volume of each zone (hex) *(single loop, data access via indirection array, RAJA::forall API)* @@ -107,11 +107,11 @@ Priority 2 kernels * *Apps* group (directory src/apps) - #. **CONVECTION3DPA** element-wise action of a 3D finite element volume convection operator via partial assembly *(nested loops, GPU shared memory, RAJA::launch API)* + #. **CONVECTION3DPA** element-wise action of a 3D finite element volume convection operator via partial assembly and sum factorization *(nested loops, GPU shared memory, RAJA::launch API)* #. **DEL_DOT_VEC_2D** divergence of a vector field at a set of points on a mesh *(single loop, data access via indirection array, RAJA::forall API)* #. **INTSC_HEXHEX** intersection between two 24-sided hexahedra, including volume and moment calculations *(multiple single-loop operations in sequence, RAJA::forall API)* #. **LTIMES** one step of the source-iteration technique for solving the steady-state linear Boltzmann equation, multi-dimensional matrix product *(nested loops, RAJA::kernel API)* - #. **MASS3DPA** element-wise action of a 3D finite element mass matrix via partial assembly *(nested loops, GPU shared memory, RAJA::launch API)* + #. **MASS3DPA** element-wise action of a 3D finite element mass matrix via partial assembly and sum factorization *(nested loops, GPU shared memory, RAJA::launch API)* * *Basic* group (directory src/basic) From 2841f64ca16874643ac601d400f1687e18206198 Mon Sep 17 00:00:00 2001 From: Rich Hornung Date: Fri, 16 Jan 2026 14:16:17 -0800 Subject: [PATCH 13/16] Fill in more sections --- docs/13_rajaperf/rajaperf.rst | 183 +++++++++++++++++++++++++++++----- 1 file changed, 160 insertions(+), 23 deletions(-) diff --git a/docs/13_rajaperf/rajaperf.rst b/docs/13_rajaperf/rajaperf.rst index 62cca82..99f6dcb 100644 --- a/docs/13_rajaperf/rajaperf.rst +++ b/docs/13_rajaperf/rajaperf.rst @@ -2,7 +2,9 @@ RAJA Performance Suite ********************** -RAJA Performance Suite source code is near-final at this point. The problems to run are not yet finalized. +RAJA Performance Suite source code is near-final at this point. It will be +released soon along with benchmark baseline data and instructions for running +the benchmark and generating evaluation metrics. The RAJA Performance Suite contains a variety of numerical kernels that represent important computational patterns found in HPC applications. It is a @@ -19,7 +21,8 @@ available at: * `RAJA GitHub project `_ .. important:: The RAJA Performance Suite benchmark is limited to a subset of - kernels in the RAJA Performance Suite as described below. + kernels in the RAJA Performance Suite as described in + :ref:`rajaperf_problems-label`. Purpose @@ -41,43 +44,55 @@ ability to optimize. Often, kernels in the Suite serve as collaboration tools enabling the RAJA team to work with vendors to resolve performance issues observed in production applications that use RAJA. -RAJA is a potential *X* in the commonly used *MPI + X* parallel application +To more closely align execution of kernels in the Suite with how they would +run in the context of a full application, benchmark runs must be done using +multiple MPI ranks to ensure that all resources on a compute node are being +exercised and avoid misrepresentation of kernel and node performance. 
RAJA is +a potential *X* in the often referred to *MPI + X* parallel application paradigm, where MPI is used for coarse-grained, distributed memory parallelism -and X (e.g., RAJA) supports fine-grained parallelism within an MPI rank. -The RAJA Performance Suite can be configured with MPI so that execution of -kernels in the Suite is representative of the ways in which numerical kernels -are exercised in an MPI + X HPC applications. When the RAJA Performance Suite -is run using multiple MPI ranks, the same kernel code is executed on each rank. -Synchronization and communication across ranks involves only sending execution -timing information to rank zero. +and X (e.g., RAJA) supports fine-grained parallelism within an MPI rank. The +RAJA Performance Suite can be configured with MPI so that execution of kernels +in the Suite represents how those kernels would be exercised in an MPI + X HPC +application. When the RAJA Performance Suite is run using multiple MPI ranks, +the same kernel code is executed on each rank. Synchronization and +communication across ranks involves only sending execution timing information +to rank zero for reporting purposes. .. important:: For RAJA Performance Suite benchmark execution, MPI must be used - to ensure that all resources on a compute node are being - exercised and avoid misrepresenting node performance. + to run to ensure that all resources on a compute node are being + exercised and avoid misrepresentation of kernel and node + performance. This is described in the instructions provided in + :ref:`raja-perf_run-label`. Characteristics =============== The `RAJA Performance Suite GitHub project `_ -contains the code for all the Suite kernels and all of essential software +contains the code for all the Suite kernels and all essential external software dependencies in Git submodules. Thus, dependency versions are pinned to each version of the Suite. Building the Suite requires an installation of CMake for configuring a build, a C++17 compliant compiler to build the code, and an MPI -library installation, if MPI is to be used. +library installation when MPI is to be used. The Suite can be run in a myriad of ways by specifying parameters and options -in its command-line interface. The intent is that one can build the code and -use scripts to execute a series of Suite runs to generate data for each desired +as command-line arguments. The intent is that one can build the code and +use scripts to execute multiple Suite runs to generate data for a desired performance experiment. +In particular, variants, problem sizes, etc. for the kernels can be set by a +user from the command line. Specific instructions for running the RAJA +Performance Suite benchmark are described in :ref:`raja-perf_run-label`. + + +.. _rajaperf_problems-label: + Problems -------- The RAJA Performance Suite benchmark is limited to a subset of kernels in the full Suite to focus on some of the more important computational patterns found -in LLNL applications. The subset of kernels is listed below, which contains -a brief description of each kernel and its most prominant features. +in LLNL applications. The subset of kernels is described. .. note:: Each kernel contains a complete reference description located in the header file for the kernel object ``.hpp``. The @@ -98,7 +113,7 @@ Priority 1 kernels #. **MASS3DPA_ATOMIC** action of a 3D finite element mass matrix on elements with shared DOFs via partial assembly and sum factorization *(nested loops, GPU shared memory, RAJA::launch API)* #. 
**MASSVEC3DPA** element-wise action of a 3D finite element mass matrix via partial assembly and sum factorization on a block vector *(nested loops, GPU shared memory, RAJA::launch API)* #. **MATVEC_3D_STENCIL** matrix-vector product based on a 3D mesh stencil *(single loop, data access via indirection array, RAJA::forall API)* - #. **NODAL_ACCUMULATION_3D** on a 3D structured hexahedral mesh, sum a constribution from each hex vertex (nodal value) to its centroid (zonal value) *(single loop, data access via indirection array, 8-way atomic contention, RAJA::forall API)* + #. **NODAL_ACCUMULATION_3D** on a 3D structured hexahedral mesh, sum a contribution from each hex vertex (nodal value) to its centroid (zonal value) *(single loop, data access via indirection array, 8-way atomic contention, RAJA::forall API)* #. **VOL3D** on a 3D structured hexahedral mesh (faces are not necessarily planes), compute volume of each zone (hex) *(single loop, data access via indirection array, RAJA::forall API)* @@ -123,42 +138,164 @@ Priority 2 kernels #. **HALO_EXCHANGE_FUSED** packing and unpacking MPI message buffers for point-to-point distributed memory halo data exchange for mesh-based codes *(overhead of launching many small kernels, GPU variants use RAJA::Workgroup concepts to execute multiple kernels with one launch)* - + +.. _rajaperf_fom-label: Figure of Merit --------------- +There are two figures of merit (FOM) for each benchmark kernel: execution time +and memory bandwidth..... + +Describe how to determine problem size and how key output quantities are +computed..... +.. _rajaperf_codemod-label: + Source code modifications ========================= Please see :ref:`GlobalRunRules` for general guidance on allowed modifications. -For the RAJA Performance Suite, we define the following restrictions on source code modifications: +For the RAJA Performance Suite, we define the following restrictions on source +code modifications: + +* While source code changes to the RAJA Performance Suite kernels and to RAJA + can be proposed, RAJA may not be removed from *RAJA kernel variants* in the + Suite or replaced with any other library. The *Base kernel variants* in the + Suite are provided to show how each kernel could be implemented directly + in the corresponding programming model back-end without the RAJA abstraction + layer. Apart from some special cases, the RAJA and Base variants of each + kernel should perform the same computation. -* RAJA Performance Suite uses RAJA as the portability library, available at https://github.com/LLNL/RAJA. While source code changes to RAJA can be proposed, RAJA in RAJA Performance Suite may not be removed or replaced with any other library. +.. _rajaperf_build-label: Building ======== +The RAJA Performance Suite uses a CMake-based system to configure the code for +compilation. As noted earlier, all non-system related software dependencies are +included in the RAJA Performance Suite repository as Git submodules. + +The current RAJA Performance Suite benchmark uses the ``v2025.12.0`` version of +the code. When the git repository is cloned, you will be on the ``develop`` +branch, which is the default RAJA Performance Suite branch. 
To get a local copy
+of this version of the code and the correct versions of submodules::
+
+  $ git clone --recursive https://github.com/LLNL/RAJAPerf.git
+  $ cd RAJAPerf
+  $ git checkout v2025.12.0
+  $ git submodule update --init --recursive
+
+RAJA and the RAJA Performance Suite are built together using the same CMake
+configuration. The basic process for specifying a configuration and generating
+a build space is to create a build directory and run CMake in it with the
+proper options. For example::
+
+  $ pwd
+  path/to/RAJAPerf
+  $ mkdir my-build
+  $ cd my-build
+  $ cmake ..
+  $ make -j (or make -j N to build with a specified number of cores)
+
+For convenience and informational purposes, configuration scripts are maintained
+in the ``RAJAPerf/scripts`` subdirectories for various build configurations.
+For example, the ``RAJAPerf/scripts/lc-builds`` directory contains scripts that
+can be used to generate build configurations for machines in the Livermore
+Computing (LC) Center at Lawrence Livermore National Laboratory. These scripts
+are to be run in the top-level RAJAPerf directory. Each script creates a
+descriptively named build space directory and runs CMake with a configuration
+appropriate for the platform and compiler(s) indicated by the build script
+name. For example, to build the code to generate baseline data on the
+El Capitan system::
+
+  $ pwd
+  path/to/RAJAPerf
+  $ ./scripts/lc-builds/toss4_cray-mpich_amdclang.sh 9.0.1 6.4.3 gfx942
+  $ cd build_lc_toss4-cray-mpich-9.0.1-amdclang-6.4.3-gfx942
+  $ make -j
+
+This will build the code for CPU-GPU execution using the system-installed
+version 9.0.1 of the Cray MPICH MPI library and version 6.4.3 of the AMD
+clang compiler (ROCm version 6.4.3), targeting GPU compute architecture gfx942,
+which is appropriate for the AMD MI300A APU hardware on El Capitan. Please
+consult the build script files in the ``RAJAPerf/scripts/lc-builds`` directory
+for hints on building the code for other architectures and compilers.
+Additional information on build configurations is described in the
+`RAJA Performance Suite User Guide `_ for the version of the code in which you are interested.
+
+
+.. _rajaperf_run-label:
+
 Running
 =======
 
+After the code is built, the executable will be located in the ``bin`` directory
+of the build space. Continuing the El Capitan example above::
+
+  $ pwd
+  path/to/build_lc_toss4-cray-mpich-9.0.1-amdclang-6.4.3-gfx942
+  $ ls bin
+  rajaperf.exe
+
+To get usage information::
+
+  $ path/to/rajaperf.exe --help (or -h)
+
+This command will print all available command-line options along with potential
+arguments and defaults. Options are available to print information about the
+Suite, to select the output directory and file details, to select kernels and
+variants to run, and to control how they are run (problem sizes, number of
+times each kernel is run, data spaces to use for array allocation, etc.). All
+arguments are optional. If no arguments are specified, the Suite will run all
+kernels in their default configurations for the variants available in the
+build configuration.
+
+The script to run the benchmark for generating baselines for El Capitan is
+described in :ref:`rajaperf_results-label`. A similar recipe should be followed
+for benchmarking other systems.
+
+
+.. _rajaperf_validation-label:
+
 Validation
 ==========
 
+Each kernel and variant run generates a checksum value based on kernel execution
+output, such as an output data array computed by the kernel. The checksum
+depends on the problem size run for the kernel; thus, each checksum is
+computed at run time. Validation criteria are defined in terms of the checksum
+difference between each kernel variant and problem size run and a corresponding
+reference variant. Typically, the ``Base_Seq`` variant is used to define the
+reference checksum and so that variant should be run for each kernel as part of
+a performance study. Each kernel is annotated in the source code as to whether
+the checksum for each variant is expected to match the reference checksum
+exactly, or to be within some tolerance due to order of operation differences
+when run in parallel.
+
+Whether the checksum for each kernel is considered to be within its expected
+tolerance is reported as checksum ``PASSED`` or ``FAILED`` in the output files.
 
-Example Scalability Results
+**Show an example of this for the El Capitan baseline runs!!**
+
+.. _rajaperf_results-label:
+
+Example Benchmark Results
 ===========================
 
+**Include tables of El Capitan baseline results**
+
+
+.. _rajaperf_memory-label:
 
 Memory Usage
 ============
 
+**Do we need to say anything here, if we describe how benchmark problem size
+is set in the benchmark results section above???**
+
 
 Strong Scaling on El Capitan
 ============================

From 57e7133daddbdb446ecab9310c521e9c68d31bf7 Mon Sep 17 00:00:00 2001
From: Rich Hornung
Date: Fri, 16 Jan 2026 14:59:05 -0800
Subject: [PATCH 14/16] Change kernel

---
 docs/13_rajaperf/rajaperf.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/13_rajaperf/rajaperf.rst b/docs/13_rajaperf/rajaperf.rst
index 99f6dcb..2ac9c68 100644
--- a/docs/13_rajaperf/rajaperf.rst
+++ b/docs/13_rajaperf/rajaperf.rst
@@ -136,7 +136,7 @@ Priority 2 kernels
 
 * *Comm* group (directory src/comm)
 
-  #. **HALO_EXCHANGE_FUSED** packing and unpacking MPI message buffers for point-to-point distributed memory halo data exchange for mesh-based codes *(overhead of launching many small kernels, GPU variants use RAJA::Workgroup concepts to execute multiple kernels with one launch)*
+  #. **HALO_PACKING_FUSED** packing and unpacking MPI message buffers for point-to-point distributed memory halo data exchange for mesh-based codes *(overhead of launching many small kernels, GPU variants use RAJA::Workgroup concepts to execute multiple kernels with one launch)*
 
 .. _rajaperf_fom-label:

From d4cef8ba106e3d21c5e49a11c0d81baa4c2c9bc4 Mon Sep 17 00:00:00 2001
From: Rich Hornung
Date: Fri, 16 Jan 2026 15:17:00 -0800
Subject: [PATCH 15/16] address review comments

---
 docs/13_rajaperf/rajaperf.rst | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/docs/13_rajaperf/rajaperf.rst b/docs/13_rajaperf/rajaperf.rst
index 2ac9c68..e4587ba 100644
--- a/docs/13_rajaperf/rajaperf.rst
+++ b/docs/13_rajaperf/rajaperf.rst
@@ -268,7 +268,7 @@ output, such as an output data array computed by the kernel. The checksum
 depends on the problem size run for the kernel; thus, each checksum is
 computed at run time. Validation criteria are defined in terms of the checksum
 difference between each kernel variant and problem size run and a corresponding
-reference variant. Typically, the ``Base_Seq`` variant is used to define the
+reference variant. The ``Base_Seq`` variant is used to define the
 reference checksum and so that variant should be run for each kernel as part of
 a performance study. Each kernel is annotated in the source code as to whether
 the checksum for each variant is expected to match the reference checksum
 exactly, or to be within some tolerance due to order of operation differences
 when run in parallel.

@@ -280,6 +280,8 @@ tolerance is reported as checksum ``PASSED`` or ``FAILED`` in the output files.
 
 **Show an example of this for the El Capitan baseline runs!!**
+**Reminder: add more accurate Base_Seq summation tunings (left fold is inaccurate for large problem sizes).**
+
 
 .. _rajaperf_results-label:
 
 Example Benchmark Results
 ===========================

From df88b9345c67be944e53c5ccb49845fbadfb81be Mon Sep 17 00:00:00 2001
From: Rich Hornung
Date: Fri, 16 Jan 2026 15:49:32 -0800
Subject: [PATCH 16/16] Fix some links

---
 docs/13_rajaperf/rajaperf.rst | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/docs/13_rajaperf/rajaperf.rst b/docs/13_rajaperf/rajaperf.rst
index e4587ba..3d3c15b 100644
--- a/docs/13_rajaperf/rajaperf.rst
+++ b/docs/13_rajaperf/rajaperf.rst
@@ -62,7 +62,7 @@ to rank zero for reporting purposes.
                to run to ensure that all resources on a compute node are being
                exercised and avoid misrepresentation of kernel and node
                performance. This is described in the instructions provided in
-               :ref:`raja-perf_run-label`.
+               :ref:`rajaperf_run-label`.
 
 
 Characteristics
 ===============
@@ -82,7 +82,7 @@ performance experiment.
 
 In particular, variants, problem sizes, etc. for the kernels can be set by a
 user from the command line. Specific instructions for running the RAJA
-Performance Suite benchmark are described in :ref:`raja-perf_run-label`.
+Performance Suite benchmark are described in :ref:`rajaperf_run-label`.
 
 
 .. _rajaperf_problems-label:
 
@@ -145,10 +145,9 @@ Figure of Merit
 ---------------
 
 There are two figures of merit (FOM) for each benchmark kernel: execution time
-and memory bandwidth.....
+and memory bandwidth..... **fill this in**
 
-Describe how to determine problem size and how key output quantities are
-computed.....
+**Describe how to set problem size based on architecture and how key output quantities are computed.....**
 
@@ -166,7 +165,7 @@
   Suite or replaced with any other library. The *Base kernel variants* in the
   Suite are provided to show how each kernel could be implemented directly
   in the corresponding programming model back-end without the RAJA abstraction
-  layer. Apart from some special cases, the RAJA and Base variants of each
+  layer. Apart from some special cases, the RAJA and Base variants for each
   kernel should perform the same computation.
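
To make the intended correspondence concrete, the following is a minimal
sketch of a Base variant and a RAJA variant of the same simple loop. It is
not a kernel from the Suite, and the function, array, and variable names are
illustrative only. Both functions perform the same DAXPY-style computation;
the Base variant writes the loop directly in OpenMP, while the RAJA variant
expresses it through the ``RAJA::forall`` API with an OpenMP execution
policy::

   #include "RAJA/RAJA.hpp"

   // Base_OpenMP-style variant: the loop is written directly in the
   // back-end programming model (OpenMP) with no RAJA abstraction layer.
   void daxpy_base_omp(double* y, const double* x, double a, int N)
   {
     #pragma omp parallel for
     for (int i = 0; i < N; ++i) {
       y[i] += a * x[i];
     }
   }

   // RAJA_OpenMP-style variant: the same computation expressed with
   // RAJA::forall and an OpenMP execution policy.
   void daxpy_raja_omp(double* y, const double* x, double a, int N)
   {
     RAJA::forall<RAJA::omp_parallel_for_exec>(
       RAJA::RangeSegment(0, N), [=](RAJA::Index_type i) {
         y[i] += a * x[i];
       });
   }

Proposed source code modifications should preserve this kind of
correspondence so that timing comparisons between the Base and RAJA variants
of a kernel remain meaningful.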