Replication of the nekRS microbenchmark "weightedInnerProdBenchmark" using CUDA.
The benchmark performs the following steps:
- Weighted dot product kernel that reduces three input arrays to one partial sum per thread block
- Host-side loop that reduces the Nblocks-sized array of partial sums to a scalar value
- MPI_Allreduce to sum the scalar values across all ranks
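
The three steps above can be sketched as follows. This is a minimal illustration, not the benchmark's actual source. It assumes the weighted dot product is sum(w[i] * x[i] * y[i]) over double-precision arrays. The names `weightedInnerProdKernel`, `weightedInnerProdLocal`, and `BLOCK_SIZE`, the block size of 256, and the test data are hypothetical. CUDA error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <mpi.h>

constexpr int BLOCK_SIZE = 256;  // assumed block size; must be a power of two

// Step 1: each thread block reduces its share of w[i]*x[i]*y[i]
// into a single entry of partial[] (one entry per block).
__global__ void weightedInnerProdKernel(int n, const double* w, const double* x,
                                        const double* y, double* partial)
{
  __shared__ double s[BLOCK_SIZE];

  // Grid-stride loop so any n is covered regardless of the grid size.
  double sum = 0.0;
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += gridDim.x * blockDim.x)
    sum += w[i] * x[i] * y[i];

  s[threadIdx.x] = sum;
  __syncthreads();

  // Tree reduction in shared memory.
  for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
    if (threadIdx.x < stride) s[threadIdx.x] += s[threadIdx.x + stride];
    __syncthreads();
  }

  if (threadIdx.x == 0) partial[blockIdx.x] = s[0];
}

// Steps 1 and 2 for one rank: launch the kernel, then sum the
// Nblocks partial results in a host loop.
double weightedInnerProdLocal(const std::vector<double>& w,
                              const std::vector<double>& x,
                              const std::vector<double>& y)
{
  const int n = static_cast<int>(w.size());
  if (n == 0) return 0.0;
  const int Nblocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
  const size_t bytes = n * sizeof(double);

  double *d_w, *d_x, *d_y, *d_partial;
  cudaMalloc(&d_w, bytes);
  cudaMalloc(&d_x, bytes);
  cudaMalloc(&d_y, bytes);
  cudaMalloc(&d_partial, Nblocks * sizeof(double));
  cudaMemcpy(d_w, w.data(), bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(d_x, x.data(), bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y.data(), bytes, cudaMemcpyHostToDevice);

  weightedInnerProdKernel<<<Nblocks, BLOCK_SIZE>>>(n, d_w, d_x, d_y, d_partial);

  // The blocking copy also waits for the kernel to finish.
  std::vector<double> partial(Nblocks);
  cudaMemcpy(partial.data(), d_partial, Nblocks * sizeof(double),
             cudaMemcpyDeviceToHost);

  // Step 2: host loop over the Nblocks-sized array.
  double local = 0.0;
  for (int b = 0; b < Nblocks; ++b) local += partial[b];

  cudaFree(d_w);
  cudaFree(d_x);
  cudaFree(d_y);
  cudaFree(d_partial);
  return local;
}

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // Hypothetical test data: every product w*x*y equals 1.0.
  const int n = 1 << 20;
  std::vector<double> w(n, 0.5), x(n, 2.0), y(n, 1.0);

  double result = weightedInnerProdLocal(w, x, y);

  // Step 3: sum the per-rank scalars across all ranks.
  MPI_Allreduce(MPI_IN_PLACE, &result, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

  if (rank == 0) printf("weighted inner product = %.1f\n", result);

  MPI_Finalize();
  return 0;
}
```

With this test data each rank contributes a local sum equal to n, so the global result should equal n times the number of ranks. Building the sketch requires nvcc together with the include and link flags of the local MPI installation, which vary between systems.
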
Information on how to run the benchmark is included in the Slurm script submit.slm.