Description
Currently, laddu automatically parallelizes over as many cores as desired...but only on a single CPU. In high-performance computing, clusters of CPU nodes and GPUs are often used to speed up analyses. Some of my preliminary benchmarks show laddu running about 25% slower than AmpTools on a single core, so there may also be work needed to improve cache locality and general performance. However, using an MPI (Message Passing Interface) library would at least put laddu on the same playing field as AmpTools.
I would recommend using rsmpi for the MPI bindings. On the GPU side, a useful resource is here, although I am unsure which crate I'd recommend. The design goal would be to require as few modifications to existing amplitude code as possible for MPI support, and only minimal changes for GPU support.
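To illustrate the data-parallel pattern MPI would enable, here is a minimal sketch in plain Rust (no actual MPI calls, so it stays self-contained): the dataset of events is split into chunks, a partial negative log-likelihood is computed per chunk, and the partials are summed. With rsmpi, each chunk would live on its own rank and the final sum would come from an all-reduce; `partial_nll` and the toy per-event term are hypothetical stand-ins, not laddu's actual API.

```rust
/// Hypothetical per-chunk likelihood evaluation: sum of -ln(intensity)
/// over the events owned by one "rank". In a real MPI setup this would
/// run once per process on its local slice of the dataset.
fn partial_nll(events: &[f64]) -> f64 {
    events.iter().map(|&x| -x.ln()).sum()
}

fn main() {
    // Toy dataset of event intensities.
    let events: Vec<f64> = (1..=8).map(|i| i as f64).collect();
    let n_ranks = 4;

    // Partition events into contiguous chunks, one per rank.
    let chunk_size = (events.len() + n_ranks - 1) / n_ranks;
    let partials: Vec<f64> = events.chunks(chunk_size).map(partial_nll).collect();

    // Stand-in for an MPI all-reduce (sum) over the partial results.
    let total: f64 = partials.iter().sum();

    // The distributed result must match the serial evaluation, which is
    // what lets the minimizer stay unaware of how the data is split.
    let serial = partial_nll(&events);
    assert!((total - serial).abs() < 1e-9);
    println!("total NLL = {total:.6}");
}
```

The key design point this illustrates is that only the data loading and the final reduction need to know about MPI; the amplitude evaluation itself is unchanged, which is what would keep modifications to existing amplitude code small.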