Although algorithm (static) class templates should not care about where computation is performed (CPU or GPU), I think a few design choices motivate parameterizing the algorithm itself rather than the matrix class. There are still reasons for parameterizing the matrix class instead. One is that polymorphic data containers (Kokkos, for example) already do this, and such data types should be able to plug into distributed-memory algorithms without any pain.
Think about the pros and cons here.
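To make the trade-off concrete, here is a minimal sketch of the two options. Every name in it (`HostSpace`, `DeviceSpace`, `Matrix`, `AlgorithmA`, `AlgorithmB`) is hypothetical and not taken from the code base, and the "device" path is only a label here, with no real GPU code behind it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical execution-space tags (illustrative only).
struct HostSpace   { static std::string name() { return "host"; } };
struct DeviceSpace { static std::string name() { return "device"; } };

// Option A: the container carries the execution space (Kokkos-style).
template <typename T, typename Space = HostSpace>
struct Matrix {
  using space_type = Space;
  std::size_t rows, cols;
  std::vector<T> data;  // stand-in storage; a device-backed matrix would own GPU memory
  Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c) {}
};

// The algorithm has no offload parameter; it reads the space from the matrix type.
struct AlgorithmA {
  template <typename MatrixType>
  static std::string invoke(const MatrixType&) {
    return MatrixType::space_type::name();
  }
};

// Option B: the algorithm carries the policy; the matrix type is ignored
// when deciding where to compute.
template <typename OffloadPolicy>
struct AlgorithmB {
  template <typename MatrixType>
  static std::string invoke(const MatrixType&) {
    return OffloadPolicy::name();
  }
};
```

With option A, `AlgorithmA::invoke(Matrix<double, DeviceSpace>(2, 2))` reports "device" purely from the argument type. With option B, `AlgorithmB<DeviceSpace>::invoke(...)` reports "device" even for a host matrix, so the choice lives at the call site of the algorithm.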
Three policy classes for offloading (just gemm for now):
- NoOffload (default).
- OffloadKeepDataResident: keep data on the GPU as much as possible. Communicating GPU-resident data is not a problem, but remember that it must still pass through the PCI bus, so exploit pinned memory via a buffer that is allocated once at the beginning of the program and reused.
- OffloadTransferData: make no attempt to keep data resident on GPUs. Offload for each gemm invocation. Mainly a sanity-check policy class.
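One possible shape for these three policies is sketched below. Only the policy names come from this issue. The `gemm` signature, the `reference_gemm` helper, and the `Multiply` algorithm are assumptions made for illustration. Host-side copies stand in for real host/device transfers, and a naive loop stands in for the BLAS or cuBLAS call.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Naive column-major C = alpha*A*B + beta*C (A is m x k, B is k x n).
// Stand-in for a real BLAS or cuBLAS gemm call.
inline void reference_gemm(std::size_t m, std::size_t n, std::size_t k,
                           double alpha, const double* A, const double* B,
                           double beta, double* C) {
  for (std::size_t j = 0; j < n; ++j) {
    for (std::size_t i = 0; i < m; ++i) {
      double sum = 0.0;
      for (std::size_t p = 0; p < k; ++p) sum += A[i + p * m] * B[p + j * k];
      C[i + j * m] = alpha * sum + beta * C[i + j * m];
    }
  }
}

// Default policy: compute on the host, no transfers.
struct NoOffload {
  static void gemm(std::size_t m, std::size_t n, std::size_t k, double alpha,
                   const double* A, const double* B, double beta, double* C) {
    reference_gemm(m, n, k, alpha, A, B, beta, C);
  }
};

// Sanity-check policy: copy operands in and the result out on every call.
struct OffloadTransferData {
  static void gemm(std::size_t m, std::size_t n, std::size_t k, double alpha,
                   const double* A, const double* B, double beta, double* C) {
    // Simulated host-to-device copies (assumption: real code would copy to GPU memory).
    std::vector<double> dA(A, A + m * k), dB(B, B + k * n), dC(C, C + m * n);
    reference_gemm(m, n, k, alpha, dA.data(), dB.data(), beta, dC.data());
    std::copy(dC.begin(), dC.end(), C);  // simulated device-to-host copy
  }
};

// Resident policy: assumes the pointers already refer to device-resident data,
// so no transfers are issued. In this CPU-only sketch it matches NoOffload.
struct OffloadKeepDataResident {
  static void gemm(std::size_t m, std::size_t n, std::size_t k, double alpha,
                   const double* A, const double* B, double beta, double* C) {
    reference_gemm(m, n, k, alpha, A, B, beta, C);
  }
};

// Hypothetical algorithm class template parameterized on the offload policy.
template <typename OffloadPolicy = NoOffload>
struct Multiply {
  static void invoke(std::size_t m, std::size_t n, std::size_t k,
                     const double* A, const double* B, double* C) {
    OffloadPolicy::gemm(m, n, k, 1.0, A, B, 0.0, C);
  }
};
```

Under this layout, switching a benchmark from `Multiply<>` to `Multiply<OffloadTransferData>` changes only the template argument, and all three policies should produce identical results, which is what makes `OffloadTransferData` usable as a sanity check.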
- Don't forget to incorporate into validate class templates as well.
- Modify all test.cpp files to initially allocate memory on device.
- Modify all Makefiles to use the nvcc compiler and corresponding flags. Note that anything compiled with nvcc must be separate from anything compiled with the MPI compiler wrapper.
- Update blas directory to allow cuBLAS headers.
- Update bench/ files to include Offload policy class.
- Replace any syrk or trmm calls with gemm? Will that interfere with algorithm-specific policies (via non-orthogonal policy classes)?
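On the last item, a small sketch of what replacing syrk with gemm costs. A syrk-style update computes C = A*A^T and writes only one triangle, while the gemm replacement computes the full product, roughly twice the arithmetic. The helper names below (`gemm_nt`, `syrk_lower`, `syrk_via_gemm`) are hypothetical, and naive loops stand in for the BLAS calls.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Column-major C = A * B^T, with A of size m x k and B of size n x k.
// Stand-in for gemm called with the second operand transposed.
inline void gemm_nt(std::size_t m, std::size_t n, std::size_t k,
                    const double* A, const double* B, double* C) {
  for (std::size_t j = 0; j < n; ++j) {
    for (std::size_t i = 0; i < m; ++i) {
      double sum = 0.0;
      for (std::size_t p = 0; p < k; ++p) sum += A[i + p * m] * B[j + p * n];
      C[i + j * m] = sum;
    }
  }
}

// syrk-style update: C = A * A^T for A of size n x k, writing only the
// lower triangle (i >= j). The strict upper triangle of C is left untouched.
inline void syrk_lower(std::size_t n, std::size_t k, const double* A, double* C) {
  for (std::size_t j = 0; j < n; ++j) {
    for (std::size_t i = j; i < n; ++i) {
      double sum = 0.0;
      for (std::size_t p = 0; p < k; ++p) sum += A[i + p * n] * A[j + p * n];
      C[i + j * n] = sum;
    }
  }
}

// gemm-based replacement: passes A as both operands and fills all of C.
inline void syrk_via_gemm(std::size_t n, std::size_t k, const double* A, double* C) {
  gemm_nt(n, n, k, A, A, C);
}
```

The two routines agree on the lower triangle, so the replacement is safe as long as no caller relies on the upper triangle of C being preserved. Whether a single gemm-only offload policy is worth the extra arithmetic is the open question above.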