Optimize GPU code #3

@bentsherman

The matrix library has basic GPU support: it can use CUBLAS and CUSOLVER in place of CBLAS and LAPACKE when a GPU is available. These libraries alone, however, are not enough to obtain maximum GPU performance. A few operations still have no BLAS / LAPACKE implementation, so unless they can be expressed that way, the matrix library has to keep host and device memory in sync at all times, which hinders performance. Two changes should maximize performance:

  • Implement custom CUDA kernels for the operations that cannot be done entirely with BLAS / LAPACKE. All internal memory transfers can then be removed from the matrix library and delegated to the user, who knows when data actually needs to be printed, saved, etc.
  • Use CUDA streams for CUBLAS, CUSOLVER, and all memory transfers and custom kernels. Once every intermediate operation runs on the GPU, streams should let transfers and compute overlap and increase GPU utilization.
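A minimal sketch of both ideas together, assuming a float vector already resident in device memory. The names (`elementwise_sqrt`, `scale_then_sqrt`, `d_x`) are illustrative, not from the matrix library; the point is that the CUBLAS call and the custom kernel are queued on the same stream, so no intermediate host/device transfer or synchronization is needed:

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Custom kernel for an element-wise op BLAS does not provide.
__global__ void elementwise_sqrt(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] = sqrtf(x[i]);
}

// Hypothetical fused operation: d_x = sqrt(alpha * d_x), entirely on the GPU.
void scale_then_sqrt(cublasHandle_t handle, cudaStream_t stream,
                     float *d_x, int n, float alpha)
{
    // Queue the CUBLAS call on the same stream as the custom kernel,
    // so both execute in order without a host-side sync between them.
    cublasSetStream(handle, stream);
    cublasSscal(handle, n, &alpha, d_x, 1);   // d_x = alpha * d_x

    int block = 256;
    int grid = (n + block - 1) / block;
    elementwise_sqrt<<<grid, block, 0, stream>>>(d_x, n);

    // No synchronization here: the caller decides when results are
    // needed on the host (e.g. cudaStreamSynchronize(stream) before
    // printing or saving).
}
```

With this pattern, the library never copies intermediate results back to the host; synchronization happens only at the boundaries the user chooses, which is exactly the delegation described above.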
