
Would you release your implementations of CUDA optimized kernels using TVM?  #2


Description

@TengFeiHan0

Hi @jwyang, in your paper you wrote:

since it’s not making use of the highly optimized matrix multiplication libraries in CUDA, its speed is still slow in practice.

The implementation using the customized CUDA kernel is about 20% faster than the full attention in the same setting, while achieving the theoretical memory complexity. The sliding-chunk approach is the fastest, which is 60% faster than the full attention with a cost of consuming 20% more memory than the theoretical complexity.

Therefore, your code only contains the implementation of the sliding-chunk approach. However, have you ever tried to implement or generate optimized CUDA kernels for the newer architectures (sm75+), for example with TVM? In my opinion, introducing tensor-core instructions and highly optimized GEMM libraries (cuBLAS, etc.) could improve the performance of the longformer attention.
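
For concreteness, below is a minimal CUDA sketch (not your kernel; all names and sizes are illustrative assumptions) of the banded QK^T score computation that a custom sliding-window kernel has to perform. The question is essentially whether replacing the scalar inner-product loop here with WMMA/MMA tensor-core instructions, or batching the band into cuBLAS GEMM calls (hand-written or generated with TVM), would close the speed gap while keeping the theoretical memory complexity:

```cuda
// A minimal sketch, not the authors' kernel: banded (sliding-window) QK^T
// scores, one thread per (query position, window offset) pair. All names
// and dimensions below are illustrative assumptions. A tensor-core (WMMA,
// sm75+) or cuBLAS/TVM-generated variant would target the same banded layout.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void sliding_window_scores(const float* __restrict__ q,   // [seq_len, head_dim]
                                      const float* __restrict__ k,   // [seq_len, head_dim]
                                      float* __restrict__ scores,    // [seq_len, 2*window+1]
                                      int seq_len, int head_dim, int window) {
    int i   = blockIdx.x * blockDim.x + threadIdx.x;  // query position
    int off = blockIdx.y;                             // window offset in [0, 2*window]
    if (i >= seq_len) return;

    int j = i + off - window;                         // key position for this offset
    float s = 0.f;
    if (j >= 0 && j < seq_len) {                      // offsets outside the sequence stay 0
        for (int d = 0; d < head_dim; ++d)
            s += q[i * head_dim + d] * k[j * head_dim + d];
    }
    scores[i * (2 * window + 1) + off] = s;           // banded O(n*w) layout, not O(n^2)
}

int main() {
    const int seq_len = 1024, head_dim = 64, window = 128;   // assumed sizes
    float *q, *k, *s;
    cudaMalloc(&q, seq_len * head_dim * sizeof(float));
    cudaMalloc(&k, seq_len * head_dim * sizeof(float));
    cudaMalloc(&s, seq_len * (2 * window + 1) * sizeof(float));
    cudaMemset(q, 0, seq_len * head_dim * sizeof(float));
    cudaMemset(k, 0, seq_len * head_dim * sizeof(float));

    dim3 block(128);
    dim3 grid((seq_len + block.x - 1) / block.x, 2 * window + 1);
    sliding_window_scores<<<grid, block>>>(q, k, s, seq_len, head_dim, window);
    cudaDeviceSynchronize();
    printf("computed %d x %d banded scores\n", seq_len, 2 * window + 1);

    cudaFree(q); cudaFree(k); cudaFree(s);
    return 0;
}
```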
