Hi @jwyang, in your paper you said that:
"since it's not making use of the highly optimized matrix multiplication libraries in CUDA, its speed is still slow in practice. The implementation using the customized CUDA kernel is about 20% faster than the full attention in the same setting, while achieving the theoretical memory complexity. The sliding-chunk approach is the fastest, which is 60% faster than the full attention with a cost of consuming 20% more memory than the theoretical complexity."
Therefore, your code only contains the implementation of the sliding-chunk approach. However, have you ever tried to implement or generate optimized CUDA kernels targeting the newer architectures (sm75+)? In my opinion, introducing tensor-core instructions and highly optimized GEMM libraries (cuBLAS, etc.) could improve the performance of Longformer, along the lines of the sketch below.
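To make the idea concrete, here is a rough, simplified PyTorch sketch (not your actual sliding-chunk algorithm): each chunk attends to itself and the preceding chunk, and the score computation is expressed purely as batched matmuls, which cuBLAS lowers to Tensor Core GEMM kernels when the inputs are fp16/bf16 on sm70+/sm75+ GPUs. The function name, window size, zero-padding of the first chunk, and the omission of masking are all my own assumptions for illustration.

```python
import torch

def chunked_local_scores(q, k, w):
    # Hypothetical, simplified sliding-chunk sketch (not the paper's exact
    # algorithm): split the sequence into non-overlapping chunks of length w
    # and let each chunk attend to itself and the preceding chunk, so every
    # query sees at least w previous tokens. All the work is batched matmul,
    # which cuBLAS runs on Tensor Cores when the inputs are fp16/bf16.
    bh, n, d = q.shape                  # (batch*heads, seq_len, head_dim)
    assert n % w == 0
    c = n // w                          # number of chunks
    q = q.view(bh, c, w, d)
    k = k.view(bh, c, w, d)
    # keys for chunk i = [chunk i-1, chunk i]; the first chunk is zero-padded
    k_prev = torch.cat([torch.zeros_like(k[:, :1]), k[:, :-1]], dim=1)
    k_win = torch.cat([k_prev, k], dim=2)                     # (bh, c, 2w, d)
    scores = torch.matmul(q, k_win.transpose(-1, -2)) * d ** -0.5
    # masking of the zero-padded / out-of-window positions is omitted here
    return scores.view(bh, n, 2 * w)                          # (bh, seq_len, 2w)

# fp16 inputs on an sm70+ GPU make these matmuls hit Tensor Core GEMM kernels
q = torch.randn(12, 4096, 64, device="cuda", dtype=torch.float16)
k = torch.randn(12, 4096, 64, device="cuda", dtype=torch.float16)
scores = chunked_local_scores(q, k, w=256)
```

I would guess a dedicated kernel (e.g. built on WMMA/CUTLASS) could get the same Tensor Core throughput without the extra memory the chunking introduces, but that is just my speculation. Have you benchmarked anything in this direction?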