Hi @jwyang, in your paper you said that:
"since it's not making use of the highly optimized matrix multiplication libraries in CUDA, its speed is still slow in practice. The implementation using the customized CUDA kernel is about 20% faster than the full attention in the same setting, while achieving the theoretical memory complexity. The sliding-chunk approach is the fastest, which is 60% faster than the full attention with a cost of consuming 20% more memory than the theoretical complexity."
Therefore, your code only contains the implementation of the sliding-chunk approach. However, have you ever tried to implement or generate optimized CUDA kernels targeting the newer architectures (sm75+)? In my opinion, introducing tensor-core instructions and highly optimized GEMM libraries (cuBLAS, etc.) could improve the performance of Longformer, along the lines of the sketch below.
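To make the idea concrete, here is a rough, simplified PyTorch sketch (not your actual sliding-chunk algorithm): each chunk attends to itself and the preceding chunk, and the score computation is expressed purely as batched matmuls, which cuBLAS lowers to Tensor Core GEMM kernels when the inputs are fp16/bf16 on sm70+/sm75+ GPUs. The function name, window size, zero-padding of the first chunk, and the omission of masking are all my own assumptions for illustration.

```python
import torch

def chunked_local_scores(q, k, w):
    # Hypothetical, simplified sliding-chunk sketch (not the paper's exact
    # algorithm): split the sequence into non-overlapping chunks of length w
    # and let each chunk attend to itself and the preceding chunk, so every
    # query sees at least w previous tokens. All the work is batched matmul,
    # which cuBLAS runs on Tensor Cores when the inputs are fp16/bf16.
    bh, n, d = q.shape                  # (batch*heads, seq_len, head_dim)
    assert n % w == 0
    c = n // w                          # number of chunks
    q = q.view(bh, c, w, d)
    k = k.view(bh, c, w, d)
    # keys for chunk i = [chunk i-1, chunk i]; the first chunk is zero-padded
    k_prev = torch.cat([torch.zeros_like(k[:, :1]), k[:, :-1]], dim=1)
    k_win = torch.cat([k_prev, k], dim=2)                     # (bh, c, 2w, d)
    scores = torch.matmul(q, k_win.transpose(-1, -2)) * d ** -0.5
    # masking of the zero-padded / out-of-window positions is omitted here
    return scores.view(bh, n, 2 * w)                          # (bh, seq_len, 2w)

# fp16 inputs on an sm70+ GPU make these matmuls hit Tensor Core GEMM kernels
q = torch.randn(12, 4096, 64, device="cuda", dtype=torch.float16)
k = torch.randn(12, 4096, 64, device="cuda", dtype=torch.float16)
scores = chunked_local_scores(q, k, w=256)
```

I would guess a dedicated kernel (e.g. built on WMMA/CUTLASS) could get the same Tensor Core throughput without the extra memory the chunking introduces, but that is just my speculation. Have you benchmarked anything in this direction?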