Aule Attention is a cross-platform implementation of FlashAttention-2 that uses Triton on NVIDIA and ROCm/Linux, and Vulkan on other platforms. It provides a PyTorch SDPA compatibility layer, which should facilitate Diffusers integration without major changes.
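To make the SDPA compatibility claim concrete, the sketch below shows the shape and semantics contract of `torch.nn.functional.scaled_dot_product_attention` that such a layer would need to match, written as a NumPy reference so it is self-contained. The function name `sdpa_reference` is hypothetical and is not part of Aule Attention's actual API.

```python
import numpy as np

def sdpa_reference(q, k, v, scale=None):
    """Reference scaled dot-product attention following the shape
    contract of torch.nn.functional.scaled_dot_product_attention:
    q, k, v have shape (..., seq_len, head_dim)."""
    head_dim = q.shape[-1]
    if scale is None:
        scale = 1.0 / np.sqrt(head_dim)
    # attention scores over the key dimension
    scores = (q @ np.swapaxes(k, -1, -2)) * scale
    # numerically stable softmax
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

A compatible backend like Aule Attention computes the same result with a fused FlashAttention-2 kernel, so callers (e.g. Diffusers) see identical inputs and outputs while avoiding materializing the full attention matrix.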
It may be possible to use native FA2 kernels on NVIDIA devices by integrating the new kernels library, keeping Aule Attention as the fallback for other GPU types and for offline use.