flash-attention-v2-RDNA3-minimal

A simple Flash Attention v2 implementation for ROCm (RDNA3 GPUs, rocWMMA), mainly used for Stable Diffusion (ComfyUI) in Windows ZLUDA environments.
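For reference, the algorithm the kernels implement is the tiled online-softmax forward pass of Flash Attention v2. A minimal pure-PyTorch sketch (illustrative only: `fa2_forward_reference` and the tile size of 32 are assumptions, and the real kernels run this per tile with rocWMMA matrix ops):

```python
import torch

def fa2_forward_reference(q, k, v, block=32):
    """Tiled online-softmax attention forward pass; q, k, v: [B, H, N, D]."""
    scale = q.shape[-1] ** -0.5
    o = torch.zeros(q.shape, dtype=torch.float32, device=q.device)  # output accumulator
    m = torch.full(q.shape[:-1], float("-inf"), device=q.device)    # running row max
    l = torch.zeros(q.shape[:-1], device=q.device)                  # running row sum
    for j in range(0, k.shape[-2], block):
        kj = k[..., j:j + block, :].float()
        vj = v[..., j:j + block, :].float()
        s = (q.float() @ kj.transpose(-2, -1)) * scale  # scores for this KV tile
        m_new = torch.maximum(m, s.amax(-1))            # updated running max
        p = torch.exp(s - m_new[..., None])             # unnormalized tile probabilities
        alpha = torch.exp(m - m_new)                    # rescale factor for old stats
        l = l * alpha + p.sum(-1)
        o = o * alpha[..., None] + p @ vj
        m = m_new
    return (o / l[..., None]).to(q.dtype)               # normalize once at the end
```

On small shapes this matches `torch.nn.functional.scaled_dot_product_attention(q, k, v)` up to floating-point rounding.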

Build and Test

Minimal integration:

```
└─ rocwmma_fattn
   │  FlashAttn.py
   │  host.cpp
   │  kernel_bf16.cu
   │  kernel_fp16.cu
   └─ zluda_hijack_torch_hip_ext.py
```
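A hypothetical sketch of how these files might be wired together with PyTorch's JIT extension loader (the actual loading logic lives in FlashAttn.py, and zluda_hijack_torch_hip_ext.py presumably patches torch's HIP extension build for ZLUDA; the names below are assumptions):

```python
import zluda_hijack_torch_hip_ext  # Windows/ZLUDA only; assumed to patch torch's extension build
from torch.utils.cpp_extension import load

# JIT-compile the host code and kernels into a loadable module
rocwmma_fattn = load(
    name="rocwmma_fattn",
    sources=["host.cpp", "kernel_fp16.cu", "kernel_bf16.cu"],
    verbose=True,
)
```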

Linux with ROCm

Run the test: `python bench_with_sdpa.py`
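The benchmark compares against PyTorch's built-in SDPA; the shape of such a check is roughly as follows (a sketch, not the script's actual contents; `rocwmma_fattn.forward` is a hypothetical entry point):

```python
import torch
import torch.nn.functional as F

B, H, N, D = 1, 8, 4096, 64
q, k, v = (torch.randn(B, H, N, D, dtype=torch.float16, device="cuda")
           for _ in range(3))

ref = F.scaled_dot_product_attention(q, k, v)      # PyTorch baseline
# out = rocwmma_fattn.forward(q, k, v)             # hypothetical extension call
# print((out.float() - ref.float()).abs().max())   # max absolute error
```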

Windows with ZLUDA

Requires the MSVC compiler, the AMD HIP SDK, and the rocWMMA library.

Install the rocWMMA library: https://github.com/ROCm/rocWMMA

Clone it and copy `library/include/rocwmma` into the `include` folder of the HIP SDK installation path.

In cmd.exe, run `vcvars64.bat` to activate the MSVC environment, then run `zluda -- python bench_with_sdpa.py`.

Pre-built Extension

Tested and working with PyTorch 2.2.1 + cu118 on Windows ZLUDA, gfx1100 GPU.

ComfyUI: https://github.com/Repeerc/ComfyUI-flash-attention-rdna3-win-zluda

WebUI: https://github.com/Repeerc/sd-webui-flash-attention-zluda-win

To do

  • backward pass
  • causal mask (needs more optimization)
  • padding optimization for sequence lengths not aligned to 32 (see the sketch after this list)
  • load tiles into LDS
  • attention bias
  • matrix multiplication optimization
  • fix poor BF16 performance
  • ...
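For the unaligned-seqlen item above, a common host-side workaround is to zero-pad the sequence dimension up to the next multiple of 32 and mask the padded key positions out of the softmax. A minimal sketch, using PyTorch's SDPA to stand in for the kernel (`attn_padded` is illustrative, not this repo's API):

```python
import torch
import torch.nn.functional as F

def attn_padded(q, k, v, block=32):
    """q, k, v: [B, H, N, D] with N not necessarily a multiple of `block`."""
    n = q.shape[-2]
    pad = (-n) % block
    if pad == 0:
        return F.scaled_dot_product_attention(q, k, v)
    # zero-pad the sequence dimension up to the next multiple of `block`
    q, k, v = (F.pad(t, (0, 0, 0, pad)) for t in (q, k, v))
    # boolean mask: padded key positions must not contribute to the softmax
    mask = torch.zeros(n + pad, n + pad, dtype=torch.bool, device=q.device)
    mask[:, :n] = True
    out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
    return out[..., :n, :]  # drop the padded query rows
```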

Benchmark

OS: Windows 11

GPU: 7900xtx (gfx1100)

PyTorch 2.2.1 + CU118 ZLUDA, Python 3.10, HIP 5.7.1

FP16, causal = False

Triton built from: https://github.com/triton-lang/triton

git hash: 47fc046ff29c9ea2ee90e987c39628a540603c8f

The tests use a Triton Windows pre-built version: https://github.com/Repeerc/triton-windows-amdgpu

The official Triton version uses 06-fused-attention.py.

The CK-based (Composable Kernel) flash attention version was compiled from: https://github.com/ROCm/flash-attention/tree/howiejay/navi_support

CK-based flash attention Windows port: https://github.com/Repeerc/flash-attn-composable-kernel-gfx110x-windows-port

Sequence lengths aligned to a multiple of 32:

[benchmark plots]

[B, N, H, D] layout rearranged and made contiguous to [B, H, N, D]:

[benchmark plots]

Sequence lengths not aligned to a multiple of 32:

[benchmark plot]

[B, N, H, D] layout rearranged and made contiguous to [B, H, N, D]:

[benchmark plot]

Forward + backward:

[benchmark plot]
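The layout conversion benchmarked above is a permute plus a contiguous copy on the host side, e.g.:

```python
import torch

x_bnhd = torch.randn(1, 4096, 8, 64)               # [B, N, H, D]
x_bhnd = x_bnhd.permute(0, 2, 1, 3).contiguous()   # [B, H, N, D] copy
```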

FP16, causal = True

[benchmark plots: fwd_scan_N, fwd_bwd_scan_N, fwd_scan_D]

Performance in Stable Diffusion (ComfyUI)

OS: Windows 11

GPU: 7900xtx (gfx1100)

PyTorch 2.2.1 + CU118 ZLUDA, Python 3.10

Sampler: Euler

| SD 1.5 | PyTorch SDPA | Flash Attn minimal | Speedup |
| --- | --- | --- | --- |
| 512x512x1 | 17.32 it/s | 19.20 it/s | +10% |
| VRAM | 3.2 GB | 2.3 GB | |
| 512x512x4 | 4.96 it/s | 5.47 it/s | +10% |
| VRAM | 5.4 GB | 2.5 GB | |
| 1024x1024x1 | 2.52 it/s | 3.53 it/s | +40% |
| VRAM | 10.7 GB | 2.9 GB | |

| SDXL | PyTorch SDPA | Flash Attn minimal | Speedup |
| --- | --- | --- | --- |
| 1536x1024x1 | 2.03 it/s | 2.35 it/s | +16% |
| VRAM | 7.4 GB | 6.8 GB | |
| 1024x1024x1 | 3.30 it/s | 3.60 it/s | +9% |
| VRAM | 6.5 GB | 6.4 GB | |

SDXL U-Net LoRA training

```
unet_lr = 0.0001
lr_scheduler = "constant"
lr_warmup_steps = 0
optimizer_type = "AdamW"
network_dim = 32
network_alpha = 32
seed = 1337
mixed_precision = "fp16"
full_fp16 = false
full_bf16 = false
fp8_base = true
no_half_vae = false
```
| SDXL | PyTorch SDPA | Flash Attn minimal | Speedup |
| --- | --- | --- | --- |
| 1024x1024x1 | 1.27 it/s | 1.76 it/s | +39% |
| VRAM | 21.5 GB | 16.8 GB | |
