
Sage attn integration wip#1

Open
weiyanlin117 wants to merge 21 commits into unstable from sage_attn_integration

Conversation

@weiyanlin117

sanity test passed

```c
ccv_nnc_tensor_free(bt);
}

TEST_CASE("direct sage vs flash attention NHD performance test")
```

Added a perf test:

```shell
make -C test/int/nnc cublas.tests && test/int/nnc/cublas.tests " performance test"
```

=== Direct SageAttention NHD Performance Test ===
Testing performance with memory reuse across 1000 runs

[Trial 0] Config: B=32, R=160, C=128, Hq=8, Hk=8, D=64, causal=0
Full tensor comparison (1342177280 total_Flops):
GPU Execution Time: 0.0175 ms (average of 10000 runs with memory reuse)

[Trial 1] Config: B=12, R=256, C=128, Hq=8, Hk=8, D=128, causal=0
Full tensor comparison (1610612736 total_Flops):
GPU Execution Time: 0.0191 ms (average of 10000 runs with memory reuse)

[Trial 2] Config: B=16, R=128, C=128, Hq=8, Hk=8, D=128, causal=0
Full tensor comparison (1073741824 total_Flops):
GPU Execution Time: 0.0139 ms (average of 10000 runs with memory reuse)

[Trial 3] Config: B=1, R=77, C=128, Hq=8, Hk=8, D=64, causal=0
Full tensor comparison (20185088 total_Flops):
GPU Execution Time: 0.0073 ms (average of 10000 runs with memory reuse)

[Trial 4] Config: B=2, R=77, C=128, Hq=8, Hk=2, D=64, causal=0
Full tensor comparison (40370176 total_Flops):
GPU Execution Time: 0.0073 ms (average of 10000 runs with memory reuse)

[Trial 5] Config: B=1, R=5, C=5, Hq=32, Hk=8, D=128, causal=0
Full tensor comparison (409600 total_Flops):
GPU Execution Time: 0.0071 ms (average of 10000 runs with memory reuse)

=== Direct SageAttention NHD Performance Summary ===
Individual trial times (ms):
Trial 0: 0.0175 ms
Trial 1: 0.0191 ms
Trial 2: 0.0139 ms
Trial 3: 0.0073 ms
Trial 4: 0.0073 ms
Trial 5: 0.0071 ms
Fastest trial: 5 (0.0071 ms)
Slowest trial: 1 (0.0191 ms)

FlashAttention baseline, same configurations:

[Trial 0] Config: B=32, R=160, C=128, Hq=8, Hk=8, D=64, causal=0
Full tensor comparison (1342177280 total_flops):
GPU Execution Time: 0.0945 ms (average of 10000 runs with memory reuse)

[Trial 1] Config: B=12, R=256, C=128, Hq=8, Hk=8, D=128, causal=0
Full tensor comparison (1610612736 total_flops):
GPU Execution Time: 0.0932 ms (average of 10000 runs with memory reuse)

[Trial 2] Config: B=16, R=128, C=128, Hq=8, Hk=8, D=128, causal=0
Full tensor comparison (1073741824 total_flops):
GPU Execution Time: 0.0870 ms (average of 10000 runs with memory reuse)

[Trial 3] Config: B=1, R=77, C=128, Hq=8, Hk=8, D=64, causal=0
Full tensor comparison (20185088 total_flops):
GPU Execution Time: 0.0116 ms (average of 10000 runs with memory reuse)

[Trial 4] Config: B=2, R=77, C=128, Hq=8, Hk=2, D=64, causal=0
Full tensor comparison (40370176 total_flops):
GPU Execution Time: 0.0117 ms (average of 10000 runs with memory reuse)

[Trial 5] Config: B=1, R=5, C=5, Hq=32, Hk=8, D=128, causal=0
Full tensor comparison (409600 total_flops):
GPU Execution Time: 0.0094 ms (average of 10000 runs with memory reuse)

=== FlashAttention Performance Summary ===
Individual trial times (ms):
Trial 0: 0.0945 ms
Trial 1: 0.0932 ms
Trial 2: 0.0870 ms
Trial 3: 0.0116 ms
Trial 4: 0.0117 ms
Trial 5: 0.0094 ms
Fastest trial: 5 (0.0094 ms)
Slowest trial: 0 (0.0945 ms)

```c
2, // qk_quant_gran: 2=per_warp
scale, // sm_scale
0, // return_lse: false
1, // pv_accum_dtype: 2=FP32
```

Using FP32 (2) for pv_accum_dtype is about 10% slower:

[1/1] [RUN] direct sage vs flash attention NHD performance test ...
=== Direct SageAttention NHD Performance Test ===
Testing performance with memory reuse across 1000 runs

[Trial 0] Config: B=32, R=160, C=128, Hq=8, Hk=8, D=64, causal=0
Full tensor comparison (1342177280 total_Flops):
GPU Execution Time: 0.0199 ms (average of 10000 runs with memory reuse)

[Trial 1] Config: B=12, R=256, C=128, Hq=8, Hk=8, D=128, causal=0
Full tensor comparison (1610612736 total_Flops):
GPU Execution Time: 0.0197 ms (average of 10000 runs with memory reuse)

[Trial 2] Config: B=16, R=128, C=128, Hq=8, Hk=8, D=128, causal=0
Full tensor comparison (1073741824 total_Flops):
GPU Execution Time: 0.0157 ms (average of 10000 runs with memory reuse)

[Trial 3] Config: B=1, R=77, C=128, Hq=8, Hk=8, D=64, causal=0
Full tensor comparison (20185088 total_Flops):
GPU Execution Time: 0.0079 ms (average of 10000 runs with memory reuse)

[Trial 4] Config: B=2, R=77, C=128, Hq=8, Hk=2, D=64, causal=0
Full tensor comparison (40370176 total_Flops):
GPU Execution Time: 0.0079 ms (average of 10000 runs with memory reuse)

[Trial 5] Config: B=1, R=5, C=5, Hq=32, Hk=8, D=128, causal=0
Full tensor comparison (409600 total_Flops):
GPU Execution Time: 0.0079 ms (average of 10000 runs with memory reuse)

=== Direct SageAttention NHD Performance Summary ===
Individual trial times (ms):
Trial 0: 0.0199 ms
Trial 1: 0.0197 ms
Trial 2: 0.0157 ms
Trial 3: 0.0079 ms
Trial 4: 0.0079 ms
Trial 5: 0.0079 ms
Fastest trial: 5 (0.0079 ms)
Slowest trial: 0 (0.0199 ms)

FP16_FP32 mixed:
=== Direct SageAttention NHD Performance Summary ===
Individual trial times (ms):
Trial 0: 0.0175 ms
Trial 1: 0.0193 ms
Trial 2: 0.0140 ms
Trial 3: 0.0072 ms
Trial 4: 0.0072 ms
Trial 5: 0.0070 ms
Fastest trial: 5 (0.0070 ms)
Slowest trial: 1 (0.0193 ms)


    - "fp32": PV accumulation is done in FP32. This is the most accurate option but may be slower than "fp16" due to CUDA core overhead.
    - "fp16+fp32": PV accumulation is done in FP16, but added to an FP32 buffer every few iterations. This offers a balance between speed and accuracy.

@weiyanlin117

sage attn CMD without memory cache
(timed only from immediately before to immediately after the CMD is triggered)

[Trial 0] Config: B=32, R=160, C=128, Hq=8, Hk=8, D=64, causal=0
GPU Execution Time: 0.4437 ms (average of 1000 runs)

[Trial 1] Config: B=12, R=256, C=128, Hq=8, Hk=8, D=128, causal=0
GPU Execution Time: 0.2966 ms (average of 1000 runs)

sage attn CMD with memory cache

[Trial 0] Config: B=32, R=160, C=128, Hq=8, Hk=8, D=64, causal=0
GPU Execution Time: 0.2108 ms (average of 1000 runs)

[Trial 1] Config: B=12, R=256, C=128, Hq=8, Hk=8, D=128, causal=0
GPU Execution Time: 0.1605 ms (average of 1000 runs)
