
[CUDA] cuDNN backward attention#2762

Merged
zcbenz merged 6 commits into ml-explore:main from zcbenz:cudnn-sdpa-backward
Nov 18, 2025
Conversation

@zcbenz
Collaborator

@zcbenz zcbenz commented Nov 14, 2025

This PR uses cuDNN backward attention for fast::ScaledDotProductAttention::vjp.

  • A new ScaledDotProductAttentionVJP primitive is added, but it is only implemented in the CUDA backend.
  • For training, a stats output is generated from the forward attention op, which is required by the backward op.
  • The array mask has not been implemented yet, so in actual training the new code may not kick in.
  • There is some duplicate code, which I will clean up later together with the convolution code.
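For context on the "stats" output mentioned above: in flash-attention-style kernels (including cuDNN's fused attention), the forward pass saves the per-row logsumexp of the attention scores, and the backward pass recomputes the softmax from those stats instead of materializing the full attention matrix. A minimal NumPy sketch of that math (illustrative only, not the actual MLX or cuDNN code; single-head, unmasked, unbatched):

```python
import numpy as np

def sdpa_forward(q, k, v, scale):
    # Forward attention; besides the output, return the per-row
    # logsumexp "stats" that the backward pass needs.
    s = scale * (q @ k.T)                 # (L, S) attention scores
    lse = np.log(np.exp(s).sum(axis=-1))  # (L,) logsumexp stats
    p = np.exp(s - lse[:, None])          # softmax probabilities
    return p @ v, lse

def sdpa_backward(q, k, v, o, do, lse, scale):
    # Recompute the probabilities from the saved stats instead of
    # storing the (L, S) softmax matrix in the forward pass.
    s = scale * (q @ k.T)
    p = np.exp(s - lse[:, None])
    dv = p.T @ do
    dp = do @ v.T
    # Softmax backward: dS_i = P_i * (dP_i - <dO_i, O_i>)
    delta = (do * o).sum(axis=-1, keepdims=True)
    ds = p * (dp - delta)
    dq = scale * (ds @ k)
    dk = scale * (ds.T @ q)
    return dq, dk, dv
```

This is why only a small per-row vector has to be kept alive between forward and backward, which also explains the RAM savings in the numbers below.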

For training a 0.6B model:

before:

RAM usage: 60511MiB / 81920MiB

INFO:root:Model has 596049920 parameters.
INFO:root:step: 100, train_loss: 10.0955, grad_norm: 4.4649, its_per_sec: 2.1158, toks_per_sec: 17332.9950, tokens: 819200
INFO:root:step: 200, train_loss: 8.1339, grad_norm: 1.5591, its_per_sec: 2.6921, toks_per_sec: 22053.9309, tokens: 1638400
INFO:root:step: 300, train_loss: 7.6831, grad_norm: 1.5067, its_per_sec: 2.6923, toks_per_sec: 22055.2519, tokens: 2457600
INFO:root:step: 400, train_loss: 7.5040, grad_norm: 1.3949, its_per_sec: 2.6794, toks_per_sec: 21949.9022, tokens: 3276800
INFO:root:step: 500, train_loss: 7.2268, grad_norm: 1.4708, its_per_sec: 2.6820, toks_per_sec: 21970.5679, tokens: 4096000

after:

RAM usage: 52593MiB / 81920MiB

INFO:root:Model has 596049920 parameters.
INFO:root:step: 100, train_loss: 10.0885, grad_norm: 4.4325, its_per_sec: 3.0472, toks_per_sec: 24962.3116, tokens: 819200
INFO:root:step: 200, train_loss: 8.1385, grad_norm: 1.6766, its_per_sec: 3.4142, toks_per_sec: 27968.7873, tokens: 1638400
INFO:root:step: 300, train_loss: 7.6979, grad_norm: 2.3700, its_per_sec: 3.4130, toks_per_sec: 27959.5179, tokens: 2457600
INFO:root:step: 400, train_loss: 7.5094, grad_norm: 1.4560, its_per_sec: 3.3883, toks_per_sec: 27756.6980, tokens: 3276800
INFO:root:step: 500, train_loss: 7.2361, grad_norm: 1.8131, its_per_sec: 3.3830, toks_per_sec: 27713.4482, tokens: 4096000

@zcbenz zcbenz force-pushed the cudnn-sdpa-backward branch from 6a22b32 to 8f2b0c5 Compare November 18, 2025 05:21
Member

@awni awni left a comment


Very nice, LGTM!

@zcbenz zcbenz force-pushed the cudnn-sdpa-backward branch from 05b01fa to aa048f5 Compare November 18, 2025 07:25
@zcbenz
Collaborator Author

zcbenz commented Nov 18, 2025

I just noticed a behavior change in the Metal backend from replacing detail::in_grad_tracing() with output_logsumexp: for training, fast sdpa was falling back to unfused ops for both the forward and backward passes, but with this change it would use the fused sdpa for the forward pass.

I think the change is good for performance and tests are passing, but I want to make sure I'm not missing anything?

@awni
Member

awni commented Nov 18, 2025

I think the change is good for performance and tests are passing, but I want to make sure I'm not missing anything?

Actually we added that in the first place because it's faster for training on Metal to use the unfused SDPA for both forward and backward (since the forward and backward can share some computation there).

@zcbenz zcbenz force-pushed the cudnn-sdpa-backward branch from 0824fe4 to 5160d4c Compare November 18, 2025 21:57
@zcbenz
Collaborator Author

zcbenz commented Nov 18, 2025

Thanks for the info! I updated the code to keep the behavior unchanged for the Metal backend.
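The resulting dispatch, as discussed above, can be summarized in a small sketch (hypothetical function and parameter names, not the actual MLX code):

```python
def use_fused_sdpa(backend: str, grad_tracing: bool, has_sdpa_vjp: bool) -> bool:
    """Hypothetical summary of the fused-vs-unfused SDPA decision."""
    if not grad_tracing:
        # Inference keeps using the fused kernel when otherwise eligible.
        return True
    # During gradient tracing, Metal stays on the unfused path: its
    # forward and backward share computation, so unfused is faster for
    # training there. CUDA takes the fused cuDNN path when the backward
    # (vjp) implementation is available.
    return backend == "cuda" and has_sdpa_vjp
```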

@zcbenz zcbenz merged commit 6f35017 into ml-explore:main Nov 18, 2025
10 checks passed
@zcbenz zcbenz deleted the cudnn-sdpa-backward branch November 18, 2025 23:13
