Add Flash Attention 4 (CuTe DSL) Support #42404
Problem Statement
Flash Attention 4 represents a significant architectural shift in the flash-attention package:
- New module location: the `flash_attn.cute` submodule instead of the main `flash_attn` package
- `flash_attn_varlen_func` has a different signature - it does NOT accept `max_seqlen_q` and `max_seqlen_k` parameters (it calculates them internally from `cu_seqlens`)
- New parameters: `learnable_sink`, `num_splits`, and `pack_gqa`
- No `dropout_p` and `alibi_slopes` support

Without explicit FA4 support, users cannot leverage these improvements even when they have compatible hardware and the flash-attn package with CuTe DSL installed.
Solution Design
The solution maintains full backward compatibility while adding FA4 support through:
1. Detection Layer
Added an `is_flash_attn_4_available()` function that checks for the presence of the `flash_attn.cute` submodule.
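A minimal sketch of the detection idea, assuming availability is keyed purely off whether the `flash_attn.cute` submodule can be found (the actual helper in this PR may also verify versions or hardware):

```python
import importlib.util


def is_flash_attn_4_available() -> bool:
    # FA4 ships its CuTe DSL kernels under flash_attn.cute, so the submodule's
    # presence is a cheap proxy for FA4 support.
    if importlib.util.find_spec("flash_attn") is None:
        return False
    return importlib.util.find_spec("flash_attn.cute") is not None
```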
2. Priority-Based Auto-Selection
When `attn_implementation=None`, the auto-selection order gives FA4 the highest priority on compatible hardware for optimal performance.
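As a rough illustration of priority-based selection, reusing the availability helper sketched above. Only "FA4 first on compatible hardware" comes from this PR; the rest of the ordering below is an assumption, and this is not the actual dispatch code:

```python
from transformers.utils import is_flash_attn_2_available


def pick_attention_implementation() -> str:
    # Hypothetical helper: return the first implementation whose availability
    # check passes. FA4 sits at the top of the list per this PR; the order of
    # the remaining backends is illustrative only.
    candidates = [
        ("flash_attention_4", is_flash_attn_4_available),
        ("flash_attention_2", is_flash_attn_2_available),
        ("sdpa", lambda: True),  # PyTorch SDPA as a general-purpose fallback
    ]
    for name, is_available in candidates:
        if is_available():
            return name
    return "eager"
```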
3. Runtime Introspection
Created a `_is_using_fa4()` helper that uses function signature inspection to detect FA4 vs FA2/FA3 at runtime. This enables conditional code paths without hardcoded version checks.
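A sketch of one way such a helper can work, assuming detection keys off the missing `max_seqlen_q` parameter described above (the actual heuristic in this PR may inspect other parameters):

```python
import inspect


def _is_using_fa4(flash_varlen_fn) -> bool:
    # FA4's flash_attn_varlen_func drops max_seqlen_q/max_seqlen_k from its
    # signature, so their absence distinguishes it from FA2/FA3 at runtime
    # without comparing package versions.
    params = inspect.signature(flash_varlen_fn).parameters
    return "max_seqlen_q" not in params
```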
4. Conditional Varlen Calls
Modified two critical call sites in `_flash_attention_forward()` to conditionally pass parameters: FA4 is not given `max_seqlen_q` and `max_seqlen_k`, since it calculates them internally.
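A condensed sketch of the conditional call pattern, assuming a resolved `flash_varlen_fn` and the `_is_using_fa4()` helper above; the real `_flash_attention_forward()` handles many more arguments:

```python
def _call_varlen(flash_varlen_fn, q, k, v, cu_seqlens_q, cu_seqlens_k,
                 max_seqlen_q, max_seqlen_k, **extra_kwargs):
    # Arguments shared by every flash-attention generation.
    kwargs = dict(cu_seqlens_q=cu_seqlens_q, cu_seqlens_k=cu_seqlens_k, **extra_kwargs)
    if not _is_using_fa4(flash_varlen_fn):
        # FA2/FA3 still need the max sequence lengths; FA4 recomputes them
        # from cu_seqlens internally, so passing them would raise a TypeError.
        kwargs["max_seqlen_q"] = max_seqlen_q
        kwargs["max_seqlen_k"] = max_seqlen_k
    return flash_varlen_fn(q, k, v, **kwargs)
```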
5. Parameter Support
Extended `_process_flash_attention_kwargs()` to handle FA4-specific parameters, with automatic filtering based on introspection to maintain compatibility across versions.
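A sketch of signature-based filtering, assuming FA4-only keys such as `learnable_sink`, `num_splits`, and `pack_gqa` are simply dropped when the resolved kernel does not accept them; the actual `_process_flash_attention_kwargs()` does more than this:

```python
import inspect


def _filter_flash_kwargs(flash_fn, flash_kwargs: dict) -> dict:
    # Keep only the keyword arguments the resolved kernel actually accepts, so
    # FA4-specific options (learnable_sink, num_splits, pack_gqa) are silently
    # dropped on FA2/FA3, and FA2-only options are dropped on FA4.
    accepted = inspect.signature(flash_fn).parameters
    return {k: v for k, v in flash_kwargs.items() if k in accepted}
```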
6. Registration
Registered `flash_attention_4` in `AttentionInterface._global_mapping` to enable explicit selection via `attn_implementation="flash_attention_4"`.
Implementation Details
Core Changes
Detection and Import
- Detection of the `flash_attn.cute` submodule
Integration Layer
Interface Registration
Testing Infrastructure
New Files
Test Suite
Comprehensive test coverage for the FA4 integration.
Validation Script
Quick validation script intended to be run over SSH on a GPU machine.
Usage Examples
Demonstrates the usage patterns shown in the Usage section below.
Testing Status
Automated Checks
Created and ran a comprehensive verification script: all 14 core integration checks passed, plus 7 additional file checks.
Pending Testing (Requires GPU)
GPU Validation Required
Due to lack of CUDA GPU access during development, the following tests are pending:
- Basic Functionality
- Integration Tests
- Real-World Usage
Hardware Requirements
Known Limitations
- No `dropout_p` parameter - training with dropout will automatically fall back to FA2/eager
- `softcap != 0.0` may be restricted during the backward pass

All limitations are handled gracefully via automatic fallback.
Usage
Explicit FA4 Selection
Users can explicitly request FA4 when loading models.
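For example (the checkpoint name is only a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",               # placeholder checkpoint
    attn_implementation="flash_attention_4",  # explicit FA4 selection added by this PR
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
```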
Auto-Selection (Recommended)
When no attention implementation is specified, transformers will automatically select the best available implementation, with FA4 receiving highest priority on compatible hardware.
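For example, omitting `attn_implementation` lets the priority-based selection described above run (placeholder checkpoint):

```python
import torch
from transformers import AutoModelForCausalLM

# No attn_implementation argument: transformers picks the best available
# backend, preferring FA4 on compatible hardware when flash-attn with the
# CuTe DSL is installed.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",   # placeholder checkpoint
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
```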
Check Availability
Users can check whether FA4 is available using the `is_flash_attn_4_available()` function.
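For example, assuming the helper is exported next to the existing availability utilities in `transformers.utils` (the exact import location depends on this PR):

```python
from transformers.utils import is_flash_attn_4_available  # assumed import location

if is_flash_attn_4_available():
    attn_implementation = "flash_attention_4"
else:
    # Fall back when FA4 (flash_attn.cute) is not installed.
    attn_implementation = "sdpa"
print(f"Using attention implementation: {attn_implementation}")
```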
Fixes #42405