
Add challenge 89: Flash Attention Forward (Medium) #232

Open
claude[bot] wants to merge 1 commit into main from add-challenge-89-flash-attention

Conversation


claude[bot] commented Mar 31, 2026

Summary

  • Adds challenge 89: Flash Attention Forward (Medium difficulty)
  • Solvers implement scaled dot-product attention using the online softmax algorithm, the key Flash Attention innovation that avoids materializing the full seq_len × seq_len attention matrix in global memory
  • Inputs: Q, K, V each [num_heads, seq_len, head_dim]; output same shape
  • Teaches: online (numerically stable) softmax, SRAM tiling, memory-bandwidth-efficient attention
  • Validated with --action submit on an NVIDIA Tesla T4: all functional and performance tests pass
  • Performance test: num_heads=16, seq_len=4096, head_dim=64
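
The online-softmax tiling described above can be sketched in plain NumPy. This is an illustrative reference only, not the challenge's solution kernel: a real Flash Attention implementation keeps each tile in SRAM/shared memory, whereas this sketch just demonstrates that the running-max/running-sum rescaling produces exact softmax attention without ever forming the full seq_len × seq_len score matrix.

```python
import numpy as np

def flash_attention_forward(Q, K, V, block_size=128):
    """Tiled bidirectional attention using the online softmax update.

    Q, K, V: float arrays of shape [num_heads, seq_len, head_dim].
    Returns the attention output with the same shape.
    """
    num_heads, seq_len, head_dim = Q.shape
    scale = 1.0 / np.sqrt(head_dim)
    O = np.zeros_like(Q)
    for h in range(num_heads):
        for qs in range(0, seq_len, block_size):
            q = Q[h, qs:qs + block_size]          # [Bq, d] query tile
            m = np.full(q.shape[0], -np.inf)      # running row max
            l = np.zeros(q.shape[0])              # running softmax denominator
            acc = np.zeros_like(q)                # unnormalized output
            for ks in range(0, seq_len, block_size):
                k = K[h, ks:ks + block_size]
                v = V[h, ks:ks + block_size]
                s = (q @ k.T) * scale             # [Bq, Bk] score tile only
                m_new = np.maximum(m, s.max(axis=1))
                p = np.exp(s - m_new[:, None])    # numerically stable exps
                corr = np.exp(m - m_new)          # rescale previous tiles
                l = l * corr + p.sum(axis=1)
                acc = acc * corr[:, None] + p @ v
                m = m_new
            O[h, qs:qs + block_size] = acc / l[:, None]
    return O
```

Note there is no causal mask here, matching the bidirectional formulation of this challenge.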

Why this challenge is distinct from existing attention challenges

  • Challenge 53 (Causal Self-Attention, Hard): single-head naive attention with causal mask
  • Challenge 80 (Grouped Query Attention, Medium): multi-head GQA with shared KV heads
  • This challenge: bidirectional (no causal mask) multi-head attention, with the focus on the Flash Attention online-softmax algorithm: a fundamentally different computational approach to the same mathematical operation

Test plan

  • All 6 starter files present (.cu, .pytorch.py, .triton.py, .jax.py, .cute.py, .mojo)
  • challenge.html passes all checklist items (starts with <p>, correct <h2> sections, example matches generate_example_test(), performance bullet)
  • challenge.py has all 6 methods; generate_functional_test() returns 10 test cases covering edge cases, powers-of-2, non-powers-of-2, and realistic sizes
  • generate_performance_test() fits comfortably in 16 GB VRAM, even at 5× allocation (4 tensors × 16 × 4096 × 64 × 4 B ≈ 64 MB)
  • pre-commit run --all-files passes
  • run_challenge.py --action submit passes all tests on NVIDIA Tesla T4
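
The VRAM arithmetic in the test plan can be checked with a short script. The function names below are illustrative, not part of the challenge harness; the contrast with the seq_len × seq_len score matrix that a naive kernel would materialize is why the tiled approach matters at these sizes.

```python
def attention_io_bytes(num_heads, seq_len, head_dim,
                       num_tensors=4, dtype_bytes=4):
    """Total bytes for Q, K, V, and the output, each float32."""
    return num_tensors * num_heads * seq_len * head_dim * dtype_bytes

def score_matrix_bytes(num_heads, seq_len, dtype_bytes=4):
    """Bytes a naive kernel would spend on the full score matrix."""
    return num_heads * seq_len * seq_len * dtype_bytes

# Performance-test shape: num_heads=16, seq_len=4096, head_dim=64
io_mb = attention_io_bytes(16, 4096, 64) / 2**20      # 64.0 MB for Q, K, V, O
scores_gb = score_matrix_bytes(16, 4096) / 2**30      # 1.0 GB if materialized
```

So the inputs and output together are ~64 MB, while the attention matrix Flash Attention avoids writing out would be ~1 GB in float32.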

🤖 Generated with Claude Code

Introduces a Flash Attention forward-pass challenge teaching the
online-softmax tiling algorithm that avoids materializing the full
seq_len × seq_len attention matrix in global memory.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>