feat(v0.2.2): Ampere-optimized SGEMM with cp.async pipeline (18 TFLOPS)#37
Merged
Conversation
…ests

Performance improvements (v2):
- 128x128 output tile with 256 threads (16x16)
- 8x8 elements per thread (64 output elements)
- BK=16 for better memory bandwidth utilization
- Shared memory with padding to avoid bank conflicts
- Performance: ~9-10 TFLOPS (47% improvement over the 6.8 TFLOPS baseline)

TDD tests added:
- Minimum performance threshold tests (22 TFLOPS target)
- Target performance tests (35.6 TFLOPS, 90% efficiency)
- Correctness tests (all passing)

Note: the 22+ TFLOPS target requires advanced optimizations:
- Async copy (cp.async) for Ampere
- Software pipelining with double/triple buffering
- Tensor Cores (wmma) for FP16/TF32
- Detailed profiling with Nsight Compute

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Implement 4-stage software-pipelined GEMM using cp.async for asynchronous memory transfers
- Configuration: BM=128, BN=128, BK=16, 256 threads, 8x8 thread tiles
- Fix critical row-major A stride calculation bug (BK+PAD, not BM+PAD)
- Use float4 vectorized loads for both A and B matrices
- Achieve ~18 TFLOPS on RTX 3090 Ti at 8192x8192 (51% theoretical efficiency)
- Full correctness verification passes for all matrix sizes (256-4096)
- Require SM >= 80 (Ampere) for cp.async support

Performance results:
- 8192x8192: 18.2 TFLOPS (max: 18.3)
- 4096x4096: 13.2 TFLOPS
- 2048x2048: 7.6 TFLOPS
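For orientation, the multi-stage cp.async idea described above can be sketched with CUDA's pipeline primitives. This is a minimal illustrative demo, not the PR's actual kernel: while the math consumes stage t, the copy for stage t+STAGES-1 is already in flight. The tile shape, buffer names, and the trivial per-thread "compute" are assumptions for the sketch, not the real BM/BN/BK layout.

```cuda
#include <cuda_pipeline.h>  // __pipeline_memcpy_async / commit / wait_prior (CUDA 11+, SM >= 80)

#define STAGES 4   // pipeline depth, as in the PR's 4-stage design
#define TILE   16  // illustrative tile width; launch with blockDim.x == TILE

__global__ void pipelined_copy_demo(const float4* __restrict__ gmem,
                                    float* out, int ntiles) {
    __shared__ float4 buf[STAGES][TILE];  // one shared-memory buffer per stage

    // Prologue: put STAGES-1 copies in flight before any compute.
    for (int s = 0; s < STAGES - 1 && s < ntiles; ++s) {
        __pipeline_memcpy_async(&buf[s][threadIdx.x],
                                &gmem[s * TILE + threadIdx.x], sizeof(float4));
        __pipeline_commit();
    }

    float acc = 0.0f;
    for (int t = 0; t < ntiles; ++t) {
        int next = t + STAGES - 1;
        if (next < ntiles) {
            // Issue the copy for the stage STAGES-1 ahead of the one we consume.
            __pipeline_memcpy_async(&buf[next % STAGES][threadIdx.x],
                                    &gmem[next * TILE + threadIdx.x], sizeof(float4));
            __pipeline_commit();
            __pipeline_wait_prior(STAGES - 1);  // all but the newest STAGES-1 commits done => stage t resident
        } else {
            __pipeline_wait_prior(0);           // pipeline tail: wait for everything
        }
        __syncthreads();                        // needed once threads read each other's tiles
        float4 v = buf[t % STAGES][threadIdx.x];
        acc += v.x + v.y + v.z + v.w;           // stand-in for the real 8x8 FMA tile
        __syncthreads();                        // don't overwrite a stage still being read
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}
```

The design point is latency hiding: with 4 stages, global-memory loads for three future tiles overlap the current tile's arithmetic, which is what lifts the kernel from the ~9-10 TFLOPS synchronous version toward 18 TFLOPS.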
The Ampere GEMM kernel uses cp.async, which requires SM 80 or higher. This fixes the cmake-check CI failure.
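A typical way to enforce this at compile time is an architecture guard in the header; the snippet below is an assumed sketch of that pattern, not the PR's exact code.

```cuda
// Fail fast when the translation unit is compiled for a pre-Ampere target,
// since cp.async (used throughout this header) only exists on SM >= 80.
// The error message and placement are illustrative assumptions.
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 800
#error "matmul_f32_ampere.cuh requires SM >= 80 (Ampere) for cp.async"
#endif
```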
The CUDA DLL path setup now checks whether the directory exists before attempting to add it, preventing a FileNotFoundError on CI runners without CUDA installed.
- Enable GitHub cache for CUDA toolkit downloads
- Add ccache for C++/CUDA compilation caching
- Should significantly reduce cmake-check time on subsequent runs
- Update performance table with Ampere SGEMM results (18.2 TFLOPS)
- Add cp.async pipeline features to README
- Mark v0.2.1 and v0.2.2 as released in roadmap
Summary
Performance Results (RTX 3090 Ti)
Correctness Verification
All sizes pass with relative error < 3e-6:
Key Implementation Details
Files Changed
- native/ops/matmul_f32_ampere.cuh - New Ampere-optimized kernel
- native/ops/basic.cu - Integration with matmul dispatch
- pyproject.toml - Require SM >= 80 for cp.async
- benchmark_ampere.py - Performance benchmark script

Test plan