
feat(v0.2.2): Ampere-optimized SGEMM with cp.async pipeline (18 TFLOPS) #37

Merged

m96-chan merged 8 commits into main from feature/v0.2.2-3090ti-tuning on Dec 13, 2025

Conversation

@m96-chan (Owner) commented Dec 12, 2025

Summary

  • Add Ampere-optimized SGEMM kernel using cp.async for asynchronous global-to-shared memory transfers
  • Implement 4-stage software pipeline with proper latency hiding (see the sketch after this list)
  • Fix critical row-major A stride calculation bug that caused correctness failures at 4096+ matrix sizes
  • Require SM >= 80 (Ampere) for cp.async support
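A minimal sketch of the cp.async pattern the kernel builds on (helper names here are illustrative, not the actual code in native/ops/matmul_f32_ampere.cuh). cp.async issues a 16-byte global-to-shared copy that completes asynchronously; the copies for one pipeline stage are committed as a group, and wait_group bounds how many groups may still be in flight:

```cuda
// Illustrative cp.async helpers; assumes compilation for SM >= 80 (Ampere).
#if __CUDA_ARCH__ >= 800

// Issue one 16-byte asynchronous copy from global to shared memory.
__device__ __forceinline__ void cp_async_16(void* smem_dst, const void* gmem_src) {
    unsigned dst = (unsigned)__cvta_generic_to_shared(smem_dst);
    asm volatile("cp.async.cg.shared.global [%0], [%1], 16;\n"
                 :: "r"(dst), "l"(gmem_src));
}

// Close the group of copies issued for the current pipeline stage.
__device__ __forceinline__ void cp_async_commit() {
    asm volatile("cp.async.commit_group;\n" ::);
}

// Block until at most N committed groups remain in flight.
template <int N>
__device__ __forceinline__ void cp_async_wait() {
    asm volatile("cp.async.wait_group %0;\n" :: "n"(N));
}

#endif  // __CUDA_ARCH__ >= 800
```

With 4 stages and wait_group(2), the two oldest stages are guaranteed resident in shared memory while the two newest may still be copying, so computation on one stage overlaps the loads for later stages.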

Performance Results (RTX 3090 Ti)

Matrix Size   TFLOPS   Efficiency   vs Previous
8192×8192     18.2     51%          +75% (from 10.4)
4096×4096     13.2     37%          +47% (from 9.0)
2048×2048      7.6     21%          -

Correctness Verification

All sizes pass with relative error < 3e-6 (a sketch of this check follows the list):

  • ✅ 256×256: PASS
  • ✅ 512×512: PASS
  • ✅ 1024×1024: PASS
  • ✅ 2048×2048: PASS
  • ✅ 4096×4096: PASS
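A hedged sketch of such a check (hypothetical helper, not the project's actual test code): compare the kernel output element-wise against a reference result, e.g. from cuBLAS, and require the maximum relative error to stay under the threshold.

```cuda
#include <cmath>
#include <cstdio>

// Hypothetical host-side verifier: returns true when every element of C
// matches the reference Cref to within `tol` relative error. Both buffers
// are assumed to have been copied back to the host already.
bool verify(const float* C, const float* Cref, size_t n, float tol = 3e-6f) {
    float max_rel = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float denom = std::fmax(std::fabs(Cref[i]), 1e-20f);
        max_rel = std::fmax(max_rel, std::fabs(C[i] - Cref[i]) / denom);
    }
    std::printf("max relative error: %.3g (%s)\n",
                max_rel, max_rel < tol ? "PASS" : "FAIL");
    return max_rel < tol;
}
```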

Key Implementation Details

  • Configuration: BM=128, BN=128, BK=16, 256 threads, 8×8 thread tiles
  • Pipeline: 4-stage cp.async with wait_group(2)
  • Memory Layout: Row-major A with stride=20 (BK+4), B with stride=136 (BN+8); laid out in the sketch after this list
  • Vectorization: float4 (16-byte) loads for both A and B matrices
  • Shared Memory: ~74 KB (4 stages × (2560 + 2176) × 4 bytes)
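The shared-memory figure above falls out of the padded tile strides (a sketch under the stated configuration; names are illustrative, not the actual kernel). The stride-bug fix lives in the BK + 4 term: a row-major A tile stores BK elements per row, so the pad belongs on BK, not BM.

```cuda
constexpr int BM = 128, BN = 128, BK = 16;
constexpr int STAGES = 4;
constexpr int A_STRIDE = BK + 4;  // 20: row-major A rows hold BK elements,
                                  // so the pad goes on BK (BM+PAD was the bug)
constexpr int B_STRIDE = BN + 8;  // 136: padding sidesteps bank conflicts

__global__ void sgemm_ampere_sketch(/* ... */) {
    // Per stage: 128*20 = 2560 floats (A) + 16*136 = 2176 floats (B);
    // 4 * 4736 floats * 4 bytes = 75,776 bytes (~74 KB). That exceeds the
    // 48 KB static limit, so a real kernel would use dynamic shared memory,
    // opted in on the host via cudaFuncSetAttribute(...,
    // cudaFuncAttributeMaxDynamicSharedMemorySize, ...).
    extern __shared__ float smem[];
    float* smemA = smem;                           // STAGES * BM * A_STRIDE floats
    float* smemB = smem + STAGES * BM * A_STRIDE;  // STAGES * BK * B_STRIDE floats
    // float4 (16-byte) loads fill these tiles four floats per instruction,
    // matching the 16-byte cp.async granularity.
    (void)smemA; (void)smemB;  // sketch only
}
```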

Files Changed

  • native/ops/matmul_f32_ampere.cuh - New Ampere-optimized kernel
  • native/ops/basic.cu - Integration with matmul dispatch
  • pyproject.toml - Require SM >= 80 for cp.async
  • benchmark_ampere.py - Performance benchmark script

Test plan

  • Correctness verification at all matrix sizes
  • Performance benchmark at 2048, 4096, 8192 (see the timing sketch below)
  • Build on Windows with CUDA 12.4
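SGEMM at size N performs 2·N³ floating-point operations, so the TFLOPS figures follow directly from measured kernel time. A minimal CUDA-event timing sketch (hypothetical names; the PR's actual harness is benchmark_ampere.py):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical harness: `launch_sgemm(N)` stands in for the real kernel launch.
void benchmark(int N, void (*launch_sgemm)(int)) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    launch_sgemm(N);  // warm-up
    const int iters = 10;
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) launch_sgemm(N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double sec = (ms / iters) / 1e3;
    double tflops = 2.0 * N * N * N / sec / 1e12;  // 2*M*N*K FLOPs, M=N=K
    std::printf("%dx%d: %.1f TFLOPS\n", N, N, tflops);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```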

🤖 Generated with Claude Code

m96-chan and others added 2 commits December 12, 2025 23:53
…ests

Performance improvements (v2):
- 128x128 output tile with 256 threads (16x16)
- 8x8 elements per thread (64 output elements)
- BK=16 for better memory bandwidth utilization
- Shared memory with padding to avoid bank conflicts
- Performance: ~9-10 TFLOPS (47% improvement from 6.8 TFLOPS baseline)

TDD tests added:
- Minimum performance threshold tests (22 TFLOPS target)
- Target performance tests (35.6 TFLOPS, 90% efficiency)
- Correctness tests (all passing)

Note: Target 22+ TFLOPS requires advanced optimizations:
- Async copy (cp.async) for Ampere
- Software pipelining with double/triple buffering
- Tensor Cores (wmma) for FP16/TF32
- Detailed profiling with Nsight Compute

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Implement 4-stage software pipelined GEMM using cp.async for async memory transfers
- Configuration: BM=128, BN=128, BK=16, 256 threads, 8x8 thread tiles
- Fix critical row-major A stride calculation bug (BK+PAD, not BM+PAD)
- Use float4 vectorized loads for both A and B matrices
- Achieve ~18 TFLOPS on RTX 3090 Ti at 8192x8192 (51% theoretical efficiency)
- Full correctness verification passes for all matrix sizes (256-4096)
- Require SM >= 80 (Ampere) for cp.async support

Performance results:
- 8192x8192: 18.2 TFLOPS (max: 18.3)
- 4096x4096: 13.2 TFLOPS
- 2048x2048: 7.6 TFLOPS

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@m96-chan changed the title from "feat(v0.2.2): RTX 3090 Ti matmul optimization" to "feat(v0.2.2): Ampere-optimized SGEMM with cp.async pipeline (18 TFLOPS)" on Dec 13, 2025
m96-chan and others added 6 commits December 13, 2025 14:42
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The Ampere GEMM kernel uses cp.async which requires SM 80 or higher.
This fixes the cmake-check CI failure.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The CUDA DLL path setup now checks if the directory exists before
attempting to add it, preventing FileNotFoundError on CI runners
without CUDA installed.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Enable GitHub cache for CUDA toolkit downloads
- Add ccache for C++/CUDA compilation caching
- Should significantly reduce cmake-check time on subsequent runs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Update performance table with Ampere SGEMM results (18.2 TFLOPS)
- Add cp.async pipeline features to README
- Mark v0.2.1 and v0.2.2 as released in roadmap

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@m96-chan force-pushed the feature/v0.2.2-3090ti-tuning branch from c6e7407 to 116d6b5 on December 13, 2025 06:06
@m96-chan merged commit 412b550 into main Dec 13, 2025
13 checks passed
@m96-chan deleted the feature/v0.2.2-3090ti-tuning branch December 13, 2025 06:12