
feat(v0.2.2): Ampere-optimized SGEMM with cp.async pipeline (18 TFLOPS) #37

Merged

m96-chan merged 8 commits into main from feature/v0.2.2-3090ti-tuning on Dec 13, 2025

Conversation

@m96-chan (Owner) commented Dec 12, 2025

Summary

  • Add Ampere-optimized SGEMM kernel using cp.async for asynchronous global-to-shared memory transfers
  • Implement 4-stage software pipeline with proper latency hiding (see the sketch after this list)
  • Fix critical row-major A stride calculation bug that caused correctness failures at 4096+ matrix sizes
  • Require SM >= 80 (Ampere) for cp.async support
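A minimal sketch of the cp.async pattern the kernel builds on (helper names here are illustrative, not the actual code in native/ops/matmul_f32_ampere.cuh). cp.async issues a 16-byte global-to-shared copy that completes asynchronously; the copies for one pipeline stage are committed as a group, and wait_group bounds how many groups may still be in flight:

```cuda
// Illustrative cp.async helpers; assumes compilation for SM >= 80 (Ampere).
#if __CUDA_ARCH__ >= 800

// Issue one 16-byte asynchronous copy from global to shared memory.
__device__ __forceinline__ void cp_async_16(void* smem_dst, const void* gmem_src) {
    unsigned dst = (unsigned)__cvta_generic_to_shared(smem_dst);
    asm volatile("cp.async.cg.shared.global [%0], [%1], 16;\n"
                 :: "r"(dst), "l"(gmem_src));
}

// Close the group of copies issued for the current pipeline stage.
__device__ __forceinline__ void cp_async_commit() {
    asm volatile("cp.async.commit_group;\n" ::);
}

// Block until at most N committed groups remain in flight.
template <int N>
__device__ __forceinline__ void cp_async_wait() {
    asm volatile("cp.async.wait_group %0;\n" :: "n"(N));
}

#endif  // __CUDA_ARCH__ >= 800
```

With 4 stages and wait_group(2), the two oldest stages are guaranteed resident in shared memory while the two newest may still be copying, so computation on one stage overlaps the loads for later stages.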

Performance Results (RTX 3090 Ti)

Matrix Size   TFLOPS   Efficiency   vs Previous
8192×8192     18.2     51%          +75% (from 10.4)
4096×4096     13.2     37%          +47% (from 9.0)
2048×2048      7.6     21%          -

Correctness Verification

All sizes pass with relative error < 3e-6 (a sketch of this check follows the list):

  • ✅ 256×256: PASS
  • ✅ 512×512: PASS
  • ✅ 1024×1024: PASS
  • ✅ 2048×2048: PASS
  • ✅ 4096×4096: PASS
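A hedged sketch of such a check (hypothetical helper, not the project's actual test code): compare the kernel output element-wise against a reference result, e.g. from cuBLAS, and require the maximum relative error to stay under the threshold.

```cuda
#include <cmath>
#include <cstdio>

// Hypothetical host-side verifier: returns true when every element of C
// matches the reference Cref to within `tol` relative error. Both buffers
// are assumed to have been copied back to the host already.
bool verify(const float* C, const float* Cref, size_t n, float tol = 3e-6f) {
    float max_rel = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float denom = std::fmax(std::fabs(Cref[i]), 1e-20f);
        max_rel = std::fmax(max_rel, std::fabs(C[i] - Cref[i]) / denom);
    }
    std::printf("max relative error: %.3g (%s)\n",
                max_rel, max_rel < tol ? "PASS" : "FAIL");
    return max_rel < tol;
}
```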

Key Implementation Details

  • Configuration: BM=128, BN=128, BK=16, 256 threads, 8×8 thread tiles
  • Pipeline: 4-stage cp.async with wait_group(2)
  • Memory Layout: Row-major A with stride=20 (BK+4), B with stride=136 (BN+8); laid out in the sketch after this list
  • Vectorization: float4 (16-byte) loads for both A and B matrices
  • Shared Memory: ~74 KB (4 stages × (2560 + 2176) × 4 bytes)
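The shared-memory figure above falls out of the padded tile strides (a sketch under the stated configuration; names are illustrative, not the actual kernel). The stride-bug fix lives in the BK + 4 term: a row-major A tile stores BK elements per row, so the pad belongs on BK, not BM.

```cuda
constexpr int BM = 128, BN = 128, BK = 16;
constexpr int STAGES = 4;
constexpr int A_STRIDE = BK + 4;  // 20: row-major A rows hold BK elements,
                                  // so the pad goes on BK (BM+PAD was the bug)
constexpr int B_STRIDE = BN + 8;  // 136: padding sidesteps bank conflicts

__global__ void sgemm_ampere_sketch(/* ... */) {
    // Per stage: 128*20 = 2560 floats (A) + 16*136 = 2176 floats (B);
    // 4 * 4736 floats * 4 bytes = 75,776 bytes (~74 KB). That exceeds the
    // 48 KB static limit, so a real kernel would use dynamic shared memory,
    // opted in on the host via cudaFuncSetAttribute(...,
    // cudaFuncAttributeMaxDynamicSharedMemorySize, ...).
    extern __shared__ float smem[];
    float* smemA = smem;                           // STAGES * BM * A_STRIDE floats
    float* smemB = smem + STAGES * BM * A_STRIDE;  // STAGES * BK * B_STRIDE floats
    // float4 (16-byte) loads fill these tiles four floats per instruction,
    // matching the 16-byte cp.async granularity.
    (void)smemA; (void)smemB;  // sketch only
}
```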

Files Changed

  • native/ops/matmul_f32_ampere.cuh - New Ampere-optimized kernel
  • native/ops/basic.cu - Integration with matmul dispatch
  • pyproject.toml - Require SM >= 80 for cp.async
  • benchmark_ampere.py - Performance benchmark script

Test plan

  • Correctness verification at all matrix sizes
  • Performance benchmark at 2048, 4096, 8192 (see the timing sketch below)
  • Build on Windows with CUDA 12.4
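SGEMM at size N performs 2·N³ floating-point operations, so the TFLOPS figures follow directly from measured kernel time. A minimal CUDA-event timing sketch (hypothetical names; the PR's actual harness is benchmark_ampere.py):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical harness: `launch_sgemm(N)` stands in for the real kernel launch.
void benchmark(int N, void (*launch_sgemm)(int)) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    launch_sgemm(N);  // warm-up
    const int iters = 10;
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) launch_sgemm(N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double sec = (ms / iters) / 1e3;
    double tflops = 2.0 * N * N * N / sec / 1e12;  // 2*M*N*K FLOPs, M=N=K
    std::printf("%dx%d: %.1f TFLOPS\n", N, N, tflops);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```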

🤖 Generated with Claude Code

m96-chan and others added 2 commits December 12, 2025 23:53
…ests

Performance improvements (v2):
- 128x128 output tile with 256 threads (16x16)
- 8x8 elements per thread (64 output elements)
- BK=16 for better memory bandwidth utilization
- Shared memory with padding to avoid bank conflicts
- Performance: ~9-10 TFLOPS (47% improvement from 6.8 TFLOPS baseline)

TDD tests added:
- Minimum performance threshold tests (22 TFLOPS target)
- Target performance tests (35.6 TFLOPS, 90% efficiency)
- Correctness tests (all passing)

Note: Target 22+ TFLOPS requires advanced optimizations:
- Async copy (cp.async) for Ampere
- Software pipelining with double/triple buffering
- Tensor Cores (wmma) for FP16/TF32
- Detailed profiling with Nsight Compute

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Implement 4-stage software pipelined GEMM using cp.async for async memory transfers
- Configuration: BM=128, BN=128, BK=16, 256 threads, 8x8 thread tiles
- Fix critical row-major A stride calculation bug (BK+PAD, not BM+PAD)
- Use float4 vectorized loads for both A and B matrices
- Achieve ~18 TFLOPS on RTX 3090 Ti at 8192x8192 (51% theoretical efficiency)
- Full correctness verification passes for all matrix sizes (256-4096)
- Require SM >= 80 (Ampere) for cp.async support

Performance results:
- 8192x8192: 18.2 TFLOPS (max: 18.3)
- 4096x4096: 13.2 TFLOPS
- 2048x2048: 7.6 TFLOPS

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@m96-chan changed the title from "feat(v0.2.2): RTX 3090 Ti matmul optimization" to "feat(v0.2.2): Ampere-optimized SGEMM with cp.async pipeline (18 TFLOPS)" on Dec 13, 2025
m96-chan and others added 6 commits December 13, 2025 14:42
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The Ampere GEMM kernel uses cp.async which requires SM 80 or higher.
This fixes the cmake-check CI failure.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The CUDA DLL path setup now checks if the directory exists before
attempting to add it, preventing FileNotFoundError on CI runners
without CUDA installed.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Enable GitHub cache for CUDA toolkit downloads
- Add ccache for C++/CUDA compilation caching
- Should significantly reduce cmake-check time on subsequent runs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Update performance table with Ampere SGEMM results (18.2 TFLOPS)
- Add cp.async pipeline features to README
- Mark v0.2.1 and v0.2.2 as released in roadmap

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@m96-chan force-pushed the feature/v0.2.2-3090ti-tuning branch from c6e7407 to 116d6b5 on December 13, 2025 06:06
@m96-chan merged commit 412b550 into main Dec 13, 2025
13 checks passed
@m96-chan deleted the feature/v0.2.2-3090ti-tuning branch December 13, 2025 06:12