High-performance GEMM (General Matrix Multiply) kernels for NVIDIA Hopper architecture (SM90) using Triton.
- Triton-based: Written in Triton for easy development and optimization
- Hopper Optimized: Designed for H100/H800 GPUs with Tensor Core support
- Multiple Precisions: FP32, FP16, BF16, FP8 (E4M3/E5M2)
- Benchmark Framework: Compare against PyTorch/cuBLAS
- Correctness Verification: Automatic validation against reference implementations
```
HopperGemm/
├── pyproject.toml                 # Python package configuration
├── src/
│   └── hoppergemm/
│       ├── __init__.py            # Package entry point
│       ├── kernels/
│       │   ├── __init__.py
│       │   ├── gemm.py            # Basic Triton GEMM kernel
│       │   └── gemm_hopper.py     # Hopper-optimized kernel
│       └── gemm.py                # High-level API (optional)
├── benchmarks/
│   ├── benchmark_gemm.py          # Performance benchmark (vs PyTorch/cuBLAS)
│   └── correctness_test.py        # Correctness verification
├── tests/
│   └── test_gemm.py               # Unit tests
├── tools/
│   ├── setup_submodules.sh        # Initialize git submodules
│   ├── build_cutlass.sh           # Build CUTLASS examples
│   ├── benchmark_cublas.py        # cuBLAS benchmark script
│   ├── benchmark_cutlass.py       # CUTLASS benchmark script
│   └── run_all_benchmarks.py      # Comprehensive benchmark comparison
└── third_party/
    └── cutlass/                   # CUTLASS submodule (reference)
```
Requirements:
- Python: 3.9+
- PyTorch: 2.0+ (with CUDA support)
- Triton: 3.0+
- GPU: NVIDIA H100, H800, or H200 (SM90)
- CUDA: 12.0+
Installation:

```bash
# Clone repository with submodules
git clone --recursive https://github.com/xInference/HopperGemm.git
cd HopperGemm

# Or if already cloned without submodules, initialize them:
git submodule update --init --recursive

# Install package
pip install -e .

# Install dev dependencies
pip install -e ".[dev]"
```

For benchmarking against CUTLASS reference implementations:
```bash
# Build CUTLASS examples (requires CUDA 12+ and an SM90 GPU)
./tools/build_cutlass.sh

# Or build all examples (takes longer)
./tools/build_cutlass_all.sh
```

Quick start:

```python
import torch
from hoppergemm import matmul, matmul_hopper
# Create input matrices
a = torch.randn(1024, 1024, dtype=torch.float16, device='cuda')
b = torch.randn(1024, 1024, dtype=torch.float16, device='cuda')
# Basic Triton GEMM
c = matmul(a, b)
# Hopper-optimized GEMM
c_opt = matmul_hopper(a, b)
# Compare with PyTorch (cuBLAS)
c_ref = torch.mm(a, b)
# Check correctness
print(f"Max difference: {(c - c_ref).abs().max().item()}")Compare all implementations at once:
Compare all implementations at once:

```bash
# Run all benchmarks (HopperGemm, cuBLAS, CUTLASS)
python tools/run_all_benchmarks.py --m 1024 2048 4096 --n 1024 2048 4096 --k 1024 2048 4096 --dtype fp16
```

Compare HopperGemm against PyTorch/cuBLAS:
```bash
python benchmarks/benchmark_gemm.py --m 1024 2048 4096 --n 1024 2048 4096 --k 1024 2048 4096 --dtype fp16
```

Output example:
```
Kernel                      Time (ms)   TFLOPS   BW (GB/s)   Efficiency
-----------------------------------------------------------------------
PyTorch/cuBLAS (FP16) ★        0.1234    98.76     1245.67     baseline
HopperGemm Basic (FP16)        0.4567    53.34      654.32        27.0%
HopperGemm Opt (FP16)          0.2345    78.90      987.45        52.6%
```
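The TFLOPS column follows the standard GEMM flop count of 2·M·N·K (one multiply and one add per output contribution). A minimal sketch of the arithmetic; the function name is illustrative, not part of the benchmark script:

```python
def gemm_tflops(m: int, n: int, k: int, time_ms: float) -> float:
    # 2*M*N*K floating-point operations per GEMM
    flops = 2 * m * n * k
    return flops / (time_ms * 1e-3) / 1e12

# A 1024x1024x1024 GEMM in 0.1234 ms works out to ~17.4 TFLOPS
print(f"{gemm_tflops(1024, 1024, 1024, 0.1234):.1f}")
```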
Verify numerical accuracy against PyTorch/cuBLAS:
```bash
python benchmarks/correctness_test.py --m 128 256 512 1024 --dtype fp16
```

Output example:
```
Testing GEMM: M=1024, N=1024, K=1024, dtype=torch.float16
----------------------------------------------------------------------
HopperGemm Basic: PASSED
  Relative error: 1.23e-06 (tolerance: 1.00e-03)
  Absolute error: 2.45e-05
HopperGemm Hopper-Opt: PASSED
  Relative error: 3.45e-07 (tolerance: 1.00e-03)
  Absolute error: 1.23e-05

Summary: 2/2 tests passed
```
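The reported errors are presumably computed along these lines; the exact norms used by correctness_test.py are an assumption, and the helper below is illustrative:

```python
import torch

def gemm_errors(c: torch.Tensor, c_ref: torch.Tensor) -> tuple[float, float]:
    # Absolute error: largest elementwise deviation from the reference
    abs_err = (c - c_ref).abs().max().item()
    # Relative error: deviation normalized by the reference magnitude
    rel_err = ((c - c_ref).norm() / c_ref.norm()).item()
    return rel_err, abs_err
```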
For developing Hopper-optimized GEMM kernels, the primary benchmark is cuBLAS (via PyTorch):
- NVIDIA's highly optimized library
- Target: ~98 TFLOPS for FP16 on H100
- Use `torch.mm()` or `torch.matmul()` as the reference
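To get a cuBLAS baseline number directly, `torch.mm()` can be timed with CUDA events. A minimal sketch; the warmup and iteration counts are illustrative:

```python
import torch

def time_torch_mm(m: int = 4096, n: int = 4096, k: int = 4096, iters: int = 50) -> float:
    a = torch.randn(m, k, dtype=torch.float16, device="cuda")
    b = torch.randn(k, n, dtype=torch.float16, device="cuda")
    # Warm up so one-time CUDA/cuBLAS setup doesn't skew the measurement
    for _ in range(10):
        torch.mm(a, b)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        torch.mm(a, b)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per GEMM
```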
A secondary reference is CUTLASS:
- NVIDIA's open-source template library
- Excellent learning resource for GEMM optimization
- Located in `third_party/cutlass` as a git submodule
- Hopper example: `cutlass/examples/48_hopper_warp_specialized_gemm`
| Approach | Pros | Cons |
|---|---|---|
| CUDA C++ | Maximum control, best performance | Complex, requires deep GPU knowledge |
| Triton | Python-like syntax, easier optimization | Less control over low-level details |
| cuBLAS | Best performance out-of-box | Closed-source, limited flexibility |
Triton provides a good balance (see the kernel sketch after this list):
- Easier to write and understand
- Can achieve near-cuBLAS performance
- Great for learning GPU optimization concepts
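To make this concrete, here is a minimal Triton GEMM in the spirit of the official Triton matmul tutorial. This is a sketch with illustrative block sizes, not the project's `gemm.py` kernel:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def gemm_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                stride_am, stride_ak, stride_bk, stride_bn,
                stride_cm, stride_cn,
                BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # Each program instance computes one BLOCK_M x BLOCK_N tile of C
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    # Accumulate in FP32 for numerical stability, even with FP16 inputs
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        a_mask = (offs_m[:, None] < M) & (offs_k[None, :] + k < K)
        b_mask = (offs_k[:, None] + k < K) & (offs_n[None, :] < N)
        a = tl.load(a_ptrs, mask=a_mask, other=0.0)
        b = tl.load(b_ptrs, mask=b_mask, other=0.0)
        acc += tl.dot(a, b)  # lowers to Tensor Core MMA instructions
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    c_mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    tl.store(c_ptrs, acc.to(tl.float16), mask=c_mask)

def triton_matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), dtype=torch.float16, device=a.device)
    grid = (triton.cdiv(M, 64), triton.cdiv(N, 64))
    gemm_kernel[grid](a, b, c, M, N, K,
                      a.stride(0), a.stride(1), b.stride(0), b.stride(1),
                      c.stride(0), c.stride(1),
                      BLOCK_M=64, BLOCK_N=64, BLOCK_K=32)
    return c
```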
Implemented:
- Project structure and Python package
- Basic Triton GEMM kernel
- Benchmark framework with PyTorch comparison
- Correctness verification
Planned:
- Larger block sizes for Tensor Core utilization
- Multi-stage pipelining
- Optimized memory access patterns
- FP8 precision support
- Auto-tuning for different problem sizes (sketched after this list)
- Batch GEMM support
- Mixed precision accumulation
- Stream-K decomposition
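Several of these items map onto knobs Triton already exposes; for example, auto-tuning and multi-stage pipelining can be requested through Triton's autotuner. An illustrative setup follows; the configs are assumptions, not the project's tuned values:

```python
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"BLOCK_M": 128, "BLOCK_N": 128, "BLOCK_K": 64},
                      num_stages=4, num_warps=8),
        triton.Config({"BLOCK_M": 64, "BLOCK_N": 128, "BLOCK_K": 32},
                      num_stages=3, num_warps=4),
    ],
    key=["M", "N", "K"],  # re-tune (and cache) per problem shape
)
@triton.jit
def gemm_kernel_tuned(a_ptr, b_ptr, c_ptr, M, N, K,
                      stride_am, stride_ak, stride_bk, stride_bn,
                      stride_cm, stride_cn,
                      BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                      BLOCK_K: tl.constexpr):
    # Body as in the basic kernel sketch above; the autotuner supplies
    # BLOCK_* values and picks num_stages/num_warps at launch time.
    pass
```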
| Implementation | Target TFLOPS (FP16) | Notes |
|---|---|---|
| cuBLAS (PyTorch) | ~98 | NVIDIA reference, highly optimized |
| CUTLASS Hopper | ~95 | Open-source, excellent learning |
| HopperGemm Goal | 80+ | 80%+ cuBLAS efficiency |
- Triton Documentation - Official Triton docs
- Triton GEMM Tutorial - Great starting point
- NVIDIA CUTLASS - CUDA Templates for Linear Algebra
- Hopper GEMM Example - CUTLASS Hopper kernel
- cuBLAS Documentation - NVIDIA cuBLAS library
MIT License - see LICENSE for details.