HopperGemm

High-performance GEMM (General Matrix Multiply) kernels for NVIDIA Hopper architecture (SM90) using Triton.

Features

  • Triton-based: Written in Triton for easy development and optimization
  • Hopper Optimized: Designed for H100/H800 GPUs with Tensor Core support
  • Multiple Precisions: FP32, FP16, BF16, FP8 (E4M3/E5M2)
  • Benchmark Framework: Compare against PyTorch/cuBLAS
  • Correctness Verification: Automatic validation against reference implementations

Project Structure

HopperGemm/
├── pyproject.toml          # Python package configuration
├── src/
│   └── hoppergemm/
│       ├── __init__.py     # Package entry point
│       ├── kernels/
│       │   ├── __init__.py
│       │   ├── gemm.py         # Basic Triton GEMM kernel
│       │   └── gemm_hopper.py  # Hopper-optimized kernel
│       └── gemm.py         # High-level API (optional)
├── benchmarks/
│   ├── benchmark_gemm.py   # Performance benchmark (vs PyTorch/cuBLAS)
│   └── correctness_test.py # Correctness verification
├── tests/
│   └── test_gemm.py        # Unit tests
├── tools/
│   ├── setup_submodules.sh     # Initialize git submodules
│   ├── build_cutlass.sh        # Build CUTLASS examples
│   ├── benchmark_cublas.py     # cuBLAS benchmark script
│   ├── benchmark_cutlass.py    # CUTLASS benchmark script
│   └── run_all_benchmarks.py   # Comprehensive benchmark comparison
└── third_party/
    └── cutlass/            # CUTLASS submodule (reference)

Prerequisites

  • Python: 3.9+
  • PyTorch: 2.0+ (with CUDA support)
  • Triton: 3.0+
  • GPU: NVIDIA H100, H800, or H200 (SM90)
  • CUDA: 12.0+

Installation

# Clone repository with submodules
git clone --recursive https://github.com/xInference/HopperGemm.git
cd HopperGemm

# Or if already cloned without submodules, initialize them:
git submodule update --init --recursive

# Install package
pip install -e .

# Install dev dependencies
pip install -e ".[dev]"

Building CUTLASS Examples (Optional)

For benchmarking against CUTLASS reference implementations:

# Build CUTLASS examples (requires CUDA 12+ and SM90 GPU)
./tools/build_cutlass.sh

# Or build all examples (takes longer)
./tools/build_cutlass_all.sh

Quick Start

import torch
from hoppergemm import matmul, matmul_hopper

# Create input matrices
a = torch.randn(1024, 1024, dtype=torch.float16, device='cuda')
b = torch.randn(1024, 1024, dtype=torch.float16, device='cuda')

# Basic Triton GEMM
c = matmul(a, b)

# Hopper-optimized GEMM
c_opt = matmul_hopper(a, b)

# Compare with PyTorch (cuBLAS)
c_ref = torch.mm(a, b)

# Check correctness
print(f"Max difference: {(c - c_ref).abs().max().item()}")

Benchmarks

Comprehensive Benchmark (Recommended)

Compare all implementations at once:

# Run all benchmarks (HopperGemm, cuBLAS, CUTLASS)
python tools/run_all_benchmarks.py --m 1024 2048 4096 --n 1024 2048 4096 --k 1024 2048 4096 --dtype fp16

Performance Benchmark

Compare HopperGemm against PyTorch/cuBLAS:

python benchmarks/benchmark_gemm.py --m 1024 2048 4096 --n 1024 2048 4096 --k 1024 2048 4096 --dtype fp16

Output example:

Kernel                     Time (ms)     TFLOPS   BW (GB/s)   Efficiency
-----------------------------------------------------------------------
PyTorch/cuBLAS (FP16) ★      0.1234      98.76     1245.67    baseline
HopperGemm Basic (FP16)      0.4567      53.34      654.32       27.0%
HopperGemm Opt (FP16)        0.2345      78.90      987.45       52.6%
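
The TFLOPS column in tables like this conventionally follows from the 2·M·N·K floating-point operations a GEMM performs. A minimal timing sketch using triton.testing.do_bench (this is not the benchmark script itself; shapes and names are illustrative):

import torch
import triton.testing
from hoppergemm import matmul

M = N = K = 4096
a = torch.randn(M, K, dtype=torch.float16, device='cuda')
b = torch.randn(K, N, dtype=torch.float16, device='cuda')

ms = triton.testing.do_bench(lambda: matmul(a, b))  # median runtime in milliseconds
tflops = (2 * M * N * K) / (ms * 1e-3) / 1e12      # 2*M*N*K ops per GEMM
print(f"{ms:.4f} ms  ->  {tflops:.2f} TFLOPS")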

Correctness Verification

Verify numerical accuracy against PyTorch/cuBLAS:

python benchmarks/correctness_test.py --m 128 256 512 1024 --dtype fp16

Output example:

Testing GEMM: M=1024, N=1024, K=1024, dtype=torch.float16
----------------------------------------------------------------------
  HopperGemm Basic: PASSED
    Relative error: 1.23e-06 (tolerance: 1.00e-03)
    Absolute error: 2.45e-05
  HopperGemm Hopper-Opt: PASSED
    Relative error: 3.45e-07 (tolerance: 1.00e-03)
    Absolute error: 1.23e-05

Summary: 2/2 tests passed
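
For reference, one common way to compute the errors reported above (the script's exact formulas may differ):

# Given a kernel output `c` and a reference `c_ref` (e.g. from the Quick Start):
diff = (c - c_ref).float().abs()
abs_err = diff.max().item()
rel_err = abs_err / c_ref.float().abs().max().item()  # one common definition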

Benchmark Reference

For developing Hopper-optimized GEMM kernels, the primary benchmark is:

cuBLAS (via PyTorch)

  • NVIDIA's highly optimized library
  • Target: ~98 TFLOPS for FP16 on H100
  • Use torch.mm() or torch.matmul() as reference

CUTLASS (Optional)

  • NVIDIA's open-source template library
  • Excellent learning resource for GEMM optimization
  • Located in third_party/cutlass as git submodule
  • Hopper example: cutlass/examples/48_hopper_warp_specialized_gemm

Why Triton?

Approach    Pros                                       Cons
------------------------------------------------------------------------------------------
CUDA C++    Maximum control, best performance          Complex; requires deep GPU knowledge
Triton      Python-like syntax, easier optimization    Less control over low-level details
cuBLAS      Best performance out of the box            Closed source; limited flexibility

Triton provides a good balance:

  • Easier to write and understand
  • Can achieve near-cuBLAS performance
  • Great for learning GPU optimization concepts
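
To make the comparison concrete, here is a minimal tiled-GEMM kernel in Triton. It is an illustrative sketch in the spirit of src/hoppergemm/kernels/gemm.py, not the repository's actual implementation, and it omits the Hopper-specific tuning that gemm_hopper.py targets:

import torch
import triton
import triton.language as tl

@triton.jit
def gemm_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                stride_am, stride_ak, stride_bk, stride_bn,
                stride_cm, stride_cn,
                BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # Each program instance computes one BLOCK_M x BLOCK_N tile of C.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)  # accumulate in FP32
    for k in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs, mask=(offs_m[:, None] < M) & (offs_k[None, :] + k < K), other=0.0)
        b = tl.load(b_ptrs, mask=(offs_k[:, None] + k < K) & (offs_n[None, :] < N), other=0.0)
        acc += tl.dot(a, b)  # tl.dot is what maps onto Tensor Cores
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    c_mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    tl.store(c_ptrs, acc.to(tl.float16), mask=c_mask)

def matmul_sketch(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float16)
    BLOCK_M, BLOCK_N, BLOCK_K = 64, 64, 32  # illustrative tile sizes
    grid = (triton.cdiv(M, BLOCK_M), triton.cdiv(N, BLOCK_N))
    gemm_kernel[grid](a, b, c, M, N, K,
                      a.stride(0), a.stride(1), b.stride(0), b.stride(1),
                      c.stride(0), c.stride(1),
                      BLOCK_M=BLOCK_M, BLOCK_N=BLOCK_N, BLOCK_K=BLOCK_K)
    return c

The tl.dot call is what lets the compiler target Tensor Cores; most of the remaining optimization work is about keeping those units fed.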

Development Roadmap

Phase 1: Framework Setup ✅

  • Project structure and Python package
  • Basic Triton GEMM kernel
  • Benchmark framework with PyTorch comparison
  • Correctness verification

Phase 2: Hopper Optimization 🚧

  • Larger block sizes for Tensor Core utilization
  • Multi-stage pipelining (both are illustrated in the sketch after this list)
  • Optimized memory access patterns
  • FP8 precision support
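
In Triton, the first three items above are largely expressed as launch meta-parameters (BLOCK_M/N/K, num_stages, num_warps). A hedged sketch of how @triton.autotune can search over them; the configuration values are illustrative, not the project's actual tuning space:

import triton
import triton.language as tl

# Illustrative Hopper-leaning search space; gemm_hopper.py may use different values.
HOPPER_CONFIGS = [
    triton.Config({'BLOCK_M': 128, 'BLOCK_N': 128, 'BLOCK_K': 64}, num_stages=3, num_warps=8),
    triton.Config({'BLOCK_M': 128, 'BLOCK_N': 256, 'BLOCK_K': 64}, num_stages=4, num_warps=8),
    triton.Config({'BLOCK_M': 64,  'BLOCK_N': 128, 'BLOCK_K': 128}, num_stages=5, num_warps=4),
]

@triton.autotune(configs=HOPPER_CONFIGS, key=['M', 'N', 'K'])
@triton.jit
def gemm_tuned_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                      stride_am, stride_ak, stride_bk, stride_bn,
                      stride_cm, stride_cn,
                      BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # Body identical to the basic kernel sketched earlier; num_stages controls
    # software-pipelining depth and num_warps the per-CTA parallelism.
    ...

Triton re-benchmarks the configs whenever the (M, N, K) key changes and caches the winner, which is also one low-effort route toward the Phase 3 auto-tuning goal.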

Phase 3: Advanced Features ⬜

  • Auto-tuning for different problem sizes
  • Batch GEMM support
  • Mixed precision accumulation
  • Stream-K decomposition

Target Performance

Implementation      Target TFLOPS (FP16)   Notes
--------------------------------------------------------------------------
cuBLAS (PyTorch)    ~98                    NVIDIA reference, highly optimized
CUTLASS Hopper      ~95                    Open source, excellent learning resource
HopperGemm Goal     80+                    80%+ of cuBLAS efficiency

License

MIT License - see LICENSE for details.