High-performance GEMM (General Matrix Multiply) kernels for NVIDIA Hopper architecture (SM90) using Triton.
- Triton-based: Written in Triton for easy development and optimization
- Hopper Optimized: Designed for H100/H800 GPUs with Tensor Core support
- Multiple Precisions: FP32, FP16, BF16, FP8 (E4M3/E5M2)
- Benchmark Framework: Compare against PyTorch/cuBLAS
- Correctness Verification: Automatic validation against reference implementations
```
HopperGemm/
├── pyproject.toml                 # Python package configuration
├── src/
│   └── hoppergemm/
│       ├── __init__.py            # Package entry point
│       ├── kernels/
│       │   ├── __init__.py
│       │   ├── gemm.py            # Basic Triton GEMM kernel
│       │   └── gemm_hopper.py     # Hopper-optimized kernel
│       └── gemm.py                # High-level API (optional)
├── benchmarks/
│   ├── benchmark_gemm.py          # Performance benchmark (vs PyTorch/cuBLAS)
│   └── correctness_test.py        # Correctness verification
├── tests/
│   └── test_gemm.py               # Unit tests
├── tools/
│   ├── setup_submodules.sh        # Initialize git submodules
│   ├── build_cutlass.sh           # Build CUTLASS examples
│   ├── benchmark_cublas.py        # cuBLAS benchmark script
│   ├── benchmark_cutlass.py       # CUTLASS benchmark script
│   └── run_all_benchmarks.py      # Comprehensive benchmark comparison
└── third_party/
    └── cutlass/                   # CUTLASS submodule (reference)
```
Requirements:
- Python: 3.9+
- PyTorch: 2.0+ (with CUDA support)
- Triton: 3.0+
- GPU: NVIDIA H100, H800, or H200 (SM90)
- CUDA: 12.0+
Installation:

```bash
# Clone repository with submodules
git clone --recursive https://github.com/xInference/HopperGemm.git
cd HopperGemm

# Or if already cloned without submodules, initialize them:
git submodule update --init --recursive

# Install package
pip install -e .

# Install dev dependencies
pip install -e ".[dev]"
```

For benchmarking against CUTLASS reference implementations:
```bash
# Build CUTLASS examples (requires CUDA 12+ and an SM90 GPU)
./tools/build_cutlass.sh

# Or build all examples (takes longer)
./tools/build_cutlass_all.sh
```

Quick start:

```python
import torch
from hoppergemm import matmul, matmul_hopper
# Create input matrices
a = torch.randn(1024, 1024, dtype=torch.float16, device='cuda')
b = torch.randn(1024, 1024, dtype=torch.float16, device='cuda')
# Basic Triton GEMM
c = matmul(a, b)
# Hopper-optimized GEMM
c_opt = matmul_hopper(a, b)
# Compare with PyTorch (cuBLAS)
c_ref = torch.mm(a, b)
# Check correctness
print(f"Max difference: {(c - c_ref).abs().max().item()}")Compare all implementations at once:
Compare all implementations at once:

```bash
# Run all benchmarks (HopperGemm, cuBLAS, CUTLASS)
python tools/run_all_benchmarks.py --m 1024 2048 4096 --n 1024 2048 4096 --k 1024 2048 4096 --dtype fp16
```

Compare HopperGemm against PyTorch/cuBLAS:
```bash
python benchmarks/benchmark_gemm.py --m 1024 2048 4096 --n 1024 2048 4096 --k 1024 2048 4096 --dtype fp16
```

Output example:
```
Kernel                      Time (ms)   TFLOPS   BW (GB/s)   Efficiency
-----------------------------------------------------------------------
PyTorch/cuBLAS (FP16) ★        0.1234    98.76     1245.67     baseline
HopperGemm Basic (FP16)        0.4567    53.34      654.32        27.0%
HopperGemm Opt (FP16)          0.2345    78.90      987.45        52.6%
```
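The TFLOPS column follows the standard GEMM flop count of 2·M·N·K (one multiply and one add per output contribution). A minimal sketch of the arithmetic; the function name is illustrative, not part of the benchmark script:

```python
def gemm_tflops(m: int, n: int, k: int, time_ms: float) -> float:
    # 2*M*N*K floating-point operations per GEMM
    flops = 2 * m * n * k
    return flops / (time_ms * 1e-3) / 1e12

# A 1024x1024x1024 GEMM in 0.1234 ms works out to ~17.4 TFLOPS
print(f"{gemm_tflops(1024, 1024, 1024, 0.1234):.1f}")
```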
Verify numerical accuracy against PyTorch/cuBLAS:
```bash
python benchmarks/correctness_test.py --m 128 256 512 1024 --dtype fp16
```

Output example:
```
Testing GEMM: M=1024, N=1024, K=1024, dtype=torch.float16
----------------------------------------------------------------------
HopperGemm Basic: PASSED
  Relative error: 1.23e-06 (tolerance: 1.00e-03)
  Absolute error: 2.45e-05
HopperGemm Hopper-Opt: PASSED
  Relative error: 3.45e-07 (tolerance: 1.00e-03)
  Absolute error: 1.23e-05

Summary: 2/2 tests passed
```
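The reported errors are presumably computed along these lines; the exact norms used by correctness_test.py are an assumption, and the helper below is illustrative:

```python
import torch

def gemm_errors(c: torch.Tensor, c_ref: torch.Tensor) -> tuple[float, float]:
    # Absolute error: largest elementwise deviation from the reference
    abs_err = (c - c_ref).abs().max().item()
    # Relative error: deviation normalized by the reference magnitude
    rel_err = ((c - c_ref).norm() / c_ref.norm()).item()
    return rel_err, abs_err
```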
For developing Hopper-optimized GEMM kernels, the primary benchmark is cuBLAS (via PyTorch):
- NVIDIA's highly optimized library
- Target: ~98 TFLOPS for FP16 on H100
- Use `torch.mm()` or `torch.matmul()` as the reference
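To get a cuBLAS baseline number directly, `torch.mm()` can be timed with CUDA events. A minimal sketch; the warmup and iteration counts are illustrative:

```python
import torch

def time_torch_mm(m: int = 4096, n: int = 4096, k: int = 4096, iters: int = 50) -> float:
    a = torch.randn(m, k, dtype=torch.float16, device="cuda")
    b = torch.randn(k, n, dtype=torch.float16, device="cuda")
    # Warm up so one-time CUDA/cuBLAS setup doesn't skew the measurement
    for _ in range(10):
        torch.mm(a, b)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        torch.mm(a, b)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per GEMM
```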
A secondary reference is CUTLASS:
- NVIDIA's open-source template library
- Excellent learning resource for GEMM optimization
- Located in `third_party/cutlass` as a git submodule
- Hopper example: `cutlass/examples/48_hopper_warp_specialized_gemm`
| Approach | Pros | Cons |
|---|---|---|
| CUDA C++ | Maximum control, best performance | Complex, requires deep GPU knowledge |
| Triton | Python-like syntax, easier optimization | Less control over low-level details |
| cuBLAS | Best performance out-of-box | Closed-source, limited flexibility |
Triton provides a good balance (see the kernel sketch after this list):
- Easier to write and understand
- Can achieve near-cuBLAS performance
- Great for learning GPU optimization concepts
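To make this concrete, here is a minimal Triton GEMM in the spirit of the official Triton matmul tutorial. This is a sketch with illustrative block sizes, not the project's `gemm.py` kernel:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def gemm_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                stride_am, stride_ak, stride_bk, stride_bn,
                stride_cm, stride_cn,
                BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # Each program instance computes one BLOCK_M x BLOCK_N tile of C
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    # Accumulate in FP32 for numerical stability, even with FP16 inputs
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        a_mask = (offs_m[:, None] < M) & (offs_k[None, :] + k < K)
        b_mask = (offs_k[:, None] + k < K) & (offs_n[None, :] < N)
        a = tl.load(a_ptrs, mask=a_mask, other=0.0)
        b = tl.load(b_ptrs, mask=b_mask, other=0.0)
        acc += tl.dot(a, b)  # lowers to Tensor Core MMA instructions
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    c_mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    tl.store(c_ptrs, acc.to(tl.float16), mask=c_mask)

def triton_matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), dtype=torch.float16, device=a.device)
    grid = (triton.cdiv(M, 64), triton.cdiv(N, 64))
    gemm_kernel[grid](a, b, c, M, N, K,
                      a.stride(0), a.stride(1), b.stride(0), b.stride(1),
                      c.stride(0), c.stride(1),
                      BLOCK_M=64, BLOCK_N=64, BLOCK_K=32)
    return c
```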
Implemented:
- Project structure and Python package
- Basic Triton GEMM kernel
- Benchmark framework with PyTorch comparison
- Correctness verification
Planned:
- Larger block sizes for Tensor Core utilization
- Multi-stage pipelining
- Optimized memory access patterns
- FP8 precision support
- Auto-tuning for different problem sizes (sketched after this list)
- Batch GEMM support
- Mixed precision accumulation
- Stream-K decomposition
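Several of these items map onto knobs Triton already exposes; for example, auto-tuning and multi-stage pipelining can be requested through Triton's autotuner. An illustrative setup follows; the configs are assumptions, not the project's tuned values:

```python
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"BLOCK_M": 128, "BLOCK_N": 128, "BLOCK_K": 64},
                      num_stages=4, num_warps=8),
        triton.Config({"BLOCK_M": 64, "BLOCK_N": 128, "BLOCK_K": 32},
                      num_stages=3, num_warps=4),
    ],
    key=["M", "N", "K"],  # re-tune (and cache) per problem shape
)
@triton.jit
def gemm_kernel_tuned(a_ptr, b_ptr, c_ptr, M, N, K,
                      stride_am, stride_ak, stride_bk, stride_bn,
                      stride_cm, stride_cn,
                      BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                      BLOCK_K: tl.constexpr):
    # Body as in the basic kernel sketch above; the autotuner supplies
    # BLOCK_* values and picks num_stages/num_warps at launch time.
    pass
```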
| Implementation | Target TFLOPS (FP16) | Notes |
|---|---|---|
| cuBLAS (PyTorch) | ~98 | NVIDIA reference, highly optimized |
| CUTLASS Hopper | ~95 | Open-source, excellent learning |
| HopperGemm Goal | 80+ | 80%+ cuBLAS efficiency |
- Triton Documentation - Official Triton docs
- Triton GEMM Tutorial - Great starting point
- NVIDIA CUTLASS - CUDA Templates for Linear Algebra
- Hopper GEMM Example - CUTLASS Hopper kernel
- cuBLAS Documentation - NVIDIA cuBLAS library
MIT License - see LICENSE for details.