HopperGemm

High-performance GEMM (General Matrix Multiply) kernels for NVIDIA Hopper architecture (SM90) using Triton.

Features

  • Triton-based: Written in Triton for easy development and optimization
  • Hopper Optimized: Designed for H100/H800 GPUs with Tensor Core support
  • Multiple Precisions: FP32, FP16, BF16, FP8 (E4M3/E5M2)
  • Benchmark Framework: Compare against PyTorch/cuBLAS
  • Correctness Verification: Automatic validation against reference implementations

Project Structure

HopperGemm/
├── pyproject.toml          # Python package configuration
├── src/
│   └── hoppergemm/
│       ├── __init__.py     # Package entry point
│       ├── kernels/
│       │   ├── __init__.py
│       │   ├── gemm.py         # Basic Triton GEMM kernel
│       │   └── gemm_hopper.py  # Hopper-optimized kernel
│       └── gemm.py         # High-level API (optional)
├── benchmarks/
│   ├── benchmark_gemm.py   # Performance benchmark (vs PyTorch/cuBLAS)
│   └── correctness_test.py # Correctness verification
├── tests/
│   └── test_gemm.py        # Unit tests
├── tools/
│   ├── setup_submodules.sh     # Initialize git submodules
│   ├── build_cutlass.sh        # Build CUTLASS examples
│   ├── benchmark_cublas.py     # cuBLAS benchmark script
│   ├── benchmark_cutlass.py    # CUTLASS benchmark script
│   └── run_all_benchmarks.py   # Comprehensive benchmark comparison
└── third_party/
    └── cutlass/            # CUTLASS submodule (reference)

Prerequisites

  • Python: 3.9+
  • PyTorch: 2.0+ (with CUDA support)
  • Triton: 3.0+
  • GPU: NVIDIA H100, H800, or H200 (SM90)
  • CUDA: 12.0+

Installation

# Clone repository with submodules
git clone --recursive https://github.com/xInference/HopperGemm.git
cd HopperGemm

# Or if already cloned without submodules, initialize them:
git submodule update --init --recursive

# Install package
pip install -e .

# Install dev dependencies
pip install -e ".[dev]"

Building CUTLASS Examples (Optional)

For benchmarking against CUTLASS reference implementations:

# Build CUTLASS examples (requires CUDA 12+ and SM90 GPU)
./tools/build_cutlass.sh

# Or build all examples (takes longer)
./tools/build_cutlass_all.sh

Quick Start

import torch
from hoppergemm import matmul, matmul_hopper

# Create input matrices
a = torch.randn(1024, 1024, dtype=torch.float16, device='cuda')
b = torch.randn(1024, 1024, dtype=torch.float16, device='cuda')

# Basic Triton GEMM
c = matmul(a, b)

# Hopper-optimized GEMM
c_opt = matmul_hopper(a, b)

# Compare with PyTorch (cuBLAS)
c_ref = torch.mm(a, b)

# Check correctness
print(f"Max difference: {(c - c_ref).abs().max().item()}")

Benchmarks

Comprehensive Benchmark (Recommended)

Compare all implementations at once:

# Run all benchmarks (HopperGemm, cuBLAS, CUTLASS)
python tools/run_all_benchmarks.py --m 1024 2048 4096 --n 1024 2048 4096 --k 1024 2048 4096 --dtype fp16

Performance Benchmark

Compare HopperGemm against PyTorch/cuBLAS:

python benchmarks/benchmark_gemm.py --m 1024 2048 4096 --n 1024 2048 4096 --k 1024 2048 4096 --dtype fp16

Output example:

Kernel                     Time (ms)     TFLOPS   BW (GB/s)   Efficiency
-----------------------------------------------------------------------
PyTorch/cuBLAS (FP16) ★      0.1234      98.76     1245.67    baseline
HopperGemm Basic (FP16)      0.4567      53.34      654.32       27.0%
HopperGemm Opt (FP16)        0.2345      78.90      987.45       52.6%
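
The TFLOPS column in tables like this conventionally follows from the 2·M·N·K floating-point operations a GEMM performs. A minimal timing sketch using triton.testing.do_bench (this is not the benchmark script itself; shapes and names are illustrative):

import torch
import triton.testing
from hoppergemm import matmul

M = N = K = 4096
a = torch.randn(M, K, dtype=torch.float16, device='cuda')
b = torch.randn(K, N, dtype=torch.float16, device='cuda')

ms = triton.testing.do_bench(lambda: matmul(a, b))  # median runtime in milliseconds
tflops = (2 * M * N * K) / (ms * 1e-3) / 1e12      # 2*M*N*K ops per GEMM
print(f"{ms:.4f} ms  ->  {tflops:.2f} TFLOPS")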

Correctness Verification

Verify numerical accuracy against PyTorch/cuBLAS:

python benchmarks/correctness_test.py --m 128 256 512 1024 --dtype fp16

Output example:

Testing GEMM: M=1024, N=1024, K=1024, dtype=torch.float16
----------------------------------------------------------------------
  HopperGemm Basic: PASSED
    Relative error: 1.23e-06 (tolerance: 1.00e-03)
    Absolute error: 2.45e-05
  HopperGemm Hopper-Opt: PASSED
    Relative error: 3.45e-07 (tolerance: 1.00e-03)
    Absolute error: 1.23e-05

Summary: 2/2 tests passed
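
For reference, one common way to compute the errors reported above (the script's exact formulas may differ):

# Given a kernel output `c` and a reference `c_ref` (e.g. from the Quick Start):
diff = (c - c_ref).float().abs()
abs_err = diff.max().item()
rel_err = abs_err / c_ref.float().abs().max().item()  # one common definition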

Benchmark Reference

For developing Hopper-optimized GEMM kernels, the primary benchmark is:

cuBLAS (via PyTorch)

  • NVIDIA's highly optimized library
  • Target: ~98 TFLOPS for FP16 on H100
  • Use torch.mm() or torch.matmul() as reference

CUTLASS (Optional)

  • NVIDIA's open-source template library
  • Excellent learning resource for GEMM optimization
  • Located in third_party/cutlass as git submodule
  • Hopper example: cutlass/examples/48_hopper_warp_specialized_gemm

Why Triton?

Approach    Pros                                       Cons
------------------------------------------------------------------------------------------
CUDA C++    Maximum control, best performance          Complex; requires deep GPU knowledge
Triton      Python-like syntax, easier optimization    Less control over low-level details
cuBLAS      Best performance out of the box            Closed source; limited flexibility

Triton provides a good balance:

  • Easier to write and understand
  • Can achieve near-cuBLAS performance
  • Great for learning GPU optimization concepts
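
To make the comparison concrete, here is a minimal tiled-GEMM kernel in Triton. It is an illustrative sketch in the spirit of src/hoppergemm/kernels/gemm.py, not the repository's actual implementation, and it omits the Hopper-specific tuning that gemm_hopper.py targets:

import torch
import triton
import triton.language as tl

@triton.jit
def gemm_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                stride_am, stride_ak, stride_bk, stride_bn,
                stride_cm, stride_cn,
                BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # Each program instance computes one BLOCK_M x BLOCK_N tile of C.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)  # accumulate in FP32
    for k in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs, mask=(offs_m[:, None] < M) & (offs_k[None, :] + k < K), other=0.0)
        b = tl.load(b_ptrs, mask=(offs_k[:, None] + k < K) & (offs_n[None, :] < N), other=0.0)
        acc += tl.dot(a, b)  # tl.dot is what maps onto Tensor Cores
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    c_mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    tl.store(c_ptrs, acc.to(tl.float16), mask=c_mask)

def matmul_sketch(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float16)
    BLOCK_M, BLOCK_N, BLOCK_K = 64, 64, 32  # illustrative tile sizes
    grid = (triton.cdiv(M, BLOCK_M), triton.cdiv(N, BLOCK_N))
    gemm_kernel[grid](a, b, c, M, N, K,
                      a.stride(0), a.stride(1), b.stride(0), b.stride(1),
                      c.stride(0), c.stride(1),
                      BLOCK_M=BLOCK_M, BLOCK_N=BLOCK_N, BLOCK_K=BLOCK_K)
    return c

The tl.dot call is what lets the compiler target Tensor Cores; most of the remaining optimization work is about keeping those units fed.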

Development Roadmap

Phase 1: Framework Setup ✅

  • Project structure and Python package
  • Basic Triton GEMM kernel
  • Benchmark framework with PyTorch comparison
  • Correctness verification

Phase 2: Hopper Optimization 🚧

  • Larger block sizes for Tensor Core utilization
  • Multi-stage pipelining (both are illustrated in the sketch after this list)
  • Optimized memory access patterns
  • FP8 precision support
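
In Triton, the first three items above are largely expressed as launch meta-parameters (BLOCK_M/N/K, num_stages, num_warps). A hedged sketch of how @triton.autotune can search over them; the configuration values are illustrative, not the project's actual tuning space:

import triton
import triton.language as tl

# Illustrative Hopper-leaning search space; gemm_hopper.py may use different values.
HOPPER_CONFIGS = [
    triton.Config({'BLOCK_M': 128, 'BLOCK_N': 128, 'BLOCK_K': 64}, num_stages=3, num_warps=8),
    triton.Config({'BLOCK_M': 128, 'BLOCK_N': 256, 'BLOCK_K': 64}, num_stages=4, num_warps=8),
    triton.Config({'BLOCK_M': 64,  'BLOCK_N': 128, 'BLOCK_K': 128}, num_stages=5, num_warps=4),
]

@triton.autotune(configs=HOPPER_CONFIGS, key=['M', 'N', 'K'])
@triton.jit
def gemm_tuned_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                      stride_am, stride_ak, stride_bk, stride_bn,
                      stride_cm, stride_cn,
                      BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # Body identical to the basic kernel sketched earlier; num_stages controls
    # software-pipelining depth and num_warps the per-CTA parallelism.
    ...

Triton re-benchmarks the configs whenever the (M, N, K) key changes and caches the winner, which is also one low-effort route toward the Phase 3 auto-tuning goal.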

Phase 3: Advanced Features ⬜

  • Auto-tuning for different problem sizes
  • Batch GEMM support
  • Mixed precision accumulation
  • Stream-K decomposition

Target Performance

Implementation      Target TFLOPS (FP16)   Notes
--------------------------------------------------------------------------
cuBLAS (PyTorch)    ~98                    NVIDIA reference, highly optimized
CUTLASS Hopper      ~95                    Open source, excellent learning resource
HopperGemm Goal     80+                    80%+ of cuBLAS efficiency

License

MIT License - see LICENSE for details.