A comprehensive CUDA operator learning project that implements operators from simple to complex, each with three implementations: cuTe DSL, Triton, and CUTLASS.
```
Sloth/
├── operators/                     # All operators (organized by difficulty)
│   ├── 01_elementwise/            # Elementwise operations (beginner)
│   │   └── add/
│   │       ├── cute/              # cuTe DSL implementation
│   │       ├── triton/            # Triton implementation
│   │       ├── cutlass/           # CUTLASS implementation
│   │       └── __init__.py        # Unified Python interface
│   ├── 02_reduce/                 # Reduction operations (intermediate)
│   ├── 03_gemm/                   # Matrix multiplication (advanced)
│   ├── 04_conv/                   # Convolution operations (advanced)
│   └── 05_attention/              # Attention mechanisms (advanced)
│
├── tests/                         # pytest test suite
│   ├── correctness/               # Correctness tests
│   ├── performance/               # Performance tests (pytest-benchmark)
│   └── conftest.py
│
├── benchmarks/                    # Benchmark framework
│   ├── configs/
│   │   └── workload.yaml          # Workload configurations
│   ├── runners/                   # Implementation runners
│   ├── compare/                   # Comparison tools
│   ├── third_party/
│   │   └── LeetCUDA/              # Reference implementation (git submodule)
│   └── run_benchmark.py           # CLI entry point
│
├── common/
│   ├── cpp/                       # C++ utilities
│   │   ├── include/utils/
│   │   └── bindings.cpp           # pybind11 bindings
│   └── python/sloth/              # Python package
│
├── scripts/                       # Build and run scripts
│
├── pyproject.toml
├── setup.py
└── CMakeLists.txt
```
- Three implementations per operator: cuTe DSL, Triton, and CUTLASS
- Unified Python interface: Easy to switch between implementations (see the dispatcher sketch below)
- Comprehensive testing: pytest for correctness, pytest-benchmark for performance
- Benchmark comparison: Compare against PyTorch and LeetCUDA with speedup reports
- Workload configuration: YAML-based workload definitions
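
The unified interface is a thin dispatcher in each operator's `__init__.py`. A minimal sketch of what it could look like is shown below; the backend module names and their exported functions are assumptions, only the `impl` values follow the usage example further down.

```python
# operators/01_elementwise/add/__init__.py -- illustrative sketch only;
# the backend submodule names below are assumptions, not the project's actual layout.
import torch


def add(a: torch.Tensor, b: torch.Tensor, impl: str = "torch") -> torch.Tensor:
    """Elementwise add with a selectable backend implementation."""
    if impl == "torch":
        return torch.add(a, b)                    # PyTorch reference path
    if impl == "triton":
        from .triton import add as triton_add     # Triton kernel wrapper
        return triton_add(a, b)
    if impl == "cute":
        from .cute import add as cute_add         # cuTe DSL / C++ extension
        return cute_add(a, b)
    if impl == "cutlass":
        from .cutlass import add as cutlass_add   # CUTLASS C++ extension
        return cutlass_add(a, b)
    raise ValueError(f"Unknown impl: {impl!r}")
```

Importing the backends lazily inside each branch keeps the Triton and PyTorch paths usable even before the C++ extensions have been built.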
- CUDA Toolkit 11.x+ (with a compatible GPU, SM 80+ recommended)
- Python 3.9+
- PyTorch 2.0+
- Triton 2.1+
```bash
# Clone repository with submodules
git clone --recursive https://github.com/your-username/Sloth.git

# Or initialize submodules after cloning
git submodule update --init --recursive
```

```bash
# Install Python dependencies
pip install -e ".[dev]"

# Build C++/CUDA extensions
python setup.py build_ext --inplace
```
```python
import importlib

import torch

# "01_elementwise" starts with a digit, so it cannot appear in a plain
# `import` statement; load the operator package via importlib instead.
add = importlib.import_module("operators.01_elementwise.add").add

a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")

# Use different implementations
result_triton = add(a, b, impl="triton")
result_torch = add(a, b, impl="torch")
result_cute = add(a, b, impl="cute")        # requires C++ build
result_cutlass = add(a, b, impl="cutlass")  # requires C++ build
```

```bash
# Correctness tests
pytest tests/correctness/

# Performance tests
pytest tests/performance/
```
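
A correctness test typically compares every backend against the PyTorch reference. A minimal sketch, assuming the `impl` values from the quick-start example and default tolerances (file name and parametrization are assumptions):

```python
# tests/correctness/test_add.py -- illustrative sketch; parametrization is an assumption
import importlib

import pytest
import torch

add = importlib.import_module("operators.01_elementwise.add").add


@pytest.mark.parametrize("impl", ["triton", "cute", "cutlass"])
@pytest.mark.parametrize("shape", [(1024,), (512, 1024)])
def test_add_matches_torch(impl, shape):
    a = torch.randn(*shape, device="cuda")
    b = torch.randn(*shape, device="cuda")
    # Each backend must match the PyTorch reference within default tolerances
    torch.testing.assert_close(add(a, b, impl=impl), a + b)
```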
```bash
# Benchmark single operator
python benchmarks/run_benchmark.py add

# Benchmark all operators
python benchmarks/run_benchmark.py --all

# Specify implementations
python benchmarks/run_benchmark.py add --impls sloth-triton,torch

# Save report
python benchmarks/run_benchmark.py add --output reports/add.md --format markdown
```

Example report:

```
================================================================================
BENCHMARK REPORT: add
================================================================================
Workload: medium (4096x4096, float32)
----------------------------------------------
| Implementation | Time (ms) | vs Torch |
|----------------|-----------|-------------|
| Sloth-triton | 0.312 | 1.45x |
| Sloth-cute | 0.350 | 1.28x |
| Sloth-cutlass | 0.380 | 1.18x |
| Torch | 0.450 | 1.00x |
----------------------------------------------
SUMMARY (Geometric Mean Speedup vs Torch)
----------------------------------------------
| Implementation | Speedup |
|----------------|-----------|
| Sloth-triton | 1.42x |
| Sloth-cute | 1.34x |
| Sloth-cutlass | 1.28x |
----------------------------------------------
```
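
The summary table reports a geometric mean speedup vs. Torch, presumably aggregated over the configured workloads; the geometric mean is the exponential of the mean of log-speedups, which keeps one outlier workload from dominating the average. A minimal sketch of the computation (the listed speedups are illustrative, not measured):

```python
import math


def geometric_mean(speedups: list[float]) -> float:
    """Geometric mean: exp of the mean of log-speedups."""
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


# Illustrative per-workload speedups of one backend vs. torch
print(geometric_mean([1.45, 1.40, 1.41]))  # ~1.42
```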
1. Create the operator directory: `operators/<level>/<operator_name>/`
2. Create the subdirectories: `cute/`, `triton/`, `cutlass/`
3. Implement each version, with the unified interface in `__init__.py`
4. Add a workload configuration in `benchmarks/configs/workload.yaml` (see the sketch below)
5. Write tests in `tests/correctness/test_<operator>.py`
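
The exact schema of `workload.yaml` is defined by the benchmark runners; a plausible entry for `add` could look like the sketch below (field names are assumptions; the sizes echo the medium workload from the report above).

```yaml
# benchmarks/configs/workload.yaml -- illustrative sketch; field names are assumptions
add:
  dtype: float32
  workloads:
    small:  [1024, 1024]
    medium: [4096, 4096]   # matches the "medium (4096x4096, float32)" report above
    large:  [8192, 8192]
```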
| Level | Operator | Status |
|---|---|---|
| 01_elementwise | add | Done (template) |
| 01_elementwise | mul | Planned |
| 01_elementwise | relu | Planned |
| 02_reduce | sum | Planned |
| 02_reduce | softmax | Planned |
| 02_reduce | layer_norm | Planned |
| 03_gemm | gemm | Planned |
| 03_gemm | gemv | Planned |
| 04_conv | conv2d | Planned |
| 04_conv | conv3d | Planned |
| 05_attention | flash_attention | Planned |
MIT License