A comprehensive CUDA operator learning project that implements operators from simple to complex, each with three implementations: cuTe DSL, Triton, and CUTLASS.
```
Sloth/
├── operators/                     # All operators (organized by difficulty)
│   ├── 01_elementwise/            # Elementwise operations (beginner)
│   │   └── add/
│   │       ├── cute/              # cuTe DSL implementation
│   │       ├── triton/            # Triton implementation
│   │       ├── cutlass/           # CUTLASS implementation
│   │       └── __init__.py        # Unified Python interface
│   ├── 02_reduce/                 # Reduction operations (intermediate)
│   ├── 03_gemm/                   # Matrix multiplication (advanced)
│   ├── 04_conv/                   # Convolution operations (advanced)
│   └── 05_attention/              # Attention mechanisms (advanced)
│
├── tests/                         # pytest test suite
│   ├── correctness/               # Correctness tests
│   ├── performance/               # Performance tests (pytest-benchmark)
│   └── conftest.py
│
├── benchmarks/                    # Benchmark framework
│   ├── configs/
│   │   └── workload.yaml          # Workload configurations
│   ├── runners/                   # Implementation runners
│   ├── compare/                   # Comparison tools
│   ├── third_party/
│   │   └── LeetCUDA/              # Reference implementation (git submodule)
│   └── run_benchmark.py           # CLI entry point
│
├── common/
│   ├── cpp/                       # C++ utilities
│   │   ├── include/utils/
│   │   └── bindings.cpp           # pybind11 bindings
│   └── python/sloth/              # Python package
│
├── scripts/                       # Build and run scripts
│
├── pyproject.toml
├── setup.py
└── CMakeLists.txt
```
- Three implementations per operator: cuTe DSL, Triton, and CUTLASS
- Unified Python interface: Easy to switch between implementations (see the dispatcher sketch below)
- Comprehensive testing: pytest for correctness, pytest-benchmark for performance
- Benchmark comparison: Compare against PyTorch and LeetCUDA with speedup reports
- Workload configuration: YAML-based workload definitions
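
The unified interface is a thin dispatcher in each operator's `__init__.py`. A minimal sketch of what it could look like is shown below; the backend module names and their exported functions are assumptions, only the `impl` values follow the usage example further down.

```python
# operators/01_elementwise/add/__init__.py -- illustrative sketch only;
# the backend submodule names below are assumptions, not the project's actual layout.
import torch


def add(a: torch.Tensor, b: torch.Tensor, impl: str = "torch") -> torch.Tensor:
    """Elementwise add with a selectable backend implementation."""
    if impl == "torch":
        return torch.add(a, b)                    # PyTorch reference path
    if impl == "triton":
        from .triton import add as triton_add     # Triton kernel wrapper
        return triton_add(a, b)
    if impl == "cute":
        from .cute import add as cute_add         # cuTe DSL / C++ extension
        return cute_add(a, b)
    if impl == "cutlass":
        from .cutlass import add as cutlass_add   # CUTLASS C++ extension
        return cutlass_add(a, b)
    raise ValueError(f"Unknown impl: {impl!r}")
```

Importing the backends lazily inside each branch keeps the Triton and PyTorch paths usable even before the C++ extensions have been built.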
- CUDA Toolkit 11.x+ (with a compatible GPU, SM 80+ recommended)
- Python 3.9+
- PyTorch 2.0+
- Triton 2.1+
```bash
# Clone repository with submodules
git clone --recursive https://github.com/your-username/Sloth.git

# Or initialize submodules after cloning
git submodule update --init --recursive
```

```bash
# Install Python dependencies
pip install -e ".[dev]"

# Build C++/CUDA extensions
python setup.py build_ext --inplace
```
```python
import importlib

import torch

# "01_elementwise" starts with a digit, so it cannot appear in a plain
# `import` statement; load the operator package via importlib instead.
add = importlib.import_module("operators.01_elementwise.add").add

a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")

# Use different implementations
result_triton = add(a, b, impl="triton")
result_torch = add(a, b, impl="torch")
result_cute = add(a, b, impl="cute")        # requires C++ build
result_cutlass = add(a, b, impl="cutlass")  # requires C++ build
```

```bash
# Correctness tests
pytest tests/correctness/

# Performance tests
pytest tests/performance/
```
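
A correctness test typically compares every backend against the PyTorch reference. A minimal sketch, assuming the `impl` values from the quick-start example and default tolerances (file name and parametrization are assumptions):

```python
# tests/correctness/test_add.py -- illustrative sketch; parametrization is an assumption
import importlib

import pytest
import torch

add = importlib.import_module("operators.01_elementwise.add").add


@pytest.mark.parametrize("impl", ["triton", "cute", "cutlass"])
@pytest.mark.parametrize("shape", [(1024,), (512, 1024)])
def test_add_matches_torch(impl, shape):
    a = torch.randn(*shape, device="cuda")
    b = torch.randn(*shape, device="cuda")
    # Each backend must match the PyTorch reference within default tolerances
    torch.testing.assert_close(add(a, b, impl=impl), a + b)
```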
```bash
# Benchmark single operator
python benchmarks/run_benchmark.py add

# Benchmark all operators
python benchmarks/run_benchmark.py --all

# Specify implementations
python benchmarks/run_benchmark.py add --impls sloth-triton,torch

# Save report
python benchmarks/run_benchmark.py add --output reports/add.md --format markdown
```

Example report:

```
================================================================================
BENCHMARK REPORT: add
================================================================================
Workload: medium (4096x4096, float32)
----------------------------------------------
| Implementation | Time (ms) | vs Torch |
|----------------|-----------|-------------|
| Sloth-triton | 0.312 | 1.45x |
| Sloth-cute | 0.350 | 1.28x |
| Sloth-cutlass | 0.380 | 1.18x |
| Torch | 0.450 | 1.00x |
----------------------------------------------
SUMMARY (Geometric Mean Speedup vs Torch)
----------------------------------------------
| Implementation | Speedup |
|----------------|-----------|
| Sloth-triton | 1.42x |
| Sloth-cute | 1.34x |
| Sloth-cutlass | 1.28x |
----------------------------------------------
```
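
The summary table reports a geometric mean speedup vs. Torch, presumably aggregated over the configured workloads; the geometric mean is the exponential of the mean of log-speedups, which keeps one outlier workload from dominating the average. A minimal sketch of the computation (the listed speedups are illustrative, not measured):

```python
import math


def geometric_mean(speedups: list[float]) -> float:
    """Geometric mean: exp of the mean of log-speedups."""
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


# Illustrative per-workload speedups of one backend vs. torch
print(geometric_mean([1.45, 1.40, 1.41]))  # ~1.42
```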
1. Create the operator directory: `operators/<level>/<operator_name>/`
2. Create the subdirectories: `cute/`, `triton/`, `cutlass/`
3. Implement each version, with the unified interface in `__init__.py`
4. Add a workload configuration in `benchmarks/configs/workload.yaml` (see the sketch below)
5. Write tests in `tests/correctness/test_<operator>.py`
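
The exact schema of `workload.yaml` is defined by the benchmark runners; a plausible entry for `add` could look like the sketch below (field names are assumptions; the sizes echo the medium workload from the report above).

```yaml
# benchmarks/configs/workload.yaml -- illustrative sketch; field names are assumptions
add:
  dtype: float32
  workloads:
    small:  [1024, 1024]
    medium: [4096, 4096]   # matches the "medium (4096x4096, float32)" report above
    large:  [8192, 8192]
```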
| Level | Operator | Status |
|---|---|---|
| 01_elementwise | add | Done (template) |
| 01_elementwise | mul | Planned |
| 01_elementwise | relu | Planned |
| 02_reduce | sum | Planned |
| 02_reduce | softmax | Planned |
| 02_reduce | layer_norm | Planned |
| 03_gemm | gemm | Planned |
| 03_gemm | gemv | Planned |
| 04_conv | conv2d | Planned |
| 04_conv | conv3d | Planned |
| 05_attention | flash_attention | Planned |
MIT License