CUDA and Triton implementations of a CNN with kernel fusion techniques for performance optimization.
This project implements a convolutional neural network (CNN) for MNIST digit classification using CUDA and Triton. The main focus is comparing kernel fusion techniques that combine activation and normalization operations to reduce memory traffic and improve performance. Four modes are implemented (a minimal sketch of the fused pattern follows the list):
- Mode 0 (Baseline): ReLU + LayerNorm => separate kernel launches
- Mode 1: GELU + LayerNorm => fused into single kernel
- Mode 2: EvoNorm-B0 => inherently fused activation-normalization from NeurIPS 2020
- Mode 3: Swish + LayerNorm => fused into single kernel
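A minimal sketch of the Mode 1 fused pattern is shown below. The kernel name, launch layout, and per-sample reduction axis are illustrative assumptions, not the repo's exact fused_kernels.cu code: one thread block handles one sample, applies the tanh-approximation GELU, reduces mean and variance in shared memory, and normalizes, all in a single launch, so the separate LayerNorm pass over the activated tensor is eliminated.

// Illustrative fused GELU + LayerNorm kernel (Mode 1 pattern).
// One block per sample; n = per-sample feature count (C*H*W).
// Assumes blockDim.x is a power of two and 2*blockDim.x floats of dynamic shared memory.
__global__ void fused_gelu_layernorm(const float* __restrict__ in,
                                     float* __restrict__ out,
                                     int n, float eps) {
    extern __shared__ float sdata[];            // scratch for the reductions
    const float* x = in  + blockIdx.x * n;
    float*       y = out + blockIdx.x * n;

    // 1) GELU (tanh approximation) + per-thread partial sums
    float sum = 0.f, sumsq = 0.f;
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        float v = x[i];
        float g = 0.5f * v * (1.f + tanhf(0.7978845608f * (v + 0.044715f * v * v * v)));
        y[i] = g;                               // written once, reused below
        sum += g;  sumsq += g * g;
    }

    // 2) Block-wide reduction for mean and variance
    sdata[threadIdx.x] = sum;
    sdata[blockDim.x + threadIdx.x] = sumsq;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) {
            sdata[threadIdx.x]              += sdata[threadIdx.x + s];
            sdata[blockDim.x + threadIdx.x] += sdata[blockDim.x + threadIdx.x + s];
        }
        __syncthreads();
    }
    float mean = sdata[0] / n;
    float var  = sdata[blockDim.x] / n - mean * mean;
    float inv  = rsqrtf(var + eps);

    // 3) Normalize in place -- no second kernel launch, no extra pass over global memory
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        y[i] = (y[i] - mean) * inv;
}

// Example launch for a batch of samples with feat = C*H*W features each:
// fused_gelu_layernorm<<<batch, 256, 2 * 256 * sizeof(float)>>>(d_in, d_out, feat, 1e-5f);

The baseline (Mode 0) would instead launch a ReLU kernel and a LayerNorm kernel back to back, with the activated tensor written to and read from global memory in between.

The repository is organized as follows: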
.
├── CUDA/
│ ├── CNN/ # Main CUDA implementation
│ │ ├── activation_kernels.cu # GELU, Swish activation functions
│ │ ├── basic_kernels.cu # Conv2D, ReLU, MaxPool, FC layer
│ │ ├── normalization_kernels.cu # LayerNorm, EvoNorm-B0
│ │ ├── fused_kernels.cu # Fused activation+normalization kernels
│ │ ├── training_kernels.cu # Backward pass, SGD optimizer
│ │ ├── model.cu # Model allocation and initialization
│ │ ├── forward_pass.cu # Forward pass for all 4 modes
│ │ ├── training_ops.cu # Training step implementation
│ │ ├── main_combined.cu # Main training and profiling program
│ │ ├── mnist_loader.cu # MNIST dataset loader
│ │ └── cuda_utils.cu # CUDA utilities and error checking
│ │
│ └── Gearing up/ # Practice/experimental code
│
├── Triton/
│ ├── CNN/ # Main Triton implementation
│ │ ├── activation_kernels.py # Triton activation kernels
│ │ ├── basic_layers.py # Triton basic layer kernels
│ │ ├── normalization_kernels.py # Triton normalization kernels
│ │ ├── fused_kernels.py # Triton fused kernels
│ │ ├── model.py # Model structure
│ │ ├── forward_pass.py # Forward pass implementation
│ │ ├── training_ops.py # Training operations
│ │ ├── main_combined.py # Main program
│ │ ├── mnist_loader.py # MNIST loader
│ │ └── triton_utils.py # Triton utilities
│ │
│ └── Gearing Up/ # Practice Triton code
│
├── Pytorch + CUDA/ # PyTorch reference implementation
│ ├── cnn_model.py # PyTorch CNN model
│ └── script.py # Training script
│
└── data/
└── MNIST/raw/ # MNIST dataset files
The CUDA implementation is split into modular components:
- Basic Kernels: Convolution, ReLU, MaxPooling, Fully Connected layers
- Activation Kernels: GELU (tanh approximation), Swish (x * sigmoid(x)); see the device-function sketch after this list
- Normalization Kernels: LayerNorm, EvoNorm-B0
- Fused Kernels: Single-kernel implementations combining activation + normalization
- Training Kernels: Backpropagation, gradient computation, SGD weight updates
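The activation math itself is simple enough to sketch. The device-function and kernel names below are assumptions, but the formulas match the descriptions above:

// Illustrative activation device functions (names are assumptions; formulas
// match the README: GELU tanh approximation and Swish = x * sigmoid(x)).
__device__ __forceinline__ float gelu_tanh(float x) {
    return 0.5f * x * (1.f + tanhf(0.7978845608f * (x + 0.044715f * x * x * x)));
}

__device__ __forceinline__ float swish(float x) {
    return x / (1.f + expf(-x));                // x * sigmoid(x)
}

// Elementwise launcher of the kind the unfused modes would use (one thread per element).
__global__ void swish_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = swish(in[i]);
}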
The Triton implementation mirrors the CUDA structure using Triton's Python-based kernel programming model.
Input (28x28x1)
↓
Conv2D (3x3, 16 filters)
↓
Activation + Normalization (varies by mode)
↓
MaxPool (2x2)
↓
Fully Connected (10 classes)
↓
Softmax
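A host-side sketch of how the activation + normalization stage might be dispatched per mode. All kernel names and signatures below are illustrative prototypes, not forward_pass.cu's actual API:

// Illustrative kernel prototypes -- the real signatures live in the repo's .cu files.
__global__ void relu_kernel(const float*, float*, int);
__global__ void layernorm_kernel(const float*, float*, int, float);
__global__ void fused_gelu_layernorm(const float*, float*, int, float);
__global__ void evonorm_b0_kernel(const float*, float*, int, float);
__global__ void fused_swish_layernorm(const float*, float*, int, float);

// Dispatch the activation + normalization stage for one of the four modes.
// grid = one block per sample, feat = features per sample (C*H*W).
void activation_norm_stage(int mode, const float* conv_out, float* act_out,
                           int batch, int feat) {
    dim3 grid(batch), block(256);
    size_t smem = 2 * 256 * sizeof(float);
    switch (mode) {
    case 0:  // baseline: two separate launches (ReLU, then LayerNorm)
        relu_kernel<<<grid, block>>>(conv_out, act_out, feat);
        layernorm_kernel<<<grid, block, smem>>>(act_out, act_out, feat, 1e-5f);
        break;
    case 1:  // fused GELU + LayerNorm
        fused_gelu_layernorm<<<grid, block, smem>>>(conv_out, act_out, feat, 1e-5f);
        break;
    case 2:  // EvoNorm-B0: activation and normalization are already one op
        evonorm_b0_kernel<<<grid, block, smem>>>(conv_out, act_out, feat, 1e-5f);
        break;
    case 3:  // fused Swish + LayerNorm
        fused_swish_layernorm<<<grid, block, smem>>>(conv_out, act_out, feat, 1e-5f);
        break;
    }
}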
The implementations measure the following (a minimal CUDA-event timing sketch follows the list):
- Kernel execution time (ms)
- Memory throughput (GB/s)
- Training time per epoch
- Speedup vs baseline
- Test accuracy
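Kernel time and throughput can be collected with CUDA events, roughly like this. This is a minimal sketch; main_combined.cu and the profiler numbers may be gathered differently:

#include <cstdio>
#include <cuda_runtime.h>

// Time a launch with CUDA events and convert the bytes it moves into GB/s.
// The caller supplies bytes_moved (reads + writes of the kernel under test).
float time_kernel_ms(void (*launch)(), double bytes_moved) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    launch();                                    // kernel launch(es) under test
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("time: %.3f ms   throughput: %.1f GB/s\n",
           ms, bytes_moved / (ms * 1e-3) / 1e9);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}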
Testing configurations:
- Batch sizes: 32, 64, 128, 256
- Channel counts: 8, 16, 32
- Image size: 28x28 (MNIST)
Build and profile the CUDA version:

nvcc -o cnn_training CUDA/CNN/main_combined.cu -O3 -arch=sm_75
nvprof ./cnn_training

Run the Triton version:

python Triton/CNN/main_combined.py

Training achieves 95-96% accuracy on the MNIST test set across all fusion modes.
Performance varies by configuration:
- Fused kernels reduce memory traffic by eliminating intermediate storage (see the back-of-envelope example after this list)
- Speedup depends on batch size and tensor dimensions
- Complex activations (GELU, Swish) improve accuracy but add per-element compute cost
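A back-of-envelope example of the first point (float32 activations assumed; these are not measured numbers): at batch size 256 with 16 channels of 28x28 activations, the intermediate tensor is 256 * 16 * 28 * 28 * 4 B ≈ 12.8 MB. The unfused path writes that tensor in the activation kernel and reads it back in the LayerNorm kernel, so fusing the two removes roughly 25.7 MB of global-memory traffic per layer per batch.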
MNIST handwritten digit dataset (a minimal raw-file loader sketch follows the list):
- Training: 60,000 images
- Testing: 10,000 images
- Image size: 28x28 grayscale
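The raw files in data/MNIST/raw/ use the IDX format (big-endian 32-bit header fields followed by pixel bytes). A minimal host-side loader sketch, with function names and pixel scaling chosen for illustration rather than copied from mnist_loader.cu:

#include <cstdio>
#include <cstdlib>
#include <cstdint>

// IDX headers store 32-bit fields in big-endian order.
static uint32_t read_be32(FILE* f) {
    unsigned char b[4];
    fread(b, 1, 4, f);
    return (uint32_t)b[0] << 24 | b[1] << 16 | b[2] << 8 | b[3];
}

// Load the raw MNIST image file into a float buffer scaled to [0, 1].
float* load_mnist_images(const char* path, int* count) {
    FILE* f = fopen(path, "rb");
    if (!f) return nullptr;
    uint32_t magic = read_be32(f);               // 2051 for image files
    uint32_t n = read_be32(f), rows = read_be32(f), cols = read_be32(f);
    if (magic != 2051) { fclose(f); return nullptr; }

    size_t total = (size_t)n * rows * cols;
    unsigned char* raw = (unsigned char*)malloc(total);
    float* out = (float*)malloc(total * sizeof(float));
    fread(raw, 1, total, f);
    fclose(f);

    for (size_t i = 0; i < total; i++)
        out[i] = raw[i] / 255.0f;                // scale pixels to [0, 1]
    free(raw);
    *count = (int)n;
    return out;
}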
- CUDA Toolkit 11.0+
- Python 3.8+
- Triton (for Triton implementation)
- PyTorch (for reference implementation)




