The idea is to automatically detect fusion opportunities in deep learning models, generate Triton kernels for each operation, and fuse and auto-tune them to reduce memory traffic and improve GPU performance, similar to torch.compile.
This is done by the following steps:
- Extract the data dependency graph (via MaseGraph)
- Identify opportunities for kernel fusion
- Choose a tiling strategy for each fusion candidate
- Automatically generate the fused kernel Triton code
- Autotune the generated kernel configuration using the tiling strategy
- Reinsert the fused kernels, replacing the original graph nodes
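The "identify opportunities for kernel fusion" step can be illustrated with a toy sketch: walk a data dependency graph and group chains of consecutive element-wise ops, the classic fusion candidates. The graph format, op names, and `find_fusion_chains` function below are illustrative only; the real pipeline operates on a MaseGraph.

```python
# Toy sketch of fusion-chain detection. The op set, graph encoding, and
# function name are illustrative -- the real pipeline works on a MaseGraph.
ELEMENTWISE = {"relu", "add", "mul", "gelu"}

def find_fusion_chains(nodes, edges):
    """nodes: {name: op} in topological order; edges: list of (src, dst).
    Returns maximal chains of consecutive element-wise ops where each link
    has a single consumer, so fusing never duplicates work."""
    consumers = {}
    for src, dst in edges:
        consumers.setdefault(src, []).append(dst)
    chains, chain = [], []
    for name in nodes:  # assume insertion order is topological
        fusible = nodes[name] in ELEMENTWISE
        links_to_prev = (chain
                         and name in consumers.get(chain[-1], [])
                         and len(consumers.get(chain[-1], [])) == 1)
        if fusible and links_to_prev:
            chain.append(name)
        else:
            if len(chain) > 1:
                chains.append(chain)
            chain = [name] if fusible else []
    if len(chain) > 1:
        chains.append(chain)
    return chains

# matmul -> add -> relu -> matmul : only (add, relu) forms a fusible chain
nodes = {"mm1": "matmul", "b": "add", "act": "relu", "mm2": "matmul"}
edges = [("mm1", "b"), ("b", "act"), ("act", "mm2")]
print(find_fusion_chains(nodes, edges))  # [['b', 'act']]
```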
We target optimizations for Scientific Machine Learning models obtained from the Neural-Solver-Library. More information on the decision to work with these models can be found here
We have aimed to keep each stage modular: every module contains its own documentation, unit tests, and benchmarking scripts.
The repo is structured as follows:
- autofuser :: contains the code for the AutoFusion pipeline. The main entry point is autofuse.py, which treats the other modules as packages:
  - neuralset :: Neural-Solver-Library model, data loading, and training utils
  - graph :: utils for graph preparation, finding fusion chains, and graph rewriting
  - tiling :: selects backend-specific tiling strategies and autotune search spaces
  - fuser :: generates the fused Triton code from a fusion spec and tiling strategy
  - autotune :: Triton kernel autotuning for a given tiling strategy
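The way these modules hand off to one another can be sketched as a simple driver. Every function below is a stand-in stub with an illustrative name; the real package APIs differ.

```python
# Illustrative sketch of how autofuse.py could chain the packages together.
# All functions are hypothetical stubs, not the real module APIs.
def extract_graph(model):            # graph: build the dependency graph
    return {"nodes": model["ops"]}

def find_chains(graph):              # graph: locate fusion candidates
    return [graph["nodes"]]

def pick_tiling(chain):              # tiling: strategy + autotune search space
    return {"BLOCK": [64, 128, 256]}

def generate_kernel(chain, tiling):  # fuser: emit fused Triton source
    return f"fused_{'_'.join(chain)}"

def autotune(kernel, tiling):        # autotune: pick the best config
    return kernel, max(tiling["BLOCK"])

def autofuse(model):
    graph = extract_graph(model)
    # In the real pipeline, the tuned kernels replace the original graph nodes.
    return [autotune(generate_kernel(c, pick_tiling(c)), pick_tiling(c))
            for c in find_chains(graph)]

print(autofuse({"ops": ["add", "relu"]}))  # [('fused_add_relu', 256)]
```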
- experiments :: contains scripts to automatically run the fusion pipeline for each model on a given dataset. The model configuration is the reported best configuration from each paper.
As mentioned, each module has its own unit test suite, but additionally autofuse/test/ contains tests for the general pipeline.
Unit tests use pytest and can be run with `pytest PATH/TO/TEST/FOLDER/`
All dependencies can be installed via setup_env.sh.
If running on Google Colab, please open this notebook, which contains scripts to clone the repository, set up the environment, and load all the data.
Data was obtained from the PDEBench [NeurIPS 2022 Track Datasets and Benchmarks] for benchmarking autoregressive tasks. We have tested our models, and developed scripts, for the following data:
- airfoil, install and move this data to data/airfoil
- Navier-Stokes, install NavierStokes_V1e-5_N1200_T20 and move this data to data/ns
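The expected on-disk layout can be sketched as follows; the download steps are elided, and the moved filename shown is the Navier-Stokes dataset named above.

```shell
# Create the data layout the experiment scripts expect.
mkdir -p data/airfoil data/ns
# After downloading from PDEBench, move the files into place, e.g.:
#   mv NavierStokes_V1e-5_N1200_T20 data/ns/
ls data
```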
To test autofusion for a given model on a given dataset, run

```shell
# For example, for FNO model on airfoil dataset
bash experiments/scripts/airfoil/FNO.sh
```

and to test all models on a given dataset, run

```shell
# For example, run all models on airfoil dataset
bash experiments/scripts/run_all_experiments.sh scripts/airfoil
```