SonicMoE is a simple but blazing-fast Mixture-of-Experts (MoE) implementation optimized for NVIDIA Hopper architecture GPUs. It leverages CuTeDSL and Triton to deliver state-of-the-art performance through IO-aware optimizations. The two figures below give an overview of activation memory usage and training throughput.
Requirements:

- NVIDIA Hopper GPUs (H100, H200, etc.)
- CUDA 12.9+
- Python 3.12+
- PyTorch 2.7+
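A quick way to sanity-check an environment against this list (a minimal sketch using standard PyTorch queries; the thresholds mirror the list above and are not part of SonicMoE):

```python
import sys
import torch

assert torch.cuda.is_available(), "CUDA GPU required"

# Hopper GPUs (H100/H200) report compute capability 9.x
major, minor = torch.cuda.get_device_capability()
assert major >= 9, f"Hopper (SM90+) GPU required, found SM{major}{minor}"

assert sys.version_info >= (3, 12), "Python 3.12+ required"
print("PyTorch:", torch.__version__, "| CUDA:", torch.version.cuda)
```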
Install from source:

```bash
# Clone the repository
git clone https://github.com/Dao-AILab/sonic-moe.git
cd sonic-moe

# Install dependencies
pip install -r requirements.txt

# Install SonicMoE
pip install -e .
```
Quick start:

```python
import torch
from sonicmoe import MoE, KernelBackendMoE
from sonicmoe.enums import ActivationType
# Create MoE layer
moe = MoE(
    num_experts=128,                            # Number of experts
    num_experts_per_tok=8,                      # Top-k experts per token
    hidden_size=4096,                           # Hidden dimension
    intermediate_size=1536,                     # Expert intermediate size
    activation_function=ActivationType.SWIGLU,  # SwiGLU activation
    add_bias=False,                             # Add bias to linear layers
    std=0.02,                                   # Weight initialization std
).to(device="cuda", dtype=torch.bfloat16)
# Forward pass
x = torch.randn(32768, 4096, device="cuda", dtype=torch.bfloat16)
output, aux_loss = moe(x, kernel_backend_moe=KernelBackendMoE.sonicmoe)
```
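The layer returns the routing auxiliary loss alongside the output. A typical training step folds it into the total loss with a small coefficient; the sketch below is illustrative (the MSE objective and the 0.01 coefficient are assumptions, not part of SonicMoE):

```python
# Hypothetical training step reusing `moe` and `x` from the quick start
target = torch.randn_like(x)
output, aux_loss = moe(x, kernel_backend_moe=KernelBackendMoE.sonicmoe)
loss = torch.nn.functional.mse_loss(output.float(), target.float()) + 0.01 * aux_loss
loss.backward()
```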
Run the test suite to verify correctness:

```bash
make test
```

Run the benchmarks:

- SonicMoE with TC top-K choice routing (SwiGLU activation):
  ```bash
  python benchmarks/moe-cute.py --thiek 32768,4096,1024,128,8 --activation swiglu
  ```

- SonicMoE with token rounding routing (SwiGLU activation):
  ```bash
  python benchmarks/moe-token-rounding.py --routing nr --thiekq 16384,4096,1024,256,8,128
  ```
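For quick one-off measurements without the benchmark scripts, forward latency can be timed with CUDA events (a minimal sketch reusing the quick-start layer; the warmup and iteration counts are arbitrary):

```python
# Warm up, then time the SonicMoE forward pass with CUDA events
for _ in range(5):
    moe(x, kernel_backend_moe=KernelBackendMoE.sonicmoe)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
torch.cuda.synchronize()
start.record()
for _ in range(20):
    moe(x, kernel_backend_moe=KernelBackendMoE.sonicmoe)
end.record()
torch.cuda.synchronize()
print(f"avg forward: {start.elapsed_time(end) / 20:.3f} ms")
```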
We welcome contributions! Please feel free to submit issues, feature requests, or pull requests.

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
If you use SonicMoE in your research, please cite:
```bibtex
@misc{guo2025sonicmoeacceleratingmoeio,
    title={SonicMoE: Accelerating MoE with IO and Tile-aware Optimizations},
    author={Wentao Guo and Mayank Mishra and Xinle Cheng and Ion Stoica and Tri Dao},
    year={2025},
    eprint={2512.14080},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
    url={https://arxiv.org/abs/2512.14080},
}
```
