
SonicMoE: Accelerating MoE with IO and Tile-aware Optimizations

arXiv: https://arxiv.org/abs/2512.14080

SonicMoE is a simple but blazing-fast Mixture-of-Experts (MoE) implementation optimized for NVIDIA Hopper-architecture GPUs. It leverages CuTeDSL and Triton to deliver state-of-the-art performance through IO- and tile-aware optimizations. The two figures below give an overview of activation memory usage and training throughput.

[Figures: activation memory usage and training throughput comparisons]

📦 Installation

Prerequisites

  • NVIDIA Hopper GPUs (H100, H200, etc.)
  • CUDA 12.9+
  • Python 3.12+
  • PyTorch 2.7+
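
The snippet below is a minimal, hypothetical environment check (it is not part of SonicMoE); it only verifies the GPU architecture and library availability listed above.

# Hypothetical environment check -- not shipped with SonicMoE.
import sys
import torch

assert sys.version_info >= (3, 12), "Python 3.12+ required"
assert torch.cuda.is_available(), "A CUDA-capable GPU is required"
major, _ = torch.cuda.get_device_capability()
assert major == 9, "A Hopper GPU (compute capability 9.x, e.g. H100/H200) is required"
print("PyTorch:", torch.__version__, "(needs 2.7+)")
print("CUDA runtime:", torch.version.cuda, "(needs 12.9+)")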

Install from Source

# Clone the repository
git clone https://github.com/Dao-AILab/sonic-moe.git
cd sonic-moe

# Install dependencies
pip install -r requirements.txt

# Install SonicMoE
pip install -e .
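
After installation, a quick import check (using only the names from the Quick Start below) can confirm the editable install is visible to Python; this is an optional sanity step, not an official verification script.

# Optional, hypothetical import check after `pip install -e .`
from sonicmoe import MoE, KernelBackendMoE
from sonicmoe.enums import ActivationType

print("SonicMoE import OK:", MoE.__name__, KernelBackendMoE.__name__, ActivationType.__name__)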

🎯 Quick Start

Basic Usage

import torch
from sonicmoe import MoE, KernelBackendMoE
from sonicmoe.enums import ActivationType

# Create MoE layer
moe = MoE(
    num_experts=128,                           # Number of experts
    num_experts_per_tok=8,                     # Top-k experts per token
    hidden_size=4096,                          # Hidden dimension
    intermediate_size=1536,                    # Expert intermediate size
    activation_function=ActivationType.SWIGLU, # SwiGLU activation
    add_bias=False,                            # Whether to add bias to linear layers
    std=0.02,                                  # Weight initialization std
).to(device="cuda", dtype=torch.bfloat16)

# Forward pass
x = torch.randn(32768, 4096, device="cuda", dtype=torch.bfloat16)
output, aux_loss = moe(x, kernel_backend_moe=KernelBackendMoE.sonicmoe)
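
When training, the auxiliary load-balancing loss returned by the forward pass is typically added to the task loss. The sketch below is a generic PyTorch pattern, not a prescribed SonicMoE recipe; the MSE target, AdamW optimizer, and 0.01 aux-loss coefficient are placeholders.

# Hypothetical training step: fold the auxiliary loss into the total loss.
optimizer = torch.optim.AdamW(moe.parameters(), lr=1e-4)   # placeholder optimizer/lr

target = torch.randn_like(x)                               # dummy target for illustration
output, aux_loss = moe(x, kernel_backend_moe=KernelBackendMoE.sonicmoe)
loss = torch.nn.functional.mse_loss(output.float(), target.float())
if aux_loss is not None:                                   # weight is a tunable hyperparameter
    loss = loss + 0.01 * aux_loss
loss.backward()
optimizer.step()
optimizer.zero_grad()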

🧪 Testing

Run the test suite to verify correctness:

make test
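
For a quick standalone smoke test outside the Makefile, the hypothetical script below runs a forward and backward pass through the Quick Start configuration and checks only shapes and finiteness; it is not a substitute for the full test suite.

# Hypothetical smoke test: forward/backward through the Quick Start configuration.
import torch
from sonicmoe import MoE, KernelBackendMoE
from sonicmoe.enums import ActivationType

moe = MoE(
    num_experts=128, num_experts_per_tok=8,
    hidden_size=4096, intermediate_size=1536,
    activation_function=ActivationType.SWIGLU, add_bias=False, std=0.02,
).to(device="cuda", dtype=torch.bfloat16)

x = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16, requires_grad=True)
out, aux_loss = moe(x, kernel_backend_moe=KernelBackendMoE.sonicmoe)
assert out.shape == x.shape and torch.isfinite(out.float()).all()
out.sum().backward()                          # gradients should reach the input and experts
assert x.grad is not None
print("smoke test passed")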

Example Usage

  • SonicMoE with TC top-K choice routing (SwiGLU activation):
    python benchmarks/moe-cute.py --thiek 32768,4096,1024,128,8 --activation swiglu
  • SonicMoE with token rounding routing (SwiGLU activation):
    python benchmarks/moe-token-rounding.py --routing nr --thiekq 16384,4096,1024,256,8,128

🤝 Contributing

We welcome contributions! Please feel free to submit issues, feature requests, or pull requests.

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

📚 Citation

If you use SonicMoE in your research, please cite:

@misc{guo2025sonicmoeacceleratingmoeio,
      title={SonicMoE: Accelerating MoE with IO and Tile-aware Optimizations}, 
      author={Wentao Guo and Mayank Mishra and Xinle Cheng and Ion Stoica and Tri Dao},
      year={2025},
      eprint={2512.14080},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2512.14080}, 
}
