MagiAttention

A Distributed Attention Towards Linear Scalability for Ultra-Long Context, Heterogeneous Mask Training

MagiAttention Overview

Latest News 🔥

  • [2026/02] 🎉 We release MagiAttention-v1.1.0 to: (1) add early support for Blackwell via a new attention kernel backend, FFA_FA4, using a forked Flash-Attention 4; (2) provide full support for native group collective kernels for both intranode and internode communication based on DeepEP; (3) update the MagiAttention Blog with comprehensive attention benchmarks on H100 and B200, demonstrating SOTA performance and near-linear scalability.
2025 News
  • [2025/11] 🚀 We release MagiAttention-v1.0.5 with native support for the (distributed) learnable attention sink mechanism in both Flex-Flash-Attention and MagiAttention, plus a drop-in integration for Flash-Attention via our Extensions, together with a blog post that shares our design insights and implementation details. Furthermore, we support native group collective kernels for intranode communication based on DeepEP as an experimental feature.
  • [2025/09] 📌 We release MagiAttention-v1.0.4 to update the API, support compilable and JIT-built FFA, optimize performance for sparse scenarios, reduce workspace memory usage, and introduce some experimental features still in progress.
  • [2025/07] 🚀 We release MagiAttention-v1.0.3 with improvements including documentation, support for all four mask types with arbitrary overlapping, a deterministic mode, API updates, FFA performance enhancements with bug fixes, optimized dispatch solvers, hierarchical-comm support, and example code to train a Llama-3 1B model with MagiAttention + FSDP / Transformers.
  • [2025/06] 📌 We release MagiAttention-v1.0.2 to provide example code for integrating Megatron-LM with MagiAttention, along with several training-convergence experiments (see here for more details), some bug fixes, and a roadmap.
  • [2025/05] 📌 We release MagiAttention-v1.0.1 to support overlapped q_ranges when all mask types are FULL, with some code cleanup and bug fixes.
  • [2025/04] 🎉 We release MagiAttention-v1.0.0 with its blog: a distributed attention towards linear scalability for ultra-long context, heterogeneous mask training.

About

MagiAttention is a next‑generation distributed attention mechanism—commonly called context‑parallel (CP)—that offers kernel‑level flexibility for diverse attention‑mask patterns while delivering linear scalability across distributed training setups. It is especially well suited for workloads involving ultra-long contexts and heterogeneous masks, e.g., autoregressive video generation with Magi-1.

Additionally, it integrates easily with mainstream training frameworks such as Megatron-LM, PyTorch FSDP and HuggingFace Transformers; see QuickStart for usage.

We are committed to continually improving the performance and generality of MagiAttention for the broader research community.

Stay tuned for exciting enhancements and new features on the horizon! Any feedback or contributions are very welcome!

Key Designs ✨

To achieve linear scalability in distributed attention, we implemented the following key design innovations:

  • Flexible Flash Attention Kernel. We introduce a generalized attention-mask formulation, AttnSlice, together with a tailored kernel, Flex-Flash-Attention (FFA), natively designed to enable compact expression of diverse mask types and make distributed mask partitioning tractable, with performance comparable to Flash-Attention 3 on Hopper GPUs and preliminary support for Blackwell via a forked Flash-Attention 4.
  • Computation Load Balancing. With a fine-grained chunk-level sharding strategy, we design an efficient dispatch solver that ensures balanced computational workloads across all CP ranks.
  • Zero-Redundant Communication. Instead of adopting the common ring-style P2P communication pattern, we propose two novel communication primitives, GroupCast and GroupReduce, achieving zero-redundant communication volume in both the forward and backward passes.
  • Adaptive Multi-Stage Overlap. Leveraging the above enhancements, we further implement an adaptive multi-stage overlap strategy that schedules computation and communication to effectively hide latency and maximize utilization via either manual or automatic tuning.
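To make the AttnSlice idea above concrete, here is a toy sketch in plain Python. This is not MagiAttention's actual API; `build_mask`, the slice tuples, and the mask-type strings are all hypothetical names for illustration. A heterogeneous mask is described compactly as a list of (q_range, k_range, mask_type) slices and only materialized into a dense boolean mask for inspection:

```python
# Hypothetical AttnSlice-style description: each slice marks a rectangular
# region of the (query, key) score matrix as FULL (fully visible) or
# CAUSAL (lower-triangular, aligned to the slice's bottom-right corner).
def build_mask(seqlen_q, seqlen_k, slices):
    mask = [[False] * seqlen_k for _ in range(seqlen_q)]
    for (q0, q1), (k0, k1), kind in slices:
        for i in range(q0, q1):
            if kind == "FULL":
                hi = k1
            else:  # CAUSAL: each query sees one more key than the previous
                hi = k0 + max((k1 - k0) - ((q1 - 1) - i), 0)
            for j in range(k0, hi):
                mask[i][j] = True
    return mask

# A varlen causal mask over two packed sequences of lengths 3 and 2:
slices = [
    ((0, 3), (0, 3), "CAUSAL"),
    ((3, 5), (3, 5), "CAUSAL"),
]
mask = build_mask(5, 5, slices)
```

Expressing the mask as slices rather than a dense matrix is what makes distributed partitioning tractable: each CP rank only needs the slices that intersect its local query range.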

If you are interested in the detailed methodology and implementation, please check our blog for more information.
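As a rough illustration of why a GroupCast-style primitive can beat ring-style P2P, consider a toy accounting of communication volume in whole KV-chunk units (a simplification for illustration; the actual primitives and sharding are finer-grained, and the function names here are hypothetical). With a causal mask and contiguous sequence sharding, rank dst only needs rank src's keys/values when src <= dst, so roughly half of the ring traffic is redundant:

```python
def ring_volume(n):
    # Ring P2P: every rank's KV chunk is relayed through all n - 1 other
    # ranks, whether or not their queries attend to it.
    return n * (n - 1)

def groupcast_volume(n, needs):
    # GroupCast-style: rank src's KV chunk is sent only to the ranks whose
    # queries actually attend to it, given by needs(src, dst) -> bool.
    return sum(
        1
        for src in range(n)
        for dst in range(n)
        if dst != src and needs(src, dst)
    )

# Causal mask with contiguous sequence sharding: queries on rank dst attend
# to keys on rank src iff src <= dst.
causal = lambda src, dst: src <= dst
```

For 8 CP ranks under this causal pattern, the zero-redundant volume is exactly half the ring volume; for sparser heterogeneous masks the gap widens further.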

Documentation 📚

We provide comprehensive documentation here for MagiAttention, including installation instructions, API references, usage examples, tuning guides, technical blogs, performance benchmarks, etc.

Installation ⚙️

Please refer to our Installation documentation for detailed instructions on how to install MagiAttention from source.

Quick Start 🚀

Please refer to our QuickStart documentation on how to get started with MagiAttention, with simple code snippets for basic usage and examples for integrating with popular training frameworks like Megatron-LM, PyTorch FSDP and HuggingFace Transformers.

Extensions 💡

We provide additional magi_attn_extensions to offer supplementary utilities based on magi_attention, such as FlashAttention with Attention Sink.

Future Work ⛏️

Please refer to our Future Work documentation for upcoming features and improvements.

Benchmark 📊

We present representative distributed-level benchmarks below for the most commonly used varlen causal mask on both H100 and B200 GPUs, highlighting MagiAttention’s performance and scalability versus other leading CP strategies.

For detailed performance benchmarks of MagiAttention on various hardware setups and (distributed) attention scenarios, please refer to our Benchmark blog.

H100

[Figures: forward and backward passes] Benchmarking MagiAttention's scalability against other leading CP strategies for varlen causal mask on H100.

B200

[Figures: forward and backward passes] Benchmarking MagiAttention's scalability against other leading CP strategies for varlen causal mask on B200.

Contributing 🤝

We welcome and value any contributions and collaborations. Please check out CONTRIBUTING.md for how to get involved.

WeChat Group 💬

To collect your valuable feedback and stay updated with the latest news, releases, and discussions about MagiAttention, join our official WeChat group by scanning the QR code below:

Citation 📝

If you find MagiAttention useful in your research, please cite:

@misc{magiattention2025,
  title={MagiAttention: A Distributed Attention Towards Linear Scalability for Ultra-Long Context, Heterogeneous Mask Training},
  author={Tao, Zewei and Huang, Yunpeng},
  year={2025},
  howpublished={\url{https://github.com/SandAI-org/MagiAttention/}},
}

Acknowledgement ❤️

We would like to thank everyone who contributed to the development of MagiAttention.

Core Contributors

Actively developing and maintaining the codebase.

| Member | Affiliations | Email | GitHub Account |
| --- | --- | --- | --- |
| Zewei Tao | SandAI | zeweitao@sand.ai | littsk |
| Yunpeng Huang | SandAI | yunpenghuang@sand.ai | Strivin0311 |
| Qiangang Wang | SandAI, Nanjing University | 522024330081@smail.nju.edu.cn | WT1W |
| Hanwen Sun | Peking University | sunhanwen@stu.pku.edu.cn | hanwen-sun |
| Jin Li | SandAI, Tsinghua University | 2609835176@qq.com | lijinnn |
| Tao Bu | SandAI, Nanjing University | 502024330002@smail.nju.edu.cn | Big-TRex |
| Bowen Zeng | Zhejiang University | zbw.cs@zju.edu.cn | KevinZeng08 |

Early-Stage Contributors

We are deeply grateful for their valuable contributions during the initial research and bootstrapping phases of MagiAttention.

| Member | Affiliations | Email | GitHub Account |
| --- | --- | --- | --- |
| WenYang Fang | Nanjing University | fwy@smail.nju.edu.cn | kagami4243 |
| Siyuang Yan | Nanjing University | siyuanyan@smail.nju.edu.cn | FibonaccciYan |
| Zixu Jiang | Nanjing University | 522023330040@smail.nju.edu.cn | 191220042 |
| Dingkun Xu | Nanjing University | 211220090@smail.nju.edu.cn | PureDimension |
| Mingyu Liang | Nanjing University | mingyuliang518@gmail.com | gaomusiki |
| Jingwei Xu | Nanjing University | jingweix@nju.edu.cn | paragonlight |

Star History ⭐

License ⚖️

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
