A Distributed Attention Towards Linear Scalability for Ultra-Long Context, Heterogeneous Mask Training
- [2026/02] 🎉 We release MagiAttention-v1.1.0 to: (1) add early support for Blackwell via a new attention kernel backend `FFA_FA4`, built on a forked Flash-Attention 4; (2) provide full support for native group collective kernels for both intranode and internode communication based upon DeepEP; (3) update the MagiAttention Blog with a comprehensive attention benchmark on H100 and B200, demonstrating SOTA performance and near-linear scalability.
2025 News
- [2025/11] 🚀 We release MagiAttention-v1.0.5 with native support for (distributed) learnable attention sink mechanism in both Flex-Flash-Attention and MagiAttention, plus a drop-in integration for Flash-Attention via our Extensions, alongside which we provide a blog post that shares our design insights and implementation details. Furthermore, we support native group collective kernels for intranode communication based on DeepEP as an experimental feature.
- [2025/09] 📌 We release MagiAttention-v1.0.4 to update the API, support compilable and JIT-built FFA, optimize performance for sparse scenarios, reduce workspace memory usage, and introduce several in-progress experimental features.
- [2025/07] 🚀 We release MagiAttention-v1.0.3 with improvements including documentation, support for all four mask types with arbitrary overlapping, a deterministic mode, API updates, FFA performance enhancements with bug fixes, optimized dispatch solvers, hierarchical-comm support, and example code to train a Llama-3 1B model with MagiAttention + FSDP / Transformers.
- [2025/06] 📌 We release MagiAttention-v1.0.2 to provide example code for integrating Megatron-LM with MagiAttention, along with several training convergence experiments (see here for more details), some bug fixes, and a roadmap.
- [2025/05] 📌 We release MagiAttention-v1.0.1 to support overlapped q_ranges when all mask types are FULL, with some code cleanup and bug fixes.
- [2025/04] 🎉 We release MagiAttention-v1.0.0 with its blog: a distributed attention towards linear scalability for ultra-long context, heterogeneous mask training.
MagiAttention is a next-generation distributed attention mechanism (commonly called context parallelism, or CP) that offers kernel-level flexibility for diverse attention-mask patterns while delivering linear scalability across distributed training setups. It is especially well suited for workloads involving ultra-long contexts and heterogeneous masks, e.g., autoregressive video generation with Magi-1.
Additionally, it integrates easily with mainstream training frameworks such as Megatron-LM, PyTorch FSDP and HuggingFace Transformers; see QuickStart for usage.
We are committed to continually improving the performance and generality of MagiAttention for the broader research community.
Stay tuned for exciting enhancements and new features on the horizon! Any feedback or contributions are very welcome!
To achieve linear scalability in distributed attention, we implemented the following key design innovations:
- Flexible Flash Attention Kernel. We introduce a generalized attention-mask formulation, `AttnSlice`, together with a tailored kernel, Flex-Flash-Attention (FFA), natively designed to compactly express diverse mask types and make distributed mask partitioning tractable, with performance comparable to Flash-Attention 3 on Hopper GPUs and preliminary support for Blackwell via a forked Flash-Attention 4.
- Computation Load Balancing. With a fine-grained chunk-level sharding strategy, we design an efficient dispatch solver that ensures balanced computational workloads across all CP ranks.
- Zero-Redundant Communication. Instead of adopting the common Ring-style P2P communication pattern, we propose two novel communication primitives, GroupCast and GroupReduce, realizing zero-redundant communication volume for both forward and backward passes.
- Adaptive Multi-Stage Overlap. Leveraging the above enhancements, we further implement an adaptive multi-stage overlap strategy that schedules computation and communication to effectively hide latency and maximize utilization via either manual or automatic tuning.
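To make the `AttnSlice` idea above concrete, here is a minimal, purely illustrative sketch (plain Python, not the MagiAttention API; all names are hypothetical) of expressing a heterogeneous mask as a list of `(q_range, k_range, mask_type)` slices and expanding it into a dense boolean mask, e.g., for a varlen causal mask over two packed documents:

```python
# Illustrative sketch of an AttnSlice-style mask formulation.
# NOTE: names and semantics here are assumptions for exposition,
# not the actual MagiAttention interface.

FULL, CAUSAL = "full", "causal"

def materialize(seq_len_q, seq_len_k, slices):
    """Expand (q_range, k_range, mask_type) slices into a dense
    boolean mask (True = query attends to key)."""
    mask = [[False] * seq_len_k for _ in range(seq_len_q)]
    for (q0, q1), (k0, k1), kind in slices:
        for i in range(q0, q1):
            for j in range(k0, k1):
                if kind == FULL:
                    mask[i][j] = True
                elif kind == CAUSAL:
                    # Causal within the slice, diagonal aligned to the
                    # bottom-right corner (as in varlen causal masking).
                    if (j - k0) <= (i - q0) + (k1 - k0) - (q1 - q0):
                        mask[i][j] = True
    return mask

# Varlen causal mask: two packed documents of lengths 3 and 2,
# each expressed compactly as a single causal slice.
slices = [((0, 3), (0, 3), CAUSAL), ((3, 5), (3, 5), CAUSAL)]
mask = materialize(5, 5, slices)
```

Because each slice is an independent rectangle with its own mask type, a distributed runtime can cut slices at chunk boundaries and reassign the pieces across CP ranks, which is what makes distributed mask partitioning tractable.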
If you are interested in the detailed methodology and implementation, please check our blog for more information.
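As a rough intuition for the chunk-level load balancing described above, the sketch below shows a simple greedy heuristic (largest chunk first, onto the least-loaded rank). This is an assumption-laden toy, not MagiAttention's actual dispatch solver; chunk "cost" here stands in for the attention area each chunk must compute:

```python
# Toy dispatch balancer: assign weighted mask chunks to CP ranks.
# This greedy LPT heuristic is illustrative only, not the
# optimized solver shipped with MagiAttention.
import heapq

def dispatch(chunk_costs, cp_size):
    """Return {rank: [chunk indices]} with roughly balanced total cost."""
    heap = [(0, rank, []) for rank in range(cp_size)]  # (load, rank, chunks)
    heapq.heapify(heap)
    # Place the largest chunks first; each goes to the least-loaded rank.
    for idx in sorted(range(len(chunk_costs)), key=lambda i: -chunk_costs[i]):
        load, rank, chunks = heapq.heappop(heap)
        chunks.append(idx)
        heapq.heappush(heap, (load + chunk_costs[idx], rank, chunks))
    return {rank: chunks for _, rank, chunks in heap}

# Causal masks make later chunks strictly more expensive than earlier
# ones, so naive contiguous sharding is imbalanced; greedy dispatch
# of the 8 chunks below onto 4 ranks yields a load of 9 on every rank.
assignment = dispatch([1, 2, 3, 4, 5, 6, 7, 8], cp_size=4)
```

The real solver must additionally respect communication cost and chunk locality, which is why it is paired with the zero-redundant primitives and overlap scheduling described above.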
We provide comprehensive documentation here for MagiAttention, including installation instructions, API references, usage examples, tuning guides, technical blogs, performance benchmarks, etc.
Please refer to our Installation documentation for detailed instructions on how to install MagiAttention from source.
Please refer to our QuickStart documentation on how to get started with MagiAttention, with simple code snippets for basic usage and examples for integrating with popular training frameworks like Megatron-LM, PyTorch FSDP and HuggingFace Transformers.
We provide additional magi_attn_extensions to offer supplementary utilities based on magi_attention, such as FlashAttention with Attention Sink.
Please refer to our Future Work documentation for upcoming features and improvements.
We present representative distributed-level benchmarks below for the most commonly used varlen causal mask on both H100 and B200 GPUs, highlighting MagiAttention’s performance and scalability versus other leading CP strategies.
For detailed performance benchmarks of MagiAttention on various hardware setups and (distributed) attention scenarios, please refer to our Benchmark blog.
We welcome and value any contributions and collaborations. Please check out CONTRIBUTING.md for how to get involved.
To collect your valuable feedback and stay updated with the latest news, releases, and discussions about MagiAttention, join our official WeChat group by scanning the QR code below:
If you find MagiAttention useful in your research, please cite:
@misc{magiattention2025,
title={MagiAttention: A Distributed Attention Towards Linear Scalability for Ultra-Long Context, Heterogeneous Mask Training},
author={Tao, Zewei and Huang, Yunpeng},
year={2025},
howpublished={\url{https://github.com/SandAI-org/MagiAttention/}},
}

We would like to thank everyone who contributed to the development of MagiAttention.
Actively developing and maintaining the codebase.
| Member | Affiliations | Email | GitHub Account |
|---|---|---|---|
| Zewei Tao | SandAI | zeweitao@sand.ai | littsk |
| Yunpeng Huang | SandAI | yunpenghuang@sand.ai | Strivin0311 |
| Qiangang Wang | SandAI, Nanjing University | 522024330081@smail.nju.edu.cn | WT1W |
| Hanwen Sun | Peking University | sunhanwen@stu.pku.edu.cn | hanwen-sun |
| Jin Li | SandAI, Tsinghua University | 2609835176@qq.com | lijinnn |
| Tao Bu | SandAI, Nanjing University | 502024330002@smail.nju.edu.cn | Big-TRex |
| Bowen Zeng | Zhejiang University | zbw.cs@zju.edu.cn | KevinZeng08 |
We are deeply grateful for their valuable contributions during the initial research and bootstrapping phases of MagiAttention.
| Member | Affiliations | Email | GitHub Account |
|---|---|---|---|
| WenYang Fang | Nanjing University | fwy@smail.nju.edu.cn | kagami4243 |
| Siyuan Yan | Nanjing University | siyuanyan@smail.nju.edu.cn | FibonaccciYan |
| Zixu Jiang | Nanjing University | 522023330040@smail.nju.edu.cn | 191220042 |
| Dingkun Xu | Nanjing University | 211220090@smail.nju.edu.cn | PureDimension |
| Mingyu Liang | Nanjing University | mingyuliang518@gmail.com | gaomusiki |
| Jingwei Xu | Nanjing University | jingweix@nju.edu.cn | paragonlight |
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
