MagiAttention

A Distributed Attention Towards Linear Scalability for Ultra-Long Context, Heterogeneous Mask Training

MagiAttention Overview

Latest News 🔥

  • [2026/02] 🎉 We release MagiAttention-v1.1.0 to: (1) add early support for Blackwell via a new attention kernel backend, FFA_FA4, using a forked Flash-Attention 4; (2) provide full support for native group collective kernels for both intranode and internode communication based on DeepEP; (3) update the MagiAttention Blog with comprehensive attention benchmarks on H100 and B200, demonstrating SOTA performance and near-linear scalability.
2025 News
  • [2025/11] 🚀 We release MagiAttention-v1.0.5 with native support for the (distributed) learnable attention sink mechanism in both Flex-Flash-Attention and MagiAttention, plus a drop-in integration for Flash-Attention via our Extensions, together with a blog post that shares our design insights and implementation details. Furthermore, we support native group collective kernels for intranode communication based on DeepEP as an experimental feature.
  • [2025/09] 📌 We release MagiAttention-v1.0.4 to update the API, support compilable and JIT-built FFA, optimize performance for sparse scenarios, reduce workspace memory usage, and introduce some experimental features still in progress.
  • [2025/07] 🚀 We release MagiAttention-v1.0.3 with improvements including documentation, support for all four mask types with arbitrary overlapping, a deterministic mode, API updates, FFA performance enhancements with bug fixes, optimized dispatch solvers, hierarchical-comm support, and example code to train a Llama-3 1B model with MagiAttention + FSDP / Transformers.
  • [2025/06] 📌 We release MagiAttention-v1.0.2 to provide example code for integrating Megatron-LM with MagiAttention, along with several training-convergence experiments (see here for more details), some bug fixes, and a roadmap.
  • [2025/05] 📌 We release MagiAttention-v1.0.1 to support overlapped q_ranges when all mask types are FULL, with some code cleanup and bug fixes.
  • [2025/04] 🎉 We release MagiAttention-v1.0.0 with its blog: a distributed attention towards linear scalability for ultra-long context, heterogeneous mask training.

About

MagiAttention is a next‑generation distributed attention mechanism—commonly called context‑parallel (CP)—that offers kernel‑level flexibility for diverse attention‑mask patterns while delivering linear scalability across distributed training setups. It is especially well suited for workloads involving ultra-long contexts and heterogeneous masks, e.g., autoregressive video generation with Magi-1.

Additionally, it integrates easily with mainstream training frameworks such as Megatron-LM, PyTorch FSDP and HuggingFace Transformers; see QuickStart for usage.

We are committed to continually improving the performance and generality of MagiAttention for the broader research community.

Stay tuned for exciting enhancements and new features on the horizon! Any feedback or contributions are very welcome!

Key Designs ✨

To achieve linear scalability in distributed attention, we implemented the following key design innovations:

  • Flexible Flash Attention Kernel. We introduce a generalized attention-mask formulation, AttnSlice, together with a tailored kernel, Flex-Flash-Attention (FFA), natively designed to enable compact expression of diverse mask types and make distributed mask partitioning tractable, with performance comparable to Flash-Attention 3 on Hopper GPUs and preliminary support for Blackwell via a forked Flash-Attention 4.
  • Computation Load Balancing. With a fine-grained chunk-level sharding strategy, we design an efficient dispatch solver that ensures balanced computational workloads across all CP ranks.
  • Zero-Redundant Communication. Instead of adopting the common ring-style P2P communication pattern, we propose two novel communication primitives, GroupCast and GroupReduce, achieving zero-redundant communication volume in both the forward and backward passes.
  • Adaptive Multi-Stage Overlap. Leveraging the above enhancements, we further implement an adaptive multi-stage overlap strategy that schedules computation and communication to effectively hide latency and maximize utilization via either manual or automatic tuning.
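To make the AttnSlice idea above concrete, here is a toy sketch in plain Python. This is not MagiAttention's actual API; `build_mask`, the slice tuples, and the mask-type strings are all hypothetical names for illustration. A heterogeneous mask is described compactly as a list of (q_range, k_range, mask_type) slices and only materialized into a dense boolean mask for inspection:

```python
# Hypothetical AttnSlice-style description: each slice marks a rectangular
# region of the (query, key) score matrix as FULL (fully visible) or
# CAUSAL (lower-triangular, aligned to the slice's bottom-right corner).
def build_mask(seqlen_q, seqlen_k, slices):
    mask = [[False] * seqlen_k for _ in range(seqlen_q)]
    for (q0, q1), (k0, k1), kind in slices:
        for i in range(q0, q1):
            if kind == "FULL":
                hi = k1
            else:  # CAUSAL: each query sees one more key than the previous
                hi = k0 + max((k1 - k0) - ((q1 - 1) - i), 0)
            for j in range(k0, hi):
                mask[i][j] = True
    return mask

# A varlen causal mask over two packed sequences of lengths 3 and 2:
slices = [
    ((0, 3), (0, 3), "CAUSAL"),
    ((3, 5), (3, 5), "CAUSAL"),
]
mask = build_mask(5, 5, slices)
```

Expressing the mask as slices rather than a dense matrix is what makes distributed partitioning tractable: each CP rank only needs the slices that intersect its local query range.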

If you are interested in the detailed methodology and implementation, please check our blog for more information.
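As a rough illustration of why a GroupCast-style primitive can beat ring-style P2P, consider a toy accounting of communication volume in whole KV-chunk units (a simplification for illustration; the actual primitives and sharding are finer-grained, and the function names here are hypothetical). With a causal mask and contiguous sequence sharding, rank dst only needs rank src's keys/values when src <= dst, so roughly half of the ring traffic is redundant:

```python
def ring_volume(n):
    # Ring P2P: every rank's KV chunk is relayed through all n - 1 other
    # ranks, whether or not their queries attend to it.
    return n * (n - 1)

def groupcast_volume(n, needs):
    # GroupCast-style: rank src's KV chunk is sent only to the ranks whose
    # queries actually attend to it, given by needs(src, dst) -> bool.
    return sum(
        1
        for src in range(n)
        for dst in range(n)
        if dst != src and needs(src, dst)
    )

# Causal mask with contiguous sequence sharding: queries on rank dst attend
# to keys on rank src iff src <= dst.
causal = lambda src, dst: src <= dst
```

For 8 CP ranks under this causal pattern, the zero-redundant volume is exactly half the ring volume; for sparser heterogeneous masks the gap widens further.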

Documentation 📚

We provide comprehensive documentation here for MagiAttention, including installation instructions, API references, usage examples, tuning guides, technical blogs, performance benchmarks, etc.

Installation ⚙️

Please refer to our Installation documentation for detailed instructions on how to install MagiAttention from source.

Quick Start 🚀

Please refer to our QuickStart documentation on how to get started with MagiAttention, with simple code snippets for basic usage and examples for integrating with popular training frameworks like Megatron-LM, PyTorch FSDP and HuggingFace Transformers.

Extensions 💡

We provide additional magi_attn_extensions to offer supplementary utilities based on magi_attention, such as FlashAttention with Attention Sink.

Future Work ⛏️

Please refer to our Future Work documentation for upcoming features and improvements.

Benchmark 📊

We present representative distributed-level benchmarks below for the most commonly used varlen causal mask on both H100 and B200 GPUs, highlighting MagiAttention’s performance and scalability versus other leading CP strategies.

For detailed performance benchmarks of MagiAttention on various hardware setups and (distributed) attention scenarios, please refer to our Benchmark blog.

H100

[Figures: forward and backward passes] Benchmarking MagiAttention's scalability against other leading CP strategies for varlen causal mask on H100.

B200

[Figures: forward and backward passes] Benchmarking MagiAttention's scalability against other leading CP strategies for varlen causal mask on B200.

Contributing 🤝

We welcome and value any contributions and collaborations. Please check out CONTRIBUTING.md for how to get involved.

WeChat Group 💬

To collect your valuable feedback and stay updated with the latest news, releases, and discussions about MagiAttention, join our official WeChat group by scanning the QR code below:

Citation 📝

If you find MagiAttention useful in your research, please cite:

@misc{magiattention2025,
  title={MagiAttention: A Distributed Attention Towards Linear Scalability for Ultra-Long Context, Heterogeneous Mask Training},
  author={Tao, Zewei and Huang, Yunpeng},
  year={2025},
  howpublished={\url{https://github.com/SandAI-org/MagiAttention/}},
}

Acknowledgement ❤️

We would like to thank everyone who contributed to the development of MagiAttention.

Core Contributors

Actively developing and maintaining the codebase.

| Member | Affiliations | Email | GitHub Account |
| --- | --- | --- | --- |
| Zewei Tao | SandAI | zeweitao@sand.ai | littsk |
| Yunpeng Huang | SandAI | yunpenghuang@sand.ai | Strivin0311 |
| Qiangang Wang | SandAI, Nanjing University | 522024330081@smail.nju.edu.cn | WT1W |
| Hanwen Sun | Peking University | sunhanwen@stu.pku.edu.cn | hanwen-sun |
| Jin Li | SandAI, Tsinghua University | 2609835176@qq.com | lijinnn |
| Tao Bu | SandAI, Nanjing University | 502024330002@smail.nju.edu.cn | Big-TRex |
| Bowen Zeng | Zhejiang University | zbw.cs@zju.edu.cn | KevinZeng08 |

Early-Stage Contributors

We are deeply grateful for their valuable contributions during the initial research and bootstrapping phases of MagiAttention.

| Member | Affiliations | Email | GitHub Account |
| --- | --- | --- | --- |
| WenYang Fang | Nanjing University | fwy@smail.nju.edu.cn | kagami4243 |
| Siyuang Yan | Nanjing University | siyuanyan@smail.nju.edu.cn | FibonaccciYan |
| Zixu Jiang | Nanjing University | 522023330040@smail.nju.edu.cn | 191220042 |
| Dingkun Xu | Nanjing University | 211220090@smail.nju.edu.cn | PureDimension |
| Mingyu Liang | Nanjing University | mingyuliang518@gmail.com | gaomusiki |
| Jingwei Xu | Nanjing University | jingweix@nju.edu.cn | paragonlight |

Star History ⭐

License ⚖️

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
