# daVinci-MagiHuman

**Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model**

SII-GAIR & Sand.ai

## ✨ Highlights

- 🧠 **Single-Stream Transformer** — A unified 15B-parameter, 40-layer Transformer that jointly processes text, video, and audio via self-attention only. No cross-attention, no multi-stream complexity.
- 🎭 **Exceptional Human-Centric Quality** — Expressive facial performance, natural speech-expression coordination, realistic body motion, and accurate audio-video synchronization.
- 🌍 **Multilingual** — Supports Chinese (Mandarin and Cantonese), English, Japanese, Korean, German, and French.
- ⚡ **Blazing-Fast Inference** — Generates a 5-second 256p video in 2 seconds and a 5-second 1080p video in 38 seconds on a single H100 GPU.
- 🏆 **State-of-the-Art Results** — Achieves an 80.0% win rate vs. Ovi 1.1 and 60.9% vs. LTX 2.3 in pairwise human evaluation over 2,000 comparisons.
- 📦 **Fully Open Source** — We release the complete model stack: base model, distilled model, super-resolution model, and inference code.

## 🎬 Demo

*(Demo videos `video_1.mp4` through `video_7.mp4` are embedded on the repository page.)*

πŸ—οΈ Architecture

daVinci-MagiHuman uses a single-stream Transformer that takes text tokens, a reference image latent, and noisy video and audio tokens as input, and jointly denoises the video and audio within a unified token sequence.

Key design choices:

Component Description
πŸ₯ͺ Sandwich Architecture First and last 4 layers use modality-specific projections; middle 32 layers share parameters across modalities
πŸ• Timestep-Free Denoising No explicit timestep embeddings β€” the model infers the denoising state directly from input latents
πŸ”€ Per-Head Gating Learned scalar gates with sigmoid activation on each attention head for training stability
πŸ”— Unified Conditioning Denoising and reference signals handled through a minimal unified interface β€” no dedicated conditioning branches
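The per-head gating above amounts to one learned scalar per attention head, squashed through a sigmoid and broadcast over that head's output. A minimal NumPy sketch for illustration (the actual model operates on PyTorch attention tensors; shapes and names here are assumptions):

```python
import numpy as np

def per_head_gate(attn_out: np.ndarray, gate_logits: np.ndarray) -> np.ndarray:
    """Scale each attention head's output by a learned scalar sigmoid gate.

    attn_out:    (batch, heads, seq, head_dim) attention output
    gate_logits: (heads,) one learned scalar logit per head
    """
    gates = 1.0 / (1.0 + np.exp(-gate_logits))      # sigmoid, in (0, 1)
    return attn_out * gates.reshape(1, -1, 1, 1)    # broadcast over each head

# A logit of 0 yields a gate of 0.5; a large negative logit mutes a head,
# which is what gives the scheme its stabilizing effect during training.
out = per_head_gate(np.ones((1, 2, 3, 4)), np.zeros(2))
```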

## 📊 Performance

### Quantitative Quality Benchmark

| Model | Visual Quality ↑ | Text Alignment ↑ | Physical Consistency ↑ | WER ↓ |
| --- | --- | --- | --- | --- |
| Ovi 1.1 | 4.73 | 4.10 | 4.41 | 40.45% |
| LTX 2.3 | 4.76 | 4.12 | 4.56 | 19.23% |
| daVinci-MagiHuman | 4.80 | 4.18 | 4.52 | 14.60% |

### Human Evaluation (2,000 Pairwise Comparisons)

| Matchup | daVinci-MagiHuman Win | Tie | Opponent Win |
| --- | --- | --- | --- |
| vs. Ovi 1.1 | 80.0% | 8.2% | 11.8% |
| vs. LTX 2.3 | 60.9% | 17.2% | 21.9% |

### Inference Speed (5-second video, single H100 GPU)

| Resolution | Base (s) | Super-Res (s) | Decode (s) | Total (s) |
| --- | --- | --- | --- | --- |
| 256p | 1.6 | — | 0.4 | 2.0 |
| 540p | 1.6 | 5.1 | 1.3 | 8.0 |
| 1080p | 1.6 | 31.0 | 5.8 | 38.4 |
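The Total column is just the sum of the three stage timings; note that the base-generation cost is constant (1.6 s) and higher resolutions pay only for super-resolution and decoding:

```python
# Stage timings (seconds) transcribed from the table above.
stages = {
    "256p":  {"base": 1.6, "super_res": 0.0,  "decode": 0.4},
    "540p":  {"base": 1.6, "super_res": 5.1,  "decode": 1.3},
    "1080p": {"base": 1.6, "super_res": 31.0, "decode": 5.8},
}

# Per-resolution totals, rounded to one decimal as in the table.
totals = {res: round(sum(t.values()), 1) for res, t in stages.items()}
# -> {"256p": 2.0, "540p": 8.0, "1080p": 38.4}
```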

## 🚀 Efficient Inference Techniques

- ⚡ **Latent-Space Super-Resolution** — A two-stage pipeline: generate at low resolution, then refine in latent space, avoiding an extra VAE decode-encode round trip.
- 🔄 **Turbo VAE Decoder** — A lightweight, re-trained decoder that substantially reduces decoding overhead.
- 🔧 **Full-Graph Compilation** — MagiCompiler fuses operators across Transformer layers for a ~1.2x speedup.
- 💨 **Distillation** — DMD-2 distillation enables generation in only 8 denoising steps (no CFG) without sacrificing quality.
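Taken together, the end-to-end flow is: distilled base sampling at 256p, optional refinement in latent space, then a single decode at the target resolution. A schematic sketch with stub objects standing in for the real models (function names are illustrative, not the project's actual API):

```python
class _StubModel:
    """Stand-in for the real base / super-resolution / VAE models."""
    def sample(self, prompt):                 # 8-step distilled denoising, no CFG
        return "latents@256p", "audio"
    def refine(self, latents, resolution):    # refinement stays in latent space
        return f"latents@{resolution}"
    def decode(self, latents):                # single Turbo VAE decode at the end
        return f"video({latents})"

def generate(prompt, resolution, base, sr, vae):
    latents, audio = base.sample(prompt)          # stage 1: low-res generation
    if resolution != "256p":
        latents = sr.refine(latents, resolution)  # stage 2: latent-space super-resolution
    return vae.decode(latents), audio             # decode once, at target resolution

video, audio = generate("a man speaking", "1080p",
                        _StubModel(), _StubModel(), _StubModel())
```

The key property this shape preserves is that the VAE is invoked exactly once, regardless of target resolution.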

## 📦 Getting Started

### Option 1: Docker (Recommended)

```bash
# Recommended: use the prebuilt MagiHuman image (supports the full pipeline, including 1080p SR)
docker pull sandai/magi-human:latest

docker run -it --gpus all --network host --ipc host \
  -v /path/to/repos:/workspace \
  -v /path/to/checkpoints:/models \
  --name my-magi-human \
  sandai/magi-human:latest \
  bash

# Install MagiCompiler
git clone https://github.com/SandAI-org/MagiCompiler.git
cd MagiCompiler
pip install -r requirements.txt
pip install .
cd ..

# Clone daVinci-MagiHuman
git clone https://github.com/GAIR-NLP/daVinci-MagiHuman
cd daVinci-MagiHuman
```

If you prefer manual setup, follow Option 2 (Conda) below.

### Option 2: Conda

```bash
# Create the environment
conda create -n davinci-magihuman python=3.12
conda activate davinci-magihuman
conda install ffmpeg

# Install PyTorch
pip install torch==2.10.0 torchvision==0.25.0 torchaudio==2.10.0

# Install Flash Attention (Hopper)
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention/hopper && python setup.py install && cd ../..

# Install MagiCompiler
git clone https://github.com/SandAI-org/MagiCompiler.git
cd MagiCompiler
pip install -r requirements.txt
pip install .
cd ..

# Clone and install daVinci-MagiHuman
git clone https://github.com/GAIR-NLP/daVinci-MagiHuman
cd daVinci-MagiHuman
pip install -r requirements.txt
pip install --no-deps -r requirements-nodeps.txt

# Optional (only for sr-1080p): install MagiAttention
git clone --recursive https://github.com/SandAI-org/MagiAttention.git
cd MagiAttention
git checkout v1.0.5
git submodule update --init --recursive
pip install -r requirements.txt
pip install --no-build-isolation .
```

### Download Model Checkpoints

Download the complete model stack from HuggingFace and update the paths in the config files under `example/`.

You will also need the following external models:

| Model | Source |
| --- | --- |
| Text Encoder | t5gemma-9b-9b-ul2 |
| Audio Model | stable-audio-open-1.0 |
| VAE | Wan2.2-TI2V-5B |
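One way to script the downloads is `huggingface_hub.snapshot_download`. The mapping below uses only the model names from the table; fill in the full `<org>/<name>` repo ids as listed on each model's HuggingFace page:

```python
from pathlib import Path

def fetch(repo_id: str, models_root: str) -> str:
    """Download one checkpoint repo into models_root/<repo name>."""
    from huggingface_hub import snapshot_download  # pip install huggingface_hub
    local_dir = Path(models_root) / repo_id.split("/")[-1]
    return snapshot_download(repo_id=repo_id, local_dir=str(local_dir))

# External models from the table above. These are short names, not full
# repo ids; replace with the exact "<org>/<name>" before calling fetch().
EXTERNAL = {
    "text_encoder": "t5gemma-9b-9b-ul2",
    "audio_model": "stable-audio-open-1.0",
    "vae": "Wan2.2-TI2V-5B",
}
```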

## 🎯 Usage

Before running, update the checkpoint paths in the config files (`example/*/config.json`) to point to your local model directory.

**Note:** The first run will be slower due to model compilation and cache warmup; subsequent runs match the reported inference speeds.
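A small helper can rewrite the checkpoint paths in bulk. The `*_path` key convention below is an assumption about the config schema, so adjust it to match the real `config.json` fields:

```python
import json
from pathlib import Path

def point_configs_at(models_root: str, example_dir: str = "example") -> int:
    """Point every *_path entry in example/*/config.json at models_root.

    Assumes checkpoint locations are stored as string values under keys
    ending in "_path" (hypothetical schema). Returns the number of
    entries rewritten.
    """
    changed = 0
    for cfg_file in Path(example_dir).glob("*/config.json"):
        cfg = json.loads(cfg_file.read_text())
        for key, value in cfg.items():
            if key.endswith("_path") and isinstance(value, str):
                # Keep the checkpoint filename, swap in the local root.
                cfg[key] = str(Path(models_root) / Path(value).name)
                changed += 1
        cfg_file.write_text(json.dumps(cfg, indent=2))
    return changed
```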

### Base Model (256p)

```bash
bash example/base/run.sh
```

### Distilled Model (256p, 8 steps, no CFG)

```bash
bash example/distill/run.sh
```

### Super-Resolution to 540p

```bash
bash example/sr_540p/run.sh
```

### Super-Resolution to 1080p

```bash
bash example/sr_1080p/run.sh
```

## ✍️ Prompt Guidance

daVinci-MagiHuman uses an Enhanced Prompt system that rewrites user inputs into detailed performance directions optimized for avatar-style video generation. For the full system-prompt specification, see `prompts/enhanced_prompt_design.md`.

Below is a quick reference for writing effective prompts.

### Output Structure

Every enhanced prompt has three parts:

1. **Main Body (150–200 words)** — A clinical, chronological description of the character's appearance, facial dynamics, vocal delivery, and static cinematography. Written in English regardless of the dialogue language.

2. **Dialogue** — Repeats all spoken lines in a structured format:

   ```
   Dialogue:
   <character description, language>: "Line content"
   ```

3. **Background Sound** — Specifies the most prominent ambient sound:

   ```
   Background Sound:
   <Description of the background sound>
   ```

   Use `<No prominent background sound>` if none.
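The three-part layout can be assembled mechanically. The helper below is illustrative only; the authoritative format lives in `prompts/enhanced_prompt_design.md`:

```python
def build_enhanced_prompt(main_body, dialogue, background_sound=None):
    """Assemble the three-part enhanced prompt described above.

    dialogue: list of (speaker description incl. language, spoken line) pairs.
    background_sound: description string, or None for no prominent sound.
    """
    parts = [main_body.strip(), "", "Dialogue:"]
    parts += [f'<{speaker}>: "{line}"' for speaker, line in dialogue]
    parts += ["", "Background Sound:"]
    parts.append(f"<{background_sound}>" if background_sound
                 else "<No prominent background sound>")
    return "\n".join(parts)

prompt = build_enhanced_prompt(
    "A man in a yellow shirt speaks earnestly...",
    [("Young man in yellow polo, Mandarin", "Line content")],
)
```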

### Quick Example

**User input:** A man in a yellow shirt says "有的人在一起生活一辈子，还带着假面具呢" ("Some people live together for a lifetime and still wear masks").

**Enhanced prompt (abbreviated):**

A young man with short dark hair, wearing a bright yellow polo shirt, sits stationary. His disposition is earnest and slightly agitated... He speaks with a rapid, emphatic tone, his mouth opening wide as he says, "有 的 人 在 一 起 生 活 一 辈 子，还 带 着 假 面 具 呢..." His brow furrows, his lip muscles showing distinct dynamics...

Dialogue: <Young man in yellow polo, Mandarin>: "有 的 人 在 一 起 生 活 一 辈 子，还 带 着 假 面 具 呢..."

Background Sound: <No prominent background sound>

πŸ™ Acknowledgements

We thank the open-source community, and in particular Wan2.2 and Turbo-VAED, for their valuable contributions.

πŸ“„ License

This project is released under the Apache License 2.0.

## 📖 Citation

```bibtex
@article{davinci-magihuman-2026,
  title={Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model},
  author={SII-GAIR and Sand.ai and Chern, Ethan and Teng, Hansi and Sun, Hanwen and Wang, Hao and Pan, Hong and Jia, Hongyu and Su, Jiadi and Li, Jin and Yu, Junjie and Liu, Lijie and Li, Lingzhi and Ye, Lyumanshan and Hu, Min and Wang, Qiangang and Qi, Quanwei and Chern, Steffi and Bu, Tao and Wang, Taoran and Xu, Teren and Zhang, Tianning and Mi, Tiantian and Xu, Weixian and Zhang, Wenqiang and Zhang, Wentai and Yi, Xianping and Cai, Xiaojie and Kang, Xiaoyang and Ma, Yan and Liu, Yixiu and Zhang, Yunbo and Huang, Yunpeng and Lin, Yutong and Tao, Zewei and Liu, Zhaoliang and Zhang, Zheng and Cen, Zhiyao and Yu, Zhixuan and Wang, Zhongshu and Hu, Zhulin and Zhou, Zijin and Guo, Zinan and Cao, Yue and Liu, Pengfei},
  journal={arXiv preprint arXiv:2603.21986},
  year={2026}
}
```
