# daVinci-MagiHuman

**Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model**

SII-GAIR & Sand.ai

## ✨ Highlights

- 🧠 **Single-Stream Transformer** — A unified 15B-parameter, 40-layer Transformer that jointly processes text, video, and audio via self-attention only. No cross-attention, no multi-stream complexity.
- 🎭 **Exceptional Human-Centric Quality** — Expressive facial performance, natural speech-expression coordination, realistic body motion, and accurate audio-video synchronization.
- 🌍 **Multilingual** — Supports Chinese (Mandarin and Cantonese), English, Japanese, Korean, German, and French.
- ⚡ **Blazing-Fast Inference** — Generates a 5-second 256p video in 2 seconds and a 5-second 1080p video in 38 seconds on a single H100 GPU.
- 🏆 **State-of-the-Art Results** — Achieves an 80.0% win rate vs. Ovi 1.1 and 60.9% vs. LTX 2.3 in pairwise human evaluation over 2,000 comparisons.
- 📦 **Fully Open Source** — We release the complete model stack: base model, distilled model, super-resolution model, and inference code.

## 🎬 Demo

*(Demo videos `video_1.mp4` through `video_7.mp4` are embedded on the repository page.)*

πŸ—οΈ Architecture

daVinci-MagiHuman uses a single-stream Transformer that takes text tokens, a reference image latent, and noisy video and audio tokens as input, and jointly denoises the video and audio within a unified token sequence.

Key design choices:

Component Description
πŸ₯ͺ Sandwich Architecture First and last 4 layers use modality-specific projections; middle 32 layers share parameters across modalities
πŸ• Timestep-Free Denoising No explicit timestep embeddings β€” the model infers the denoising state directly from input latents
πŸ”€ Per-Head Gating Learned scalar gates with sigmoid activation on each attention head for training stability
πŸ”— Unified Conditioning Denoising and reference signals handled through a minimal unified interface β€” no dedicated conditioning branches
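The per-head gating above amounts to one learned scalar per attention head, squashed through a sigmoid and broadcast over that head's output. A minimal NumPy sketch for illustration (the actual model operates on PyTorch attention tensors; shapes and names here are assumptions):

```python
import numpy as np

def per_head_gate(attn_out: np.ndarray, gate_logits: np.ndarray) -> np.ndarray:
    """Scale each attention head's output by a learned scalar sigmoid gate.

    attn_out:    (batch, heads, seq, head_dim) attention output
    gate_logits: (heads,) one learned scalar logit per head
    """
    gates = 1.0 / (1.0 + np.exp(-gate_logits))      # sigmoid, in (0, 1)
    return attn_out * gates.reshape(1, -1, 1, 1)    # broadcast over each head

# A logit of 0 yields a gate of 0.5; a large negative logit mutes a head,
# which is what gives the scheme its stabilizing effect during training.
out = per_head_gate(np.ones((1, 2, 3, 4)), np.zeros(2))
```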

## 📊 Performance

### Quantitative Quality Benchmark

| Model | Visual Quality ↑ | Text Alignment ↑ | Physical Consistency ↑ | WER ↓ |
| --- | --- | --- | --- | --- |
| Ovi 1.1 | 4.73 | 4.10 | 4.41 | 40.45% |
| LTX 2.3 | 4.76 | 4.12 | 4.56 | 19.23% |
| daVinci-MagiHuman | 4.80 | 4.18 | 4.52 | 14.60% |

### Human Evaluation (2,000 Pairwise Comparisons)

| Matchup | daVinci-MagiHuman Win | Tie | Opponent Win |
| --- | --- | --- | --- |
| vs. Ovi 1.1 | 80.0% | 8.2% | 11.8% |
| vs. LTX 2.3 | 60.9% | 17.2% | 21.9% |

### Inference Speed (5-second video, single H100 GPU)

| Resolution | Base (s) | Super-Res (s) | Decode (s) | Total (s) |
| --- | --- | --- | --- | --- |
| 256p | 1.6 | — | 0.4 | 2.0 |
| 540p | 1.6 | 5.1 | 1.3 | 8.0 |
| 1080p | 1.6 | 31.0 | 5.8 | 38.4 |
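The Total column is just the sum of the three stage timings; note that the base-generation cost is constant (1.6 s) and higher resolutions pay only for super-resolution and decoding:

```python
# Stage timings (seconds) transcribed from the table above.
stages = {
    "256p":  {"base": 1.6, "super_res": 0.0,  "decode": 0.4},
    "540p":  {"base": 1.6, "super_res": 5.1,  "decode": 1.3},
    "1080p": {"base": 1.6, "super_res": 31.0, "decode": 5.8},
}

# Per-resolution totals, rounded to one decimal as in the table.
totals = {res: round(sum(t.values()), 1) for res, t in stages.items()}
# -> {"256p": 2.0, "540p": 8.0, "1080p": 38.4}
```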

## 🚀 Efficient Inference Techniques

- ⚡ **Latent-Space Super-Resolution** — A two-stage pipeline: generate at low resolution, then refine in latent space, avoiding an extra VAE decode-encode round trip.
- 🔄 **Turbo VAE Decoder** — A lightweight, re-trained decoder that substantially reduces decoding overhead.
- 🔧 **Full-Graph Compilation** — MagiCompiler fuses operators across Transformer layers for a ~1.2x speedup.
- 💨 **Distillation** — DMD-2 distillation enables generation in only 8 denoising steps (no CFG) without sacrificing quality.
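Taken together, the end-to-end flow is: distilled base sampling at 256p, optional refinement in latent space, then a single decode at the target resolution. A schematic sketch with stub objects standing in for the real models (function names are illustrative, not the project's actual API):

```python
class _StubModel:
    """Stand-in for the real base / super-resolution / VAE models."""
    def sample(self, prompt):                 # 8-step distilled denoising, no CFG
        return "latents@256p", "audio"
    def refine(self, latents, resolution):    # refinement stays in latent space
        return f"latents@{resolution}"
    def decode(self, latents):                # single Turbo VAE decode at the end
        return f"video({latents})"

def generate(prompt, resolution, base, sr, vae):
    latents, audio = base.sample(prompt)          # stage 1: low-res generation
    if resolution != "256p":
        latents = sr.refine(latents, resolution)  # stage 2: latent-space super-resolution
    return vae.decode(latents), audio             # decode once, at target resolution

video, audio = generate("a man speaking", "1080p",
                        _StubModel(), _StubModel(), _StubModel())
```

The key property this shape preserves is that the VAE is invoked exactly once, regardless of target resolution.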

## 📦 Getting Started

### Option 1: Docker (Recommended)

```bash
# Recommended: use the prebuilt MagiHuman image (supports the full pipeline, including 1080p SR)
docker pull sandai/magi-human:latest

docker run -it --gpus all --network host --ipc host \
  -v /path/to/repos:/workspace \
  -v /path/to/checkpoints:/models \
  --name my-magi-human \
  sandai/magi-human:latest \
  bash

# Install MagiCompiler
git clone https://github.com/SandAI-org/MagiCompiler.git
cd MagiCompiler
pip install -r requirements.txt
pip install .
cd ..

# Clone daVinci-MagiHuman
git clone https://github.com/GAIR-NLP/daVinci-MagiHuman
cd daVinci-MagiHuman
```

If you prefer manual setup, follow Option 2 (Conda) below.

### Option 2: Conda

```bash
# Create the environment
conda create -n davinci-magihuman python=3.12
conda activate davinci-magihuman
conda install ffmpeg

# Install PyTorch
pip install torch==2.10.0 torchvision==0.25.0 torchaudio==2.10.0

# Install Flash Attention (Hopper)
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention/hopper && python setup.py install && cd ../..

# Install MagiCompiler
git clone https://github.com/SandAI-org/MagiCompiler.git
cd MagiCompiler
pip install -r requirements.txt
pip install .
cd ..

# Clone and install daVinci-MagiHuman
git clone https://github.com/GAIR-NLP/daVinci-MagiHuman
cd daVinci-MagiHuman
pip install -r requirements.txt
pip install --no-deps -r requirements-nodeps.txt

# Optional (only for sr-1080p): install MagiAttention
git clone --recursive https://github.com/SandAI-org/MagiAttention.git
cd MagiAttention
git checkout v1.0.5
git submodule update --init --recursive
pip install -r requirements.txt
pip install --no-build-isolation .
```

### Download Model Checkpoints

Download the complete model stack from HuggingFace and update the paths in the config files under `example/`.

You will also need the following external models:

| Model | Source |
| --- | --- |
| Text Encoder | t5gemma-9b-9b-ul2 |
| Audio Model | stable-audio-open-1.0 |
| VAE | Wan2.2-TI2V-5B |
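One way to script the downloads is `huggingface_hub.snapshot_download`. The mapping below uses only the model names from the table; fill in the full `<org>/<name>` repo ids as listed on each model's HuggingFace page:

```python
from pathlib import Path

def fetch(repo_id: str, models_root: str) -> str:
    """Download one checkpoint repo into models_root/<repo name>."""
    from huggingface_hub import snapshot_download  # pip install huggingface_hub
    local_dir = Path(models_root) / repo_id.split("/")[-1]
    return snapshot_download(repo_id=repo_id, local_dir=str(local_dir))

# External models from the table above. These are short names, not full
# repo ids; replace with the exact "<org>/<name>" before calling fetch().
EXTERNAL = {
    "text_encoder": "t5gemma-9b-9b-ul2",
    "audio_model": "stable-audio-open-1.0",
    "vae": "Wan2.2-TI2V-5B",
}
```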

## 🎯 Usage

Before running, update the checkpoint paths in the config files (`example/*/config.json`) to point to your local model directory.

**Note:** The first run will be slower due to model compilation and cache warmup; subsequent runs match the reported inference speeds.
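A small helper can rewrite the checkpoint paths in bulk. The `*_path` key convention below is an assumption about the config schema, so adjust it to match the real `config.json` fields:

```python
import json
from pathlib import Path

def point_configs_at(models_root: str, example_dir: str = "example") -> int:
    """Point every *_path entry in example/*/config.json at models_root.

    Assumes checkpoint locations are stored as string values under keys
    ending in "_path" (hypothetical schema). Returns the number of
    entries rewritten.
    """
    changed = 0
    for cfg_file in Path(example_dir).glob("*/config.json"):
        cfg = json.loads(cfg_file.read_text())
        for key, value in cfg.items():
            if key.endswith("_path") and isinstance(value, str):
                # Keep the checkpoint filename, swap in the local root.
                cfg[key] = str(Path(models_root) / Path(value).name)
                changed += 1
        cfg_file.write_text(json.dumps(cfg, indent=2))
    return changed
```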

### Base Model (256p)

```bash
bash example/base/run.sh
```

### Distilled Model (256p, 8 steps, no CFG)

```bash
bash example/distill/run.sh
```

### Super-Resolution to 540p

```bash
bash example/sr_540p/run.sh
```

### Super-Resolution to 1080p

```bash
bash example/sr_1080p/run.sh
```

## ✍️ Prompt Guidance

daVinci-MagiHuman uses an Enhanced Prompt system that rewrites user inputs into detailed performance directions optimized for avatar-style video generation. For the full system-prompt specification, see `prompts/enhanced_prompt_design.md`.

Below is a quick reference for writing effective prompts.

### Output Structure

Every enhanced prompt has three parts:

1. **Main Body (150–200 words)** — A clinical, chronological description of the character's appearance, facial dynamics, vocal delivery, and static cinematography. Written in English regardless of the dialogue language.

2. **Dialogue** — Repeats all spoken lines in a structured format:

   ```
   Dialogue:
   <character description, language>: "Line content"
   ```

3. **Background Sound** — Specifies the most prominent ambient sound:

   ```
   Background Sound:
   <Description of the background sound>
   ```

   Use `<No prominent background sound>` if none.
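The three-part layout can be assembled mechanically. The helper below is illustrative only; the authoritative format lives in `prompts/enhanced_prompt_design.md`:

```python
def build_enhanced_prompt(main_body, dialogue, background_sound=None):
    """Assemble the three-part enhanced prompt described above.

    dialogue: list of (speaker description incl. language, spoken line) pairs.
    background_sound: description string, or None for no prominent sound.
    """
    parts = [main_body.strip(), "", "Dialogue:"]
    parts += [f'<{speaker}>: "{line}"' for speaker, line in dialogue]
    parts += ["", "Background Sound:"]
    parts.append(f"<{background_sound}>" if background_sound
                 else "<No prominent background sound>")
    return "\n".join(parts)

prompt = build_enhanced_prompt(
    "A man in a yellow shirt speaks earnestly...",
    [("Young man in yellow polo, Mandarin", "Line content")],
)
```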

### Quick Example

**User input:** A man in a yellow shirt says "有的人在一起生活一辈子，还带着假面具呢" ("Some people live together for a lifetime and still wear masks").

**Enhanced prompt (abbreviated):**

A young man with short dark hair, wearing a bright yellow polo shirt, sits stationary. His disposition is earnest and slightly agitated... He speaks with a rapid, emphatic tone, his mouth opening wide as he says, "有 的 人 在 一 起 生 活 一 辈 子，还 带 着 假 面 具 呢..." His brow furrows, his lip muscles showing distinct dynamics...

Dialogue: <Young man in yellow polo, Mandarin>: "有 的 人 在 一 起 生 活 一 辈 子，还 带 着 假 面 具 呢..."

Background Sound: <No prominent background sound>

πŸ™ Acknowledgements

We thank the open-source community, and in particular Wan2.2 and Turbo-VAED, for their valuable contributions.

πŸ“„ License

This project is released under the Apache License 2.0.

## 📖 Citation

```bibtex
@article{davinci-magihuman-2026,
  title={Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model},
  author={SII-GAIR and Sand.ai and Chern, Ethan and Teng, Hansi and Sun, Hanwen and Wang, Hao and Pan, Hong and Jia, Hongyu and Su, Jiadi and Li, Jin and Yu, Junjie and Liu, Lijie and Li, Lingzhi and Ye, Lyumanshan and Hu, Min and Wang, Qiangang and Qi, Quanwei and Chern, Steffi and Bu, Tao and Wang, Taoran and Xu, Teren and Zhang, Tianning and Mi, Tiantian and Xu, Weixian and Zhang, Wenqiang and Zhang, Wentai and Yi, Xianping and Cai, Xiaojie and Kang, Xiaoyang and Ma, Yan and Liu, Yixiu and Zhang, Yunbo and Huang, Yunpeng and Lin, Yutong and Tao, Zewei and Liu, Zhaoliang and Zhang, Zheng and Cen, Zhiyao and Yu, Zhixuan and Wang, Zhongshu and Hu, Zhulin and Zhou, Zijin and Guo, Zinan and Cao, Yue and Liu, Pengfei},
  journal={arXiv preprint arXiv:2603.21986},
  year={2026}
}
```
