Xuangeng Chu*1
Ruicong Liu*1†
Yifei Huang1
Yun Liu2
Yichen Peng3
Bo Zheng2
1Shanda AI Research Tokyo, The University of Tokyo,
2Shanda AI Research Tokyo,
3Institute of Science Tokyo
*Equal contribution, †Corresponding author
git clone --recurse-submodules git@github.com:xg-chu/UniLS.git
cd UniLS
conda env create -f environment.yml
conda activate unils
Or install manually:
pip install torch torchvision torchaudio
pip install accelerate transformers peft einops omegaconf lmdb tqdm scipy wandb
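You can quickly verify that the environment is usable before downloading any assets (a minimal sanity check, not part of the official setup):

```bash
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```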
Download the pretrained UniLS models from HuggingFace.
For FLAME support, download the official FLAME 2020 model from the official FLAME website, then convert it by running python tools/convert_flame.py. This generates a flame_2020.pt file that can be loaded without chumpy.
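As a quick sanity check, the converted file should load with plain torch.load and no chumpy import. A minimal sketch (the actual contents of flame_2020.pt depend on the conversion script, so the key inspection below is only illustrative):

```python
import torch

# Load the FLAME 2020 parameters converted by tools/convert_flame.py;
# the converted .pt file does not require chumpy.
flame = torch.load("flame_2020.pt", map_location="cpu")

# Inspect what the conversion produced.
print(type(flame))
if isinstance(flame, dict):
    print(list(flame.keys()))
```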
Download the dataset from UniLS-Talk Dataset.
UniLS follows a three-stage training pipeline:
Stage 1: Motion Codec (VAE)
python train.py -c unils_codec
Stage 2: Audio-Free Autoregressive Generator
Modify the VAE_PATH entry in the config file to point to the Stage 1 checkpoint, then run:
python train.py -c unils_freegen
Stage 3: Audio-Conditioned LoRA Fine-tuning
Modify the PRETRAIN_PATH entry in the config file to point to the Stage 2 checkpoint (see the config excerpt below), then run:
python train.py -c unils_loragen
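The two fine-tuning stages locate their upstream checkpoints through config entries. A hypothetical excerpt of the relevant keys (file names follow the -c arguments above; the exact YAML structure is defined by the configs shipped in the repository, and the paths are placeholders):

```yaml
# configs/unils_freegen.yaml (Stage 2): point VAE_PATH to the Stage 1 codec checkpoint
VAE_PATH: /path/to/unils_codec/checkpoint.pt

# configs/unils_loragen.yaml (Stage 3): point PRETRAIN_PATH to the Stage 2 generator checkpoint
PRETRAIN_PATH: /path/to/unils_freegen/checkpoint.pt
```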
Run evaluation with multi-GPU support via Accelerate:
accelerate launch eval.py -r /path/to/checkpoint --tau 1.0 --cfg 1.5
You can also pass an external dataset config to override the checkpoint's dataset:
accelerate launch eval.py -r /path/to/checkpoint --dataset configs/dataset.yaml
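The number of GPUs is controlled through Accelerate's standard launcher flags (generic Accelerate options, not UniLS-specific):

```bash
# Evaluate on 4 GPUs
accelerate launch --multi_gpu --num_processes 4 eval.py -r /path/to/checkpoint --tau 1.0 --cfg 1.5
```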
Generate visualizations from the dataset:
python infer_dataset.py -r /path/to/checkpoint --clip_length 20 --tau 1.0 --cfg 1.5 --num_samples 32
- --resume_path, -r: Path to the trained model checkpoint.
- --dataset: Path to a dataset YAML config (optional, uses checkpoint config by default).
- --clip_length: Duration of the generated clip in seconds (default: 20).
- --tau: Temperature for sampling (default: 1.0).
- --cfg: Classifier-free guidance scale (default: 1.5).
- --num_samples, -n: Number of samples to generate (default: 32).
- --dump_dir, -d: Output directory (default: ./render_results).
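For example, combining the options above to render 8 ten-second clips into a custom directory:

```bash
python infer_dataset.py -r /path/to/checkpoint --clip_length 10 --num_samples 8 --dump_dir ./my_results
```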
Generate visualizations directly from audio files, supporting one or two speakers:
# Single speaker
python infer_audio.py -r /path/to/checkpoint -a speaker0.wav
# Two speakers (dyadic conversation)
python infer_audio.py -r /path/to/checkpoint -a speaker0.wav --audio2 speaker1.wav
- --resume_path, -r: Path to the trained model checkpoint.
- --audio, -a: Path to speaker 0 audio file.
- --audio2: Path to speaker 1 audio file (optional; if omitted, only speaker 0 motion is generated).
- --tau: Temperature for sampling (default: 1.0).
- --cfg: Classifier-free guidance scale (default: 1.5).
- --dump_dir, -d: Output directory (default: ./render_results).
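For example, a dyadic run combining the options above, with a lower guidance scale and a custom output directory:

```bash
python infer_audio.py -r /path/to/checkpoint -a speaker0.wav --audio2 speaker1.wav --cfg 1.2 -d ./dyadic_results
```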
Part of our work builds on FLAME. We also thank the following projects:
If you find our work useful in your research, please consider citing:
@misc{chu2025unils,
title={UniLS: End-to-End Audio-Driven Avatars for Unified Listening and Speaking},
author={Xuangeng Chu and Ruicong Liu and Yifei Huang and Yun Liu and Yichen Peng and Bo Zheng},
year={2025},
eprint={2512.09327},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.09327},
}