UniLS: End-to-End Audio-Driven Avatars for Unified Listening and Speaking


Xuangeng Chu*¹, Ruicong Liu*¹†, Yifei Huang¹, Yun Liu², Yichen Peng³, Bo Zheng²
¹Shanda AI Research Tokyo, The University of Tokyo; ²Shanda AI Research Tokyo; ³Institute of Science Tokyo
*Equal contribution, †Corresponding author

🤩 CVPR 2026 🤩

UniLS generates diverse and natural listening and speaking motions from audio.

Installation

Clone the project

git clone --recurse-submodules git@github.com:xg-chu/UniLS.git
cd UniLS

Build environment

conda env create -f environment.yml
conda activate unils

Or install manually:

pip install torch torchvision torchaudio
pip install accelerate transformers peft einops omegaconf lmdb tqdm scipy wandb
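
To quickly check that the core dependencies are importable and that a GPU is visible, you can run a small sanity script (not part of the repository):

import torch
import accelerate
import transformers

# Print library versions and confirm CUDA is visible.
print(torch.__version__, accelerate.__version__, transformers.__version__)
print("CUDA available:", torch.cuda.is_available())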

Pretrained Models

Download the pretrained UniLS models from HuggingFace.

For FLAME support, download the official FLAME 2020 model from the FLAME website, then convert it by running python tools/convert_flame.py. This generates a flame_2020.pt file, which can be loaded without chumpy.
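
To verify the converted file, a minimal check is sketched below. It assumes flame_2020.pt is an ordinary PyTorch checkpoint whose top level is a dictionary of tensors; the exact key names depend on tools/convert_flame.py.

import torch

# Load the converted FLAME 2020 model (no chumpy required).
flame = torch.load("flame_2020.pt", map_location="cpu")

# List the stored entries; key names depend on the conversion script.
for key, value in flame.items():
    shape = tuple(value.shape) if hasattr(value, "shape") else type(value).__name__
    print(key, shape)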

Data

Download the dataset from UniLS-Talk Dataset.

Training

UniLS follows a three-stage training pipeline:

Stage 1: Motion Codec (VAE)

python train.py -c unils_codec

Stage 2: Audio-Free Autoregressive Generator

Modify the VAE_PATH entry in the config file to point to the Stage 1 checkpoint, then run:

python train.py -c unils_freegen

Stage 3: Audio-Conditioned LoRA Fine-tuning

Modify the PRETRAIN_PATH entry in the config file to point to the Stage 2 checkpoint, then run:

python train.py -c unils_loragen
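
Before launching Stage 2 or Stage 3, it can help to confirm that the checkpoint paths set in the configs actually resolve. The sketch below uses OmegaConf (already a dependency); the config file names and key locations are assumptions, so adjust them to the files selected by -c.

import os
from omegaconf import OmegaConf

# Hypothetical config file names; adjust to the files used with `-c unils_freegen` / `-c unils_loragen`.
checks = [
    ("configs/unils_freegen.yaml", "VAE_PATH"),
    ("configs/unils_loragen.yaml", "PRETRAIN_PATH"),
]
for cfg_path, key in checks:
    cfg = OmegaConf.load(cfg_path)
    ckpt = OmegaConf.select(cfg, key)  # None if the key sits elsewhere in the config tree
    print(cfg_path, key, "->", ckpt, "| exists:", bool(ckpt) and os.path.exists(str(ckpt)))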

Evaluation

Run evaluation with multi-GPU support via Accelerate:

accelerate launch eval.py -r /path/to/checkpoint --tau 1.0 --cfg 1.5

You can also pass an external dataset config to override the checkpoint's dataset:

accelerate launch eval.py -r /path/to/checkpoint --dataset configs/dataset.yaml
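
The --tau and --cfg flags control sampling: tau is the softmax temperature and cfg is the classifier-free guidance scale. The sketch below shows how these two knobs typically combine in autoregressive token sampling; it is a generic illustration, not the repository's implementation.

import torch

def sample_next_token(cond_logits, uncond_logits, tau=1.0, cfg=1.5):
    # Classifier-free guidance: move conditional logits further from the unconditional ones.
    guided = uncond_logits + cfg * (cond_logits - uncond_logits)
    # Temperature: tau > 1 flattens the distribution (more diverse), tau < 1 sharpens it.
    probs = torch.softmax(guided / tau, dim=-1)
    return torch.multinomial(probs, num_samples=1)

# Toy usage with random logits over a hypothetical 512-entry motion codebook.
cond, uncond = torch.randn(1, 512), torch.randn(1, 512)
print(sample_next_token(cond, uncond).item())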

Inference

From Dataset

Generate visualizations from the dataset:

python infer_dataset.py -r /path/to/checkpoint --clip_length 20 --tau 1.0 --cfg 1.5 --num_samples 32
  • --resume_path, -r: Path to the trained model checkpoint.
  • --dataset: Path to a dataset YAML config (optional, uses checkpoint config by default).
  • --clip_length: Duration of the generated clip in seconds (default: 20).
  • --tau: Temperature for sampling (default: 1.0).
  • --cfg: Classifier-free guidance scale (default: 1.5).
  • --num_samples, -n: Number of samples to generate (default: 32).
  • --dump_dir, -d: Output directory (default: ./render_results).

From Audio Files

Generate visualizations directly from audio files, supporting one or two speakers:

# Single speaker
python infer_audio.py -r /path/to/checkpoint -a speaker0.wav

# Two speakers (dyadic conversation)
python infer_audio.py -r /path/to/checkpoint -a speaker0.wav --audio2 speaker1.wav
  • --resume_path, -r: Path to the trained model checkpoint.
  • --audio, -a: Path to speaker 0 audio file.
  • --audio2: Path to speaker 1 audio file (optional; if omitted, only speaker 0 motion is generated).
  • --tau: Temperature for sampling (default: 1.0).
  • --cfg: Classifier-free guidance scale (default: 1.5).
  • --dump_dir, -d: Output directory (default: ./render_results).
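
infer_audio.py takes one .wav file per speaker. The README does not state the expected audio format; if your recordings are stereo or use an unusual sample rate, a conservative preprocessing step is to convert them to 16 kHz mono (a common input format for speech encoders). The target rate below is an assumption, not a documented requirement.

import torchaudio

# Convert a recording to 16 kHz mono before passing it to infer_audio.py.
# The 16 kHz target is an assumption; adjust if the model expects a different rate.
waveform, sr = torchaudio.load("speaker0_raw.wav")
waveform = waveform.mean(dim=0, keepdim=True)                   # downmix to mono
waveform = torchaudio.functional.resample(waveform, sr, 16000)  # resample to 16 kHz
torchaudio.save("speaker0.wav", waveform, 16000)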

Acknowledgements

Part of our work is built on FLAME. We also thank the following projects:

Citation

If you find our work useful in your research, please consider citing:

@misc{chu2025unils,
      title={UniLS: End-to-End Audio-Driven Avatars for Unified Listening and Speaking}, 
      author={Xuangeng Chu and Ruicong Liu and Yifei Huang and Yun Liu and Yichen Peng and Bo Zheng},
      year={2025},
      eprint={2512.09327},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.09327}, 
}
