Xuangeng Chu*1
Ruicong Liu*1†
Yifei Huang1
Yun Liu2
Yichen Peng3
Bo Zheng2
1Shanda AI Research Tokyo, The University of Tokyo,
2Shanda AI Research Tokyo,
3Institute of Science Tokyo
*Equal contribution, †Corresponding author
git clone --recurse-submodules git@github.com:xg-chu/UniLS.git
cd UniLS
conda env create -f environment.yml
conda activate unils
Or install manually:
pip install torch torchvision torchaudio
pip install accelerate transformers peft einops omegaconf lmdb tqdm scipy wandb
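You can quickly verify that the environment is usable before downloading any assets (a minimal sanity check, not part of the official setup):

```bash
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```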
Download the pretrained UniLS models from HuggingFace.
For FLAME support, download the official FLAME 2020 model from the official FLAME website, then convert it by running python tools/convert_flame.py. This generates a flame_2020.pt file that can be loaded without chumpy.
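As a quick sanity check, the converted file should load with plain torch.load and no chumpy import. A minimal sketch (the actual contents of flame_2020.pt depend on the conversion script, so the key inspection below is only illustrative):

```python
import torch

# Load the FLAME 2020 parameters converted by tools/convert_flame.py;
# the converted .pt file does not require chumpy.
flame = torch.load("flame_2020.pt", map_location="cpu")

# Inspect what the conversion produced.
print(type(flame))
if isinstance(flame, dict):
    print(list(flame.keys()))
```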
Download the dataset from UniLS-Talk Dataset.
UniLS follows a three-stage training pipeline:
Stage 1: Motion Codec (VAE)
python train.py -c unils_codec
Stage 2: Audio-Free Autoregressive Generator
Modify the VAE_PATH entry in the config file to point to the Stage 1 checkpoint, then run:
python train.py -c unils_freegen
Stage 3: Audio-Conditioned LoRA Fine-tuning
Modify the PRETRAIN_PATH entry in the config file to point to the Stage 2 checkpoint (see the config excerpt below), then run:
python train.py -c unils_loragen
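The two fine-tuning stages locate their upstream checkpoints through config entries. A hypothetical excerpt of the relevant keys (file names follow the -c arguments above; the exact YAML structure is defined by the configs shipped in the repository, and the paths are placeholders):

```yaml
# configs/unils_freegen.yaml (Stage 2): point VAE_PATH to the Stage 1 codec checkpoint
VAE_PATH: /path/to/unils_codec/checkpoint.pt

# configs/unils_loragen.yaml (Stage 3): point PRETRAIN_PATH to the Stage 2 generator checkpoint
PRETRAIN_PATH: /path/to/unils_freegen/checkpoint.pt
```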
Run evaluation with multi-GPU support via Accelerate:
accelerate launch eval.py -r /path/to/checkpoint --tau 1.0 --cfg 1.5
You can also pass an external dataset config to override the checkpoint's dataset:
accelerate launch eval.py -r /path/to/checkpoint --dataset configs/dataset.yaml
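The number of GPUs is controlled through Accelerate's standard launcher flags (generic Accelerate options, not UniLS-specific):

```bash
# Evaluate on 4 GPUs
accelerate launch --multi_gpu --num_processes 4 eval.py -r /path/to/checkpoint --tau 1.0 --cfg 1.5
```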
Generate visualizations from the dataset:
python infer_dataset.py -r /path/to/checkpoint --clip_length 20 --tau 1.0 --cfg 1.5 --num_samples 32
- --resume_path, -r: Path to the trained model checkpoint.
- --dataset: Path to a dataset YAML config (optional, uses checkpoint config by default).
- --clip_length: Duration of the generated clip in seconds (default: 20).
- --tau: Temperature for sampling (default: 1.0).
- --cfg: Classifier-free guidance scale (default: 1.5).
- --num_samples, -n: Number of samples to generate (default: 32).
- --dump_dir, -d: Output directory (default: ./render_results).
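For example, combining the options above to render 8 ten-second clips into a custom directory:

```bash
python infer_dataset.py -r /path/to/checkpoint --clip_length 10 --num_samples 8 --dump_dir ./my_results
```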
Generate visualizations directly from audio files, supporting one or two speakers:
# Single speaker
python infer_audio.py -r /path/to/checkpoint -a speaker0.wav
# Two speakers (dyadic conversation)
python infer_audio.py -r /path/to/checkpoint -a speaker0.wav --audio2 speaker1.wav
- --resume_path, -r: Path to the trained model checkpoint.
- --audio, -a: Path to speaker 0 audio file.
- --audio2: Path to speaker 1 audio file (optional; if omitted, only speaker 0 motion is generated).
- --tau: Temperature for sampling (default: 1.0).
- --cfg: Classifier-free guidance scale (default: 1.5).
- --dump_dir, -d: Output directory (default: ./render_results).
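For example, a dyadic run combining the options above, with a lower guidance scale and a custom output directory:

```bash
python infer_audio.py -r /path/to/checkpoint -a speaker0.wav --audio2 speaker1.wav --cfg 1.2 -d ./dyadic_results
```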
Part of our work builds on FLAME. We also thank the following projects:
If you find our work useful in your research, please consider citing:
@misc{chu2025unils,
title={UniLS: End-to-End Audio-Driven Avatars for Unified Listening and Speaking},
author={Xuangeng Chu and Ruicong Liu and Yifei Huang and Yun Liu and Yichen Peng and Bo Zheng},
year={2025},
eprint={2512.09327},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.09327},
}