Cross-speaker style transfer through explicit embedding disentanglement and self-refinement using self-augmentation
Official implementation of the SelfTTS paper
SelfTTS is a text-to-speech system designed for cross-speaker style transfer — enabling the voice characteristics and emotional style of one speaker to be applied to another. It achieves this through two key innovations:
- Explicit Embedding Disentanglement — cleanly separating speaker identity from speaking style in the latent space
- Self-Refinement via Self-Augmentation — iteratively improving output quality using its own generated data as training signal
This repository is based on the official VITS implementation; for reproducibility and adaptability, we keep its structure largely unchanged.
We use the ESD (Emotional Speech Dataset) for training.
Download the dataset through the official ESD repository, then organize it into the following structure before training:
```
files/
├── 0011_Angry/
├── 0011_Happy/
├── 0011_Neutral/
│   └── *.wav
├── ...
└── 0020_Surprise/
```
Each speaker's emotional recordings should be split into flat directories named {speaker_id}_{Emotion} (e.g., 0011_Angry, 0012_Happy). The provided filelists assume this layout — if your directory structure differs, you will need to adapt the filelists accordingly.
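If your ESD download unpacks into per-speaker folders with emotion subfolders (e.g. `0011/Angry/*.wav`; the exact layout may differ slightly across ESD releases), a small helper along these lines can copy it into the flat layout above. `flatten_esd` is a hypothetical name, not part of this repo:

```python
import shutil
from pathlib import Path

def flatten_esd(src: str, dst: str = "files") -> None:
    """Copy a {speaker}/{Emotion}/*.wav tree into the flat
    {speaker}_{Emotion}/ directories the provided filelists expect."""
    for spk_dir in sorted(Path(src).iterdir()):
        if not spk_dir.is_dir():
            continue  # skip transcript files etc.
        for emo_dir in sorted(spk_dir.iterdir()):
            if not emo_dir.is_dir():
                continue
            out = Path(dst) / f"{spk_dir.name}_{emo_dir.name}"
            out.mkdir(parents=True, exist_ok=True)
            # rglob also picks up wavs nested in split subfolders,
            # if your ESD release has them
            for wav in emo_dir.rglob("*.wav"):
                shutil.copy2(wav, out / wav.name)
```

Adapt the source path and destination to wherever you extracted the dataset.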
We provide a setup script that assumes a Conda installation. It will automatically create a new environment named selftts and install all dependencies from requirements.txt.
```
sh make_selftts_env.sh
```
Feel free to adapt the environment configuration to your own needs.
Also build the monotonic alignment extension:
```
conda activate selftts
cd monotonic_align
mkdir monotonic_align
python setup.py build_ext --inplace
```
Download the pretrained base checkpoints into `vctk_base_16k`:
```
mkdir vctk_base_16k
wget https://github.com/AI-Unicamp/SelfTTS/releases/download/v1.0.0/G_800000.pth -O vctk_base_16k/G_800000.pth
wget https://github.com/AI-Unicamp/SelfTTS/releases/download/v1.0.0/D_800000.pth -O vctk_base_16k/D_800000.pth
```
We provide a SLURM script for HPC environments. Make sure to replace {your_esd_base_path} with your ESD data path.
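A truncated download or an HTML error page saved as a `.pth` file will fail only later, at load time. A quick, torch-free sanity check (hypothetical helper, not part of the repo) catches this early, relying on the fact that `torch.save` has produced zip archives since PyTorch 1.6:

```python
import zipfile
from pathlib import Path

def looks_like_checkpoint(path: str, min_mb: float = 1.0) -> bool:
    """Cheap plausibility check for a downloaded .pth file:
    non-trivially sized and a zip archive (the torch.save format
    since PyTorch 1.6). Catches truncated or HTML-error downloads."""
    p = Path(path)
    return (
        p.is_file()
        and p.stat().st_size > min_mb * 1024 * 1024
        and zipfile.is_zipfile(p)
    )
```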
```
sbatch run_selftts.sh
```
Or run manually:
```
source ~/miniconda3/bin/activate
conda activate selftts

# Link your ESD dataset
rm -rf DUMMY3
ln -s {your_esd_base_path} DUMMY3

# Launch training
python train_ms_emotion.py -c configs/selftts_training.json -m selftts_training
```
Make sure the corresponding config file points to the right paths. If you want to train only the self-augmentation step, we provide a checkpoint of SelfTTS:
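Before launching a long training run, a small pre-flight check (hypothetical helper, not part of the repo) can confirm that the `DUMMY3` link resolves and contains the flat `{speaker_id}_{Emotion}` directories the filelists expect:

```python
from pathlib import Path

# ESD's five emotion categories
EMOTIONS = {"Angry", "Happy", "Neutral", "Sad", "Surprise"}

def check_esd_link(link: str = "DUMMY3") -> list[str]:
    """Return a list of layout problems found under the dataset link."""
    root = Path(link)
    if not root.exists():
        raise FileNotFoundError(f"{link} does not point to a valid directory")
    problems = []
    for d in sorted(p for p in root.iterdir() if p.is_dir()):
        spk, _, emo = d.name.partition("_")
        if not (spk.isdigit() and emo in EMOTIONS):
            problems.append(f"unexpected directory name: {d.name}")
        elif not any(d.glob("*.wav")):
            problems.append(f"no .wav files in {d.name}")
    return problems
```

An empty return value means the layout matches what the provided filelists assume.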
```
mkdir logs/
mkdir logs/selftts
wget https://github.com/AI-Unicamp/SelfTTS/releases/download/v1.0.0/G_200000.pth -O logs/selftts/G_200000.pth
wget https://github.com/AI-Unicamp/SelfTTS/releases/download/v1.0.0/D_200000.pth -O logs/selftts/D_200000.pth
```
We provide a dedicated SLURM script. Make sure to replace {your_esd_base_path} with your ESD data path.
```
sbatch run_selftts_selfaugmentation.sh
```
Or run manually:
```
source ~/miniconda3/bin/activate
conda activate selftts

# Link your ESD dataset
rm -rf DUMMY3
ln -s {your_esd_base_path} DUMMY3

# Launch self-augmentation training
python train_ms_emotion_selfaug.py -c configs/selftts_selfaugmentation.json -m selftts_selfaugmentation
```
This work builds upon and is inspired by the following open-source projects:
| Project | Description |
|---|---|
| VITS | End-to-end TTS backbone |
| syn-rep-learn | Synthetic representation learning |
| Coqui TTS | TTS toolkit reference |
Citation coming soon — paper under review.
```
@article{ueda2026selftts,
  title  = {SelfTTS: Cross-Speaker Style Transfer through Explicit Embedding
            Disentanglement and Self-Refinement using Self-Augmentation},
  author = {Ueda, Lucas H. and Lima, João G.T. and Corrêa, Pedro R. and Costa, Paula D.P.},
  year   = {2026},
  note   = {Coming soon}
}
```
```
@software{ueda2026selfttsrepository,
  author    = {Lucas Hideki Ueda},
  title     = {AI-Unicamp/SelfTTS: v1.0.1},
  month     = feb,
  year      = 2026,
  publisher = {Zenodo},
  version   = {v1.0.1},
  doi       = {10.5281/zenodo.18744290},
  url       = {https://doi.org/10.5281/zenodo.18744290},
  swhid     = {swh:1:dir:ea47be5c0d3941f26121b5db9131112197a08f95;origin=https://doi.org/10.5281/zenodo.18744289;visit=swh:1:snp:8dbb8ca41c05a303bfdefd2bc07125b0eefb18f6;anchor=swh:1:rel:f264b90856ae8521cb8307b07c3c7669d539d3f7;path=AI-Unicamp-SelfTTS-8a6e4ac},
}
```