Cross-speaker style transfer through explicit embedding disentanglement and self-refinement using self-augmentation
Official implementation of the SelfTTS paper
SelfTTS is a text-to-speech system designed for cross-speaker style transfer — enabling the voice characteristics and emotional style of one speaker to be applied to another. It achieves this through two key innovations:
- Explicit Embedding Disentanglement — cleanly separating speaker identity from speaking style in the latent space
- Self-Refinement via Self-Augmentation — iteratively improving output quality using its own generated data as training signal
This repository is based on the official VITS implementation; for reproducibility and adaptability, we keep its structure largely unchanged.
We use the ESD (Emotional Speech Dataset) for training.
Download the dataset through the official ESD repository, then organize it into the following structure before training:
```
files/
├── 0011_Angry/
├── 0011_Happy/
├── 0011_Neutral/
│   └── *.wav
├── ...
└── 0020_Surprise/
```
Each speaker's emotional recordings should be split into flat directories named {speaker_id}_{Emotion} (e.g., 0011_Angry, 0012_Happy). The provided filelists assume this layout — if your directory structure differs, you will need to adapt the filelists accordingly.
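If your ESD download unpacks into per-speaker folders with emotion subfolders (e.g. `0011/Angry/*.wav`; the exact layout may differ slightly across ESD releases), a small helper along these lines can copy it into the flat layout above. `flatten_esd` is a hypothetical name, not part of this repo:

```python
import shutil
from pathlib import Path

def flatten_esd(src: str, dst: str = "files") -> None:
    """Copy a {speaker}/{Emotion}/*.wav tree into the flat
    {speaker}_{Emotion}/ directories the provided filelists expect."""
    for spk_dir in sorted(Path(src).iterdir()):
        if not spk_dir.is_dir():
            continue  # skip transcript files etc.
        for emo_dir in sorted(spk_dir.iterdir()):
            if not emo_dir.is_dir():
                continue
            out = Path(dst) / f"{spk_dir.name}_{emo_dir.name}"
            out.mkdir(parents=True, exist_ok=True)
            # rglob also picks up wavs nested in split subfolders,
            # if your ESD release has them
            for wav in emo_dir.rglob("*.wav"):
                shutil.copy2(wav, out / wav.name)
```

Adapt the source path and destination to wherever you extracted the dataset.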
We provide a setup script that assumes a Conda installation. It will automatically create a new environment named selftts and install all dependencies from requirements.txt.
```
sh make_selftts_env.sh
```
Feel free to adapt the environment configuration to your own needs.
Also build the monotonic alignment extension:
```
conda activate selftts
cd monotonic_align
mkdir monotonic_align
python setup.py build_ext --inplace
```
Download the pretrained base checkpoints into `vctk_base_16k`:
```
mkdir vctk_base_16k
wget https://github.com/AI-Unicamp/SelfTTS/releases/download/v1.0.0/G_800000.pth -O vctk_base_16k/G_800000.pth
wget https://github.com/AI-Unicamp/SelfTTS/releases/download/v1.0.0/D_800000.pth -O vctk_base_16k/D_800000.pth
```
We provide a SLURM script for HPC environments. Make sure to replace {your_esd_base_path} with your ESD data path.
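A truncated download or an HTML error page saved as a `.pth` file will fail only later, at load time. A quick, torch-free sanity check (hypothetical helper, not part of the repo) catches this early, relying on the fact that `torch.save` has produced zip archives since PyTorch 1.6:

```python
import zipfile
from pathlib import Path

def looks_like_checkpoint(path: str, min_mb: float = 1.0) -> bool:
    """Cheap plausibility check for a downloaded .pth file:
    non-trivially sized and a zip archive (the torch.save format
    since PyTorch 1.6). Catches truncated or HTML-error downloads."""
    p = Path(path)
    return (
        p.is_file()
        and p.stat().st_size > min_mb * 1024 * 1024
        and zipfile.is_zipfile(p)
    )
```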
```
sbatch run_selftts.sh
```
Or run manually:
```
source ~/miniconda3/bin/activate
conda activate selftts

# Link your ESD dataset
rm -rf DUMMY3
ln -s {your_esd_base_path} DUMMY3

# Launch training
python train_ms_emotion.py -c configs/selftts_training.json -m selftts_training
```
Make sure the corresponding config file points to the right paths. If you want to train only the self-augmentation step, we provide a checkpoint of SelfTTS:
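Before launching a long training run, a small pre-flight check (hypothetical helper, not part of the repo) can confirm that the `DUMMY3` link resolves and contains the flat `{speaker_id}_{Emotion}` directories the filelists expect:

```python
from pathlib import Path

# ESD's five emotion categories
EMOTIONS = {"Angry", "Happy", "Neutral", "Sad", "Surprise"}

def check_esd_link(link: str = "DUMMY3") -> list[str]:
    """Return a list of layout problems found under the dataset link."""
    root = Path(link)
    if not root.exists():
        raise FileNotFoundError(f"{link} does not point to a valid directory")
    problems = []
    for d in sorted(p for p in root.iterdir() if p.is_dir()):
        spk, _, emo = d.name.partition("_")
        if not (spk.isdigit() and emo in EMOTIONS):
            problems.append(f"unexpected directory name: {d.name}")
        elif not any(d.glob("*.wav")):
            problems.append(f"no .wav files in {d.name}")
    return problems
```

An empty return value means the layout matches what the provided filelists assume.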
```
mkdir logs/
mkdir logs/selftts
wget https://github.com/AI-Unicamp/SelfTTS/releases/download/v1.0.0/G_200000.pth -O logs/selftts/G_200000.pth
wget https://github.com/AI-Unicamp/SelfTTS/releases/download/v1.0.0/D_200000.pth -O logs/selftts/D_200000.pth
```
We provide a dedicated SLURM script. Make sure to replace {your_esd_base_path} with your ESD data path.
```
sbatch run_selftts_selfaugmentation.sh
```
Or run manually:
```
source ~/miniconda3/bin/activate
conda activate selftts

# Link your ESD dataset
rm -rf DUMMY3
ln -s {your_esd_base_path} DUMMY3

# Launch self-augmentation training
python train_ms_emotion_selfaug.py -c configs/selftts_selfaugmentation.json -m selftts_selfaugmentation
```
This work builds upon and is inspired by the following open-source projects:
| Project | Description |
|---|---|
| VITS | End-to-end TTS backbone |
| syn-rep-learn | Synthetic representation learning |
| Coqui TTS | TTS toolkit reference |
Citation coming soon — paper under review.
```
@article{ueda2026selftts,
  title  = {SelfTTS: Cross-Speaker Style Transfer through Explicit Embedding
            Disentanglement and Self-Refinement using Self-Augmentation},
  author = {Ueda, Lucas H. and Lima, João G.T. and Corrêa, Pedro R. and Costa, Paula D.P.},
  year   = {2026},
  note   = {Coming soon}
}
```
```
@software{ueda2026selfttsrepository,
  author    = {Lucas Hideki Ueda},
  title     = {AI-Unicamp/SelfTTS: v1.0.1},
  month     = feb,
  year      = 2026,
  publisher = {Zenodo},
  version   = {v1.0.1},
  doi       = {10.5281/zenodo.18744290},
  url       = {https://doi.org/10.5281/zenodo.18744290},
  swhid     = {swh:1:dir:ea47be5c0d3941f26121b5db9131112197a08f95;origin=https://doi.org/10.5281/zenodo.18744289;visit=swh:1:snp:8dbb8ca41c05a303bfdefd2bc07125b0eefb18f6;anchor=swh:1:rel:f264b90856ae8521cb8307b07c3c7669d539d3f7;path=AI-Unicamp-SelfTTS-8a6e4ac},
}
```