
# SelfTTS

Cross-speaker style transfer through explicit embedding disentanglement and self-refinement using self-augmentation


Official implementation of the SelfTTS paper


## ✨ Overview

SelfTTS is a text-to-speech system designed for cross-speaker style transfer — enabling the voice characteristics and emotional style of one speaker to be applied to another. It achieves this through two key innovations:

- **Explicit Embedding Disentanglement** — cleanly separating speaker identity from speaking style in the latent space
- **Self-Refinement via Self-Augmentation** — iteratively improving output quality by using the model's own generated data as a training signal
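
Conceptually, disentanglement gives the model two independent conditioning vectors, so cross-speaker style transfer reduces to pairing one speaker's identity embedding with a different style embedding. A minimal NumPy sketch of this idea (illustrative only — table sizes, dimensions, and the lookup scheme are not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embedding tables: 10 speakers, 5 emotion styles (illustrative sizes).
N_SPEAKERS, N_STYLES, DIM = 10, 5, 8
speaker_table = rng.normal(size=(N_SPEAKERS, DIM))
style_table = rng.normal(size=(N_STYLES, DIM))

def conditioning(speaker_id: int, style_id: int) -> np.ndarray:
    """Disentangled conditioning: identity and style are looked up
    independently and concatenated, so either can be swapped freely."""
    return np.concatenate([speaker_table[speaker_id], style_table[style_id]])

# Cross-speaker style transfer: keep speaker 0's identity, swap in another style.
neutral = conditioning(0, 2)
transferred = conditioning(0, 4)

# Only the style half changes; the speaker half is identical.
assert np.allclose(neutral[:DIM], transferred[:DIM])
assert not np.allclose(neutral[DIM:], transferred[DIM:])
```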

This repository is based on the official VITS implementation; for reproducibility and ease of adaptation, we keep the structure largely unchanged.


## 📦 Dataset

We use the ESD (Emotional Speech Dataset) for training.

Download the dataset through the official ESD repository, then organize it into the following structure before training:

```
files/
├── 0011_Angry/
├── 0011_Happy/
├── 0011_Neutral/
│   └── *.wav
├── ...
└── 0020_Surprise/
```

Each speaker's emotional recordings should be split into flat directories named {speaker_id}_{Emotion} (e.g., 0011_Angry, 0012_Happy). The provided filelists assume this layout — if your directory structure differs, you will need to adapt the filelists accordingly.
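
The reorganization can be scripted. A small sketch, assuming the ESD download keeps its per-speaker, per-emotion layout (`<speaker>/<Emotion>/*.wav` — verify against your copy of the dataset before running):

```python
import shutil
from pathlib import Path

def flatten_esd(src: Path, dst: Path) -> None:
    """Copy ESD wavs from <src>/<speaker>/<Emotion>/*.wav into the flat
    <dst>/<speaker>_<Emotion>/ layout the provided filelists expect."""
    for speaker_dir in sorted(src.iterdir()):
        if not speaker_dir.is_dir():
            continue
        for emotion_dir in sorted(speaker_dir.iterdir()):
            if not emotion_dir.is_dir():
                continue
            out = dst / f"{speaker_dir.name}_{emotion_dir.name}"
            out.mkdir(parents=True, exist_ok=True)
            for wav in emotion_dir.glob("*.wav"):
                shutil.copy2(wav, out / wav.name)

# Example (paths are placeholders):
# flatten_esd(Path("/data/ESD"), Path("files"))
```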


## 🛠️ Environment Setup

We provide a setup script that assumes a Conda installation. It will automatically create a new environment named selftts and install all dependencies from requirements.txt.

```sh
sh make_selftts_env.sh
```

Feel free to adapt the environment configuration to your own needs.

Then build the monotonic alignment module:

```sh
conda activate selftts
cd monotonic_align
mkdir monotonic_align
python setup.py build_ext --inplace
```

## 🚀 Training

### Step 1 — Download the Base VCTK Model

```sh
mkdir vctk_base_16k
wget https://github.com/AI-Unicamp/SelfTTS/releases/download/v1.0.0/G_800000.pth -O vctk_base_16k/G_800000.pth
wget https://github.com/AI-Unicamp/SelfTTS/releases/download/v1.0.0/D_800000.pth -O vctk_base_16k/D_800000.pth
```

### Step 2 — Train SelfTTS

We provide a SLURM script for HPC environments. Make sure to replace {your_esd_base_path} with the path to your ESD data.

```sh
sbatch run_selftts.sh
```

Or run manually:

```sh
source ~/miniconda3/bin/activate
conda activate selftts

# Link your ESD dataset
rm -rf DUMMY3
ln -s {your_esd_base_path} DUMMY3

# Launch training
python train_ms_emotion.py -c configs/selftts_training.json -m selftts_training
```
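
If a symlink is impractical on your system, the `DUMMY3` placeholder can instead be rewritten in the filelists themselves. A small sketch, assuming VITS-style filelists where the pipe-separated first field is the wav path (the exact filelist format and filenames in this repo may differ, so check before use):

```python
from pathlib import Path

def rewrite_filelist(path: Path, esd_base: str) -> None:
    """Replace the DUMMY3 placeholder prefix in each entry's wav path
    with the real ESD base path (assumes 'wav_path|...' lines)."""
    lines = path.read_text(encoding="utf-8").splitlines()
    fixed = []
    for line in lines:
        wav, sep, rest = line.partition("|")
        fixed.append(wav.replace("DUMMY3", esd_base, 1) + sep + rest)
    path.write_text("\n".join(fixed) + "\n", encoding="utf-8")

# Example (filelist name is a placeholder):
# rewrite_filelist(Path("filelists/esd_train.txt"), "/data/ESD")
```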

### Step 3 — Train Self-Refinement with Self-Augmentation

Make sure the paths in the corresponding config file are correct. If you only want to train the self-augmentation step, we provide a SelfTTS checkpoint:

```sh
mkdir -p logs/selftts
wget https://github.com/AI-Unicamp/SelfTTS/releases/download/v1.0.0/G_200000.pth -O logs/selftts/G_200000.pth
wget https://github.com/AI-Unicamp/SelfTTS/releases/download/v1.0.0/D_200000.pth -O logs/selftts/D_200000.pth
```

We provide a dedicated SLURM script. Make sure to replace {your_esd_base_path} with the path to your ESD data.

```sh
sbatch run_selftts_selfaugmentation.sh
```

Or run manually:

```sh
source ~/miniconda3/bin/activate
conda activate selftts

# Link your ESD dataset
rm -rf DUMMY3
ln -s {your_esd_base_path} DUMMY3

# Launch self-augmentation training
python train_ms_emotion_selfaug.py -c configs/selftts_selfaugmentation.json -m selftts_selfaugmentation
```

## 🔗 Acknowledgements

This work builds upon and is inspired by the following open-source projects:

| Project | Description |
| --- | --- |
| VITS | End-to-end TTS backbone |
| syn-rep-learn | Synthetic representation learning |
| Coqui TTS | TTS toolkit reference |

## 📄 Citation

Citation coming soon — paper under review.

```bibtex
@article{ueda2026selftts,
  title   = {SelfTTS: Cross-Speaker Style Transfer through Explicit Embedding
             Disentanglement and Self-Refinement using Self-Augmentation},
  author  = {Ueda, Lucas H. and Lima, João G.T. and Corrêa, Pedro R. and Costa, Paula D.P.},
  year    = {2026},
  note    = {Coming soon}
}

@software{ueda2026selfttsrepository,
  author    = {Lucas Hideki Ueda},
  title     = {AI-Unicamp/SelfTTS: v1.0.1},
  month     = feb,
  year      = 2026,
  publisher = {Zenodo},
  version   = {v1.0.1},
  doi       = {10.5281/zenodo.18744290},
  url       = {https://doi.org/10.5281/zenodo.18744290},
  swhid     = {swh:1:dir:ea47be5c0d3941f26121b5db9131112197a08f95;origin=https://doi.org/10.5281/zenodo.18744289;visit=swh:1:snp:8dbb8ca41c05a303bfdefd2bc07125b0eefb18f6;anchor=swh:1:rel:f264b90856ae8521cb8307b07c3c7669d539d3f7;path=AI-Unicamp-SelfTTS-8a6e4ac},
}
```
