This repository aligns lyrics transcripts with the corresponding audio signals. The audio may contain solo singing or singing voice mixed with other instruments. The repository includes a trained deep neural network that performs alignment and singing voice separation jointly.
Details about the model, training, and data are described in the associated paper:
Schulze-Forster, K., Doire, C., Richard, G., & Badeau, R. "Phoneme Level Lyrics Alignment and Text-Informed Singing Voice Separation." IEEE/ACM Transactions on Audio, Speech and Language Processing (2021). DOI: 10.1109/TASLP.2021.3091817. Public version available here.
If you use the model or code, please cite the paper:
@article{schulze2021phoneme,
author={Schulze-Forster, Kilian and Doire, Clement S. J. and Richard, Gaël and Badeau, Roland},
journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
title={Phoneme Level Lyrics Alignment and Text-Informed Singing Voice Separation},
year={2021},
volume={29},
number={},
pages={2382-2395},
doi={10.1109/TASLP.2021.3091817}
}
- Clone the repository:

  ```bash
  git clone https://github.com/Caio99BR/lyrics-aligner.git
  cd lyrics-aligner
  ```

- Install the required packages with pip:

  ```bash
  pip install pyqt5 decorator ffmpeg audioread resampy librosa pysoundfile praatio torchvision torchaudio paramiko cryptography pyopenssl
  ```
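Optionally, you can run a quick sanity check to confirm that the main dependencies import correctly (this one-liner is just a suggestion, not part of the repository):

```bash
python -c "import librosa, torch, torchaudio, praatio; print('dependencies OK')"
```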
- Place all audio files in the input directory (sub-folders are allowed). Audio files are loaded with librosa, so all formats supported by librosa (e.g., .wav, .mp3) are accepted; see the librosa documentation for details.
- Place all lyrics files in the input directory (sub-folders are allowed). Each .txt lyrics file must have the same name as the corresponding audio file (e.g., `input/song1.wav` ➝ `input/song1.txt`).
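For example, the input directory could be laid out as follows (the file and folder names here are purely illustrative):

```
input/
├── song1.wav
├── song1.txt
└── album1/
    ├── song2.mp3
    └── song2.txt
```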
You can provide lyrics as words or as phonemes.

If providing lyrics as phonemes:
- Use only the 39 ARPAbet phonemes listed here.
- One phoneme per line.
- The first and last symbols should be the space character `>`.
- Use `>` between words or wherever silence is expected.
Note: If lyrics are given as phonemes, only phoneme onsets will be computed.
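For illustration only (the words and phoneme choices below are hypothetical, not taken from the repository), a phoneme file for the lyrics "hello world" could look like this:

```
>
HH
AH
L
OW
>
W
ER
L
D
>
```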
If providing lyrics as words:

- Create a list of unique words:

  ```bash
  python make_word_list.py PATH/TO/LYRICS --dataset-name NAME
  ```

- Go to CMU LexTool and upload `NAME_word_list.txt`.
- Copy the generated `.dict` file content and paste it into `input/NAME_word2phoneme.txt`.
- Convert it into a phoneme dictionary:

  ```bash
  python make_word2phoneme_dict.py --dataset-name NAME
  ```
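As a rough illustration, the pasted `input/NAME_word2phoneme.txt` content lists one word per line followed by its phonemes. The words below are hypothetical, and the exact LexTool output may differ slightly (e.g., it may include stress digits):

```
HELLO  HH AH L OW
WORLD  W ER L D
```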
To compute phoneme and/or word onsets:

```bash
python align.py PATH/TO/INPUTS --lyrics-format w --onsets p --dataset-name dataset1 --vad-threshold 0
```
- `--lyrics-format` Must be `w` if the lyrics are provided as words (and have been processed as described above) and `p` if the lyrics are provided as phonemes.
- `--onsets` Set to `p` to compute phoneme onsets, `w` to compute word onsets, or `pw` to compute both (only possible if lyrics are provided as words).
- `--dataset-name` Should be the same name as used for data preparation above.
- `--vad-threshold` The model also computes an estimate of the isolated singing voice, which can be used as a Voice Activity Detector (VAD). This may be useful in challenging scenarios where the singer makes long pauses while instruments are playing (e.g., intro, soli, outro). The magnitude of the vocals estimate is computed, and a threshold (float) can be set here to discriminate between active and inactive voice given that magnitude. The default is 0, which means no VAD is used. The optimal value for a given audio signal may be difficult to determine, as it depends on the loudness of the voice. In our experiments, we used values between 0 and 30. You could print or plot the voice magnitude (computed in line 235) to get an intuition for an appropriate value (see the sketch after this list). We recommend using this option only if large errors occur on audio files with long instrumental sections.
- `PATH/TO/INPUTS` Specifies the directory containing the input audio and lyrics files. By default, the script looks for files in the `inputs` directory. To use a different directory, provide its path.
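For instance, two further invocations of `align.py` (directory and dataset names are placeholders mirroring the example above):

```bash
# Word-level lyrics, compute both phoneme and word onsets:
python align.py PATH/TO/INPUTS --lyrics-format w --onsets pw --dataset-name dataset1

# Phoneme-level lyrics, compute phoneme onsets, using a VAD threshold:
python align.py PATH/TO/INPUTS --lyrics-format p --onsets p --dataset-name dataset1 --vad-threshold 10
```

If you want to inspect the voice magnitude before choosing `--vad-threshold`, the following is a minimal sketch. It assumes you have added a line such as `np.save("voice_magnitude.npy", voice_magnitude)` next to the magnitude computation around line 235 of `align.py`; the variable and file names are hypothetical, and the magnitude is assumed to be a per-frame array.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical file saved from align.py next to the magnitude computation.
voice_magnitude = np.load("voice_magnitude.npy")

# If the saved array is 2-D (frequency x time), reduce it to one value per frame first.
if voice_magnitude.ndim == 2:
    voice_magnitude = voice_magnitude.max(axis=0)

plt.plot(voice_magnitude, label="estimated vocals magnitude")
plt.axhline(10.0, color="r", linestyle="--", label="candidate --vad-threshold")  # try values between 0 and 30
plt.xlabel("frame")
plt.ylabel("magnitude")
plt.legend()
plt.show()
```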
This project received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 765068.
© 2021 Kilian Schulze-Forster, Télécom Paris, Institut Polytechnique de Paris. All rights reserved.