# Official PyTorch Implementation of "An Efficient Transformer-Based Model for Voice Activity Detection"
Voice Activity Detection (VAD) aims to distinguish, at a given time, between desired speech and non-speech. Although many state-of-the-art approaches for improving VAD performance have been proposed, they are still not robust enough for adverse noise conditions with a low Signal-to-Noise Ratio (SNR). To address this issue, we propose a novel transformer-based architecture for VAD with reduced computational complexity, obtained by applying efficient depth-wise convolutions to feature patches. The proposed model, named Tr-VAD, demonstrates better performance than baseline methods from the literature across the scenarios considered, while using the smallest number of parameters. The results also indicate that using a combination of Audio Fingerprinting (AFP) features with Tr-VAD can guarantee better performance.
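The complexity saving from depth-wise (separable) convolutions can be seen with a simple parameter count. The channel and kernel sizes below are illustrative only, not the actual Tr-VAD configuration:

```python
# Parameter counts for a standard vs. a depth-wise separable convolution.
# Shapes are illustrative; they are not the actual Tr-VAD settings.

def standard_conv_params(c_in, c_out, k):
    """Standard 2-D convolution: every output channel sees every input channel."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    """Depth-wise conv (one k x k filter per input channel) followed by
    a 1x1 point-wise conv that mixes channels."""
    depthwise = c_in * k * k
    pointwise = c_in * c_out
    return depthwise + pointwise

c_in, c_out, k = 64, 64, 3
std = standard_conv_params(c_in, c_out, k)        # 64*64*9 = 36864
dws = depthwise_separable_params(c_in, c_out, k)  # 64*9 + 64*64 = 4672
print(std, dws, round(std / dws, 1))
```

For 64 channels and a 3x3 kernel, the separable form uses roughly 8x fewer parameters, which is the kind of saving the abstract refers to.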
- Requirement: Python version >= 3.7. Install the dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Download and unzip the datasets.
- Clone the repo:

  ```bash
  git clone https://github.com/Yifei-ZHAO96/Tr-VAD.git
  cd Tr-VAD
  ```
- Data Preparation:

  ```bash
  # requires the TIMIT dataset and a noise dataset to be downloaded and unzipped first
  python data_gen.py <path/to/TIMIT/dataset> <path/to/noise/dataset> -sr 16000 -silence_pad 1
  ```

  For more information, please read the instructions in `data_generation/README.md`.
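Data preparation of this kind mixes noise into clean speech at chosen SNR levels. The core of that augmentation can be sketched as follows; this is a generic illustration with a hypothetical function name, not the actual `data_gen.py` code:

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`,
    then add it to `speech`. Both are equal-length lists of float samples.
    Generic sketch; not the actual data_gen.py implementation."""
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    # Target noise power satisfies: p_speech / p_noise_scaled = 10^(snr_db / 10)
    scale = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]
```

At an SNR of 0 dB, the scaled noise carries exactly the same average power as the speech.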
- Data Preprocessing. Preprocess the augmented data to obtain the AFPC features. `<path/to/TIMIT_augmented/TRAIN>` is the output of the Data Preparation stage:

  ```bash
  python preprocess.py '<path/to/TIMIT_augmented/TRAIN>' -silence_pad 1
  ```
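Feature extraction of this kind typically begins by slicing the waveform into overlapping frames. Below is a minimal framing sketch; the window length of 512 and hop of 256 echo the hyper-parameters quoted in the notes at the end of this README, but the function itself is illustrative, not the actual `preprocess.py` code:

```python
def frame_signal(samples, win_len=512, hop=256):
    """Slice a 1-D list of samples into overlapping frames of length
    `win_len`, advancing by `hop` samples each time. Any tail shorter
    than a full window is dropped. Illustrative sketch only."""
    frames = []
    for start in range(0, len(samples) - win_len + 1, hop):
        frames.append(samples[start:start + win_len])
    return frames
```

With a 256-sample hop at 16 kHz, each frame advances the analysis by 16 ms.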
- Training.
  - Set up the hyper-parameters in `params.py`.
  - Run the training script:

    ```bash
    python train.py --train_data_path 'XXX'
    ```
- Inference:

  ```bash
  python inference.py --input_path './data_test/[NOISE]61-70968-0000_SNR(00)_airport.WAV' --checkpoint_path './checkpoint/weights_10_acc_97.09.pth'
  ```
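A VAD model of this kind emits per-frame speech probabilities, which are then thresholded into binary labels; a common extra step drops isolated speech runs that are too short to be real speech. The sketch below is a generic post-processing illustration, not the actual `inference.py` logic:

```python
def vad_labels(probs, threshold=0.5, min_speech_run=3):
    """Threshold per-frame speech probabilities into 0/1 labels, then
    zero out speech runs shorter than `min_speech_run` frames.
    Generic post-processing sketch; not the actual inference.py code."""
    raw = [1 if p >= threshold else 0 for p in probs]
    out = raw[:]
    i = 0
    while i < len(raw):
        if raw[i] == 1:
            j = i
            while j < len(raw) and raw[j] == 1:
                j += 1          # find the end of this speech run
            if j - i < min_speech_run:
                for k in range(i, j):
                    out[k] = 0  # drop runs that are too short
            i = j
        else:
            i += 1
    return out
```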
## Datasets

- TIMIT Dataset
- LibriSpeech Dataset
- Noise datasets, including `Noise15`, `Noisex92` and `Nonspeech` noises.
## Samples

- Sample 1: clean audio (from LibriSpeech test-clean `61-70968-0000.flac`) and the same audio with AURORA `airport` noise, SNR = 0 dB.
- Sample 2: clean audio (from TIMIT TEST, padded with 1 s of silence before and after the utterance) and the same audio with AURORA `airport` noise, SNR = 0 dB.
- Sample 3, from the paper: comparison of the performance of different VAD models.
- The model is trained with a mini-batch size of 512, a sampling rate of 16000 Hz and a window step of 256 samples (equivalent to 512 * 256 / 16000 ~= 8.2 seconds). To apply the model to scenarios with a higher sampling rate or much longer audio, you may need to adjust the batch size and learning rate and retrain the model.
- Evaluation of the model is also based on the parameter settings explained in the paper.
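The ~8.2-second figure quoted above follows directly from the hyper-parameters:

```python
# Duration covered by 512 frames with a 256-sample hop at 16 kHz,
# as quoted in the training notes above.
frames, hop, sample_rate = 512, 256, 16000
duration_s = frames * hop / sample_rate
print(duration_s)  # 8.192
```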
- GPU: CUDA memory >= 8 GB.
- Storage: >= 100 GB (for storing the training data).
## Citation

```bibtex
@INPROCEEDINGS{9943501,
  author={Zhao, Yifei and Champagne, Benoit},
  booktitle={2022 IEEE 32nd International Workshop on Machine Learning for Signal Processing (MLSP)},
  title={An Efficient Transformer-Based Model for Voice Activity Detection},
  year={2022},
  volume={},
  number={},
  pages={1-6},
  keywords={Voice activity detection;Convolution;Computational modeling;Machine learning;Fingerprint recognition;Predictive models;Transformers;Feature extraction;Computational complexity;Signal to noise ratio;Voice activity detection;transformer-based architecture;audio fingerprinting},
  doi={10.1109/MLSP55214.2022.9943501}}
```