# Official PyTorch Implementation of "An Efficient Transformer-Based Model for Voice Activity Detection"
Voice Activity Detection (VAD) aims to distinguish, at a given time, between desired speech and non-speech. Although many state-of-the-art approaches for improving VAD performance have been proposed, they are still not robust enough for adverse noise conditions with a low Signal-to-Noise Ratio (SNR). To address this issue, we propose a novel transformer-based architecture for VAD with reduced computational complexity, obtained by applying efficient depth-wise convolutions to feature patches. The proposed model, named Tr-VAD, demonstrates better performance than baseline methods from the literature across the scenarios considered, while using the smallest number of parameters. The results also indicate that using a combination of Audio Fingerprinting (AFP) features with Tr-VAD can guarantee better performance.
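The complexity saving from depth-wise (separable) convolutions can be seen with a simple parameter count. The channel and kernel sizes below are illustrative only, not the actual Tr-VAD configuration:

```python
# Parameter counts for a standard vs. a depth-wise separable convolution.
# Shapes are illustrative; they are not the actual Tr-VAD settings.

def standard_conv_params(c_in, c_out, k):
    """Standard 2-D convolution: every output channel sees every input channel."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    """Depth-wise conv (one k x k filter per input channel) followed by
    a 1x1 point-wise conv that mixes channels."""
    depthwise = c_in * k * k
    pointwise = c_in * c_out
    return depthwise + pointwise

c_in, c_out, k = 64, 64, 3
std = standard_conv_params(c_in, c_out, k)        # 64*64*9 = 36864
dws = depthwise_separable_params(c_in, c_out, k)  # 64*9 + 64*64 = 4672
print(std, dws, round(std / dws, 1))
```

For 64 channels and a 3x3 kernel, the separable form uses roughly 8x fewer parameters, which is the kind of saving the abstract refers to.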
- Requirement: Python version >= 3.7. Install the dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Download and unzip the datasets.
- Clone the repo:

  ```bash
  git clone https://github.com/Yifei-ZHAO96/Tr-VAD.git
  cd Tr-VAD
  ```
- Data Preparation:

  ```bash
  # requires the TIMIT dataset and a noise dataset to be downloaded and unzipped first
  python data_gen.py <path/to/TIMIT/dataset> <path/to/noise/dataset> -sr 16000 -silence_pad 1
  ```

  For more information, please read the instructions in `data_generation/README.md`.
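Data preparation of this kind mixes noise into clean speech at chosen SNR levels. The core of that augmentation can be sketched as follows; this is a generic illustration with a hypothetical function name, not the actual `data_gen.py` code:

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`,
    then add it to `speech`. Both are equal-length lists of float samples.
    Generic sketch; not the actual data_gen.py implementation."""
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    # Target noise power satisfies: p_speech / p_noise_scaled = 10^(snr_db / 10)
    scale = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]
```

At an SNR of 0 dB, the scaled noise carries exactly the same average power as the speech.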
- Data Preprocessing. Preprocess the augmented data to obtain the AFPC features. `<path/to/TIMIT_augmented/TRAIN>` is the output of the Data Preparation stage:

  ```bash
  python preprocess.py '<path/to/TIMIT_augmented/TRAIN>' -silence_pad 1
  ```
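Feature extraction of this kind typically begins by slicing the waveform into overlapping frames. Below is a minimal framing sketch; the window length of 512 and hop of 256 echo the hyper-parameters quoted in the notes at the end of this README, but the function itself is illustrative, not the actual `preprocess.py` code:

```python
def frame_signal(samples, win_len=512, hop=256):
    """Slice a 1-D list of samples into overlapping frames of length
    `win_len`, advancing by `hop` samples each time. Any tail shorter
    than a full window is dropped. Illustrative sketch only."""
    frames = []
    for start in range(0, len(samples) - win_len + 1, hop):
        frames.append(samples[start:start + win_len])
    return frames
```

With a 256-sample hop at 16 kHz, each frame advances the analysis by 16 ms.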
- Training.
  - Set up the hyper-parameters in `params.py`.
  - Run the training script:

    ```bash
    python train.py --train_data_path 'XXX'
    ```
- Inference:

  ```bash
  python inference.py --input_path './data_test/[NOISE]61-70968-0000_SNR(00)_airport.WAV' --checkpoint_path './checkpoint/weights_10_acc_97.09.pth'
  ```
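A VAD model of this kind emits per-frame speech probabilities, which are then thresholded into binary labels; a common extra step drops isolated speech runs that are too short to be real speech. The sketch below is a generic post-processing illustration, not the actual `inference.py` logic:

```python
def vad_labels(probs, threshold=0.5, min_speech_run=3):
    """Threshold per-frame speech probabilities into 0/1 labels, then
    zero out speech runs shorter than `min_speech_run` frames.
    Generic post-processing sketch; not the actual inference.py code."""
    raw = [1 if p >= threshold else 0 for p in probs]
    out = raw[:]
    i = 0
    while i < len(raw):
        if raw[i] == 1:
            j = i
            while j < len(raw) and raw[j] == 1:
                j += 1          # find the end of this speech run
            if j - i < min_speech_run:
                for k in range(i, j):
                    out[k] = 0  # drop runs that are too short
            i = j
        else:
            i += 1
    return out
```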
## Datasets

- TIMIT Dataset
- LibriSpeech Dataset
- Noise datasets, including `Noise15`, `Noisex92` and `Nonspeech` noises.
## Samples

- Sample 1: clean audio (from LibriSpeech test-clean `61-70968-0000.flac`) and the same audio with AURORA `airport` noise, SNR = 0 dB.
- Sample 2: clean audio (from TIMIT TEST, padded with 1 s of silence before and after the utterance) and the same audio with AURORA `airport` noise, SNR = 0 dB.
- Sample 3, from the paper: comparison of the performance of different VAD models.
- The model is trained with a mini-batch size of 512, a sampling rate of 16000 Hz and a window step of 256 samples (equivalent to 512 * 256 / 16000 ~= 8.2 seconds). To apply the model to scenarios with a higher sampling rate or much longer audio, you may need to adjust the batch size and learning rate and retrain the model.
- Evaluation of the model is also based on the parameter settings explained in the paper.
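The ~8.2-second figure quoted above follows directly from the hyper-parameters:

```python
# Duration covered by 512 frames with a 256-sample hop at 16 kHz,
# as quoted in the training notes above.
frames, hop, sample_rate = 512, 256, 16000
duration_s = frames * hop / sample_rate
print(duration_s)  # 8.192
```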
- GPU: CUDA memory >= 8 GB.
- Storage: >= 100 GB (for storing the training data).
## Citation

```bibtex
@INPROCEEDINGS{9943501,
  author={Zhao, Yifei and Champagne, Benoit},
  booktitle={2022 IEEE 32nd International Workshop on Machine Learning for Signal Processing (MLSP)},
  title={An Efficient Transformer-Based Model for Voice Activity Detection},
  year={2022},
  volume={},
  number={},
  pages={1-6},
  keywords={Voice activity detection;Convolution;Computational modeling;Machine learning;Fingerprint recognition;Predictive models;Transformers;Feature extraction;Computational complexity;Signal to noise ratio;Voice activity detection;transformer-based architecture;audio fingerprinting},
  doi={10.1109/MLSP55214.2022.9943501}}
```