The introduction, technical details, and results are presented in the wandb report.
To get started, install the requirements:

```bash
pip install -r ./requirements.txt
```

This project implements the SpEX+ architecture with an additional speaker classification head.
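Conceptually, such a head is a classifier on top of the speaker embedding produced by the speaker encoder. The sketch below is a simplified illustration under assumed dimensions (embedding size 256, 251 speakers as in LibriSpeech train-clean-100), not the exact implementation from this repository:

```python
import torch
import torch.nn as nn

class SpeakerClassificationHead(nn.Module):
    """Toy sketch: map a speaker embedding to speaker logits.

    The embedding dimension and speaker count are illustrative assumptions.
    """
    def __init__(self, emb_dim: int = 256, n_speakers: int = 251):
        super().__init__()
        self.proj = nn.Linear(emb_dim, n_speakers)

    def forward(self, speaker_emb: torch.Tensor) -> torch.Tensor:
        return self.proj(speaker_emb)  # (batch, n_speakers) logits

# Cross-entropy on these logits gives an auxiliary speaker loss.
head = SpeakerClassificationHead()
logits = head(torch.randn(4, 256))
```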
To train the model from scratch, run

```bash
python3 train.py -c final_model/config.json
```

To fine-tune a pretrained model from a checkpoint, pass the `--resume` parameter. For example, fine-tuning a pretrained model is launched as follows:

```bash
python3 train.py -c final_model/finetune.json -r saved/models/pretrain_final/<run_id>/model_best.pth
```

This command generates a new mixed dataset. This behavior can be disabled by passing `"reuse": true` for the train dataset in the config `final_model/finetune.json`.
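For instance, the relevant fragment of `final_model/finetune.json` might look roughly like this (the surrounding key names are an assumption about the config layout; only the `"reuse"` flag itself comes from this README):

```json
{
  "data": {
    "train": {
      "reuse": true
    }
  }
}
```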
Before running the model, the pretrained checkpoint is downloaded with the following Python code:

```python
import gdown

gdown.download("https://drive.google.com/uc?id=19i4NIk8R8AlkGvMfhQl8ex-eCg4g2Isv", "default_test_model/checkpoint.pth")
```
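Once downloaded, the checkpoint is a regular PyTorch file; inspecting it manually might look like this (the `"state_dict"` key is an assumption based on the common pytorch-template checkpoint layout, not verified against this repository):

```python
import torch

# Minimal sketch: load the checkpoint on CPU and see what it stores.
checkpoint = torch.load("default_test_model/checkpoint.pth", map_location="cpu")
print(checkpoint.keys())
state_dict = checkpoint["state_dict"]  # weights to pass to model.load_state_dict(...)
```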
Model evaluation is executed with the command

```bash
python test.py \
   -c default_test_model/config.json \
   -r default_test_model/checkpoint.pth \
   -t test_data \
   -o test_result.csv \
   -g <output_dir> \
   -s <interval_len>
```

Here `-o` specifies the output .csv file, which reports the following metrics (a short computation sketch follows the list):
- PESQ (Perceptual Evaluation of Speech Quality)
- SI-SDR (Scale-Invariant Signal-to-Distortion Ratio)
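
For intuition, SI-SDR can be computed directly, and PESQ is available via the `pesq` package (assuming 16 kHz audio; the package and the wide-band mode are assumptions, not dependencies guaranteed by `requirements.txt`):

```python
import numpy as np
from pesq import pesq  # pip install pesq

def si_sdr(estimate: np.ndarray, target: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-Invariant SDR in dB: project the estimate onto the target first."""
    target = target - target.mean()
    estimate = estimate - estimate.mean()
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = alpha * target
    noise = estimate - projection
    return 10 * np.log10(np.dot(projection, projection) / (np.dot(noise, noise) + eps))

# Hypothetical 16 kHz signals: `ref` is the clean target, `est` the extracted audio.
sr = 16000
ref = np.random.randn(sr).astype(np.float32)
est = ref + 0.1 * np.random.randn(sr).astype(np.float32)
print("SI-SDR:", si_sdr(est, ref))
print("PESQ:", pesq(sr, ref, est, "wb"))  # wide-band mode
```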
The following sections describe the remaining command-line arguments.
Important remark: the further experiments require the test-clean part of the LibriSpeech dataset.
It will be downloaded automatically after at least one run of `python3 test.py` with default arguments.
Model evaluation on custom data is conducted by running `test.py`:

```bash
python3 test.py -t path/to/custom/dir
```

This command runs the model on a custom dataset folder, which must contain `mix`, `target`, and `ref` subdirectories with filenames `*-mixed.wav`, `*-target.wav`, and `*-ref.wav` respectively.
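For example, a valid custom directory might look like this (the file stems are hypothetical; only the subdirectory names and filename suffixes matter):

```
path/to/custom/dir/
├── mix/
│   └── 0001-mixed.wav
├── target/
│   └── 0001-target.wav
└── ref/
    └── 0001-ref.wav
```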
Such a directory will be created in `data/datasets/mixed/data/custom` from the mixed test-clean dataset after running

```bash
bash custom_set.sh
```

The extracted audio for the test set can be gathered into one directory by executing
```bash
python3 test.py -g path/to/output/dir
```

These results can be compared with direct speech recognition on the mixed audio. The speech recognition pipeline was taken from the asr repository. The comparison of mixed and extracted audio quality is carried out by
```bash
bash asr_score.sh
```

For training stability, the audio was split into 3-second intervals. However, the test data contains audio of arbitrary length, which can be divided into intervals at the inference stage with
```bash
python3 test.py -s <interval_len_in_seconds>
```
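Conceptually, this just chunks the waveform into fixed-length pieces. A minimal sketch (zero-padding the final chunk is an assumption about how the edge case might be handled):

```python
import torch

def split_into_intervals(wave: torch.Tensor, interval_len: float, sr: int = 16000):
    """Split a 1-D waveform into fixed-length chunks, zero-padding the last one."""
    chunk = int(interval_len * sr)
    n_chunks = (wave.numel() + chunk - 1) // chunk
    padded = torch.nn.functional.pad(wave, (0, n_chunks * chunk - wave.numel()))
    return padded.reshape(n_chunks, chunk)

# Hypothetical 10-second waveform split into 3-second intervals -> 4 chunks.
chunks = split_into_intervals(torch.randn(10 * 16000), interval_len=3.0)
print(chunks.shape)  # torch.Size([4, 48000])
```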
The WHAM! dataset provides diverse background noise, which can also be mixed with the input audio. Installation:

```bash
wget https://my-bucket-a8b4b49c25c811ee9a7e8bba05fa24c7.s3.amazonaws.com/wham_noise.zip
unzip wham_noise.zip
```
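As background, mixing noise into a clean signal at a chosen SNR typically scales the noise first. A hedged numpy sketch (this generic SNR handling is an assumption, not necessarily what this project's mixing code does):

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add noise to speech at a given SNR (same length and sample rate assumed)."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-8
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Hypothetical 16 kHz signals mixed at 5 dB SNR.
sr = 16000
speech = np.random.randn(3 * sr).astype(np.float32)
noise = np.random.randn(3 * sr).astype(np.float32)
noisy = mix_at_snr(speech, noise, snr_db=5.0)
```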
Creating the noised dataset and evaluating the model:

```bash
python3 test.py -c wham_test/config.json
```

The described pipeline only involves the `tt` (test) part of WHAM!, therefore the other directories are not required.
This repository is based on the asr-template repository.