This repository contains the code for the paper titled "Synergistic audio pre-processing and neural architecture design maximizes performance".
Python version: 3.10.12
To set up the environment, run the following commands:
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install nni torch torchvision torchaudio pytorch_lightning fcwt matplotlib wget
```

The datasets are downloaded automatically the first time the run_experiment.py script is run on a specific dataset.
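For example, the Speech Commands download on a first run boils down to a torchaudio call along these lines (a minimal sketch; the `./data` root is a placeholder, and the actual download logic lives in run_experiment.py):

```python
import torchaudio

# download=True fetches and extracts the archive on first use;
# later runs reuse the cached copy under the given root directory.
train_set = torchaudio.datasets.SPEECHCOMMANDS("./data", download=True)

waveform, sample_rate, label, *_ = train_set[0]
print(sample_rate, label)  # 16000 and a keyword such as "backward"
```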
Here, we trained MobileNetV2, MobileNetV3-Small, and MobileNetV3-Large together with a fixed preprocessing (N_FFT = 25 ms, HOP_LENGTH = 10 ms, N_MELS = 64); a sketch of the equivalent transform is shown after the command below:
```bash
python baselines.py --dataset [speech_commands, vocal_sound, spoken100] --model [mobilenetv2, mobilenetv3small, mobilenetv3large]
```
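For reference, at the 16 kHz sampling rate of Speech Commands those window settings translate into the following torchaudio transform (an illustrative sketch; the actual transform is defined in baselines.py):

```python
import torch
import torchaudio

SAMPLE_RATE = 16_000                    # Speech Commands audio is 16 kHz
N_FFT = int(0.025 * SAMPLE_RATE)        # 25 ms window -> 400 samples
HOP_LENGTH = int(0.010 * SAMPLE_RATE)   # 10 ms hop    -> 160 samples
N_MELS = 64

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=N_FFT,
    hop_length=HOP_LENGTH,
    n_mels=N_MELS,
)

waveform = torch.randn(1, SAMPLE_RATE)  # one second of dummy audio
spec = mel(waveform)
print(spec.shape)                       # torch.Size([1, 64, 101])
```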
BC-ResNet-8 is a state-of-the-art keyword spotting architecture:

```bash
python sota_baselines.py --dataset speech_commands --model bcresnet8
```

EfficientNet-B0 is a general-purpose efficient CNN baseline:

```bash
python sota_baselines.py --dataset [vocal_sound, spoken100] --model efficientnet-b0
```

Results are saved to `results/baselines/{dataset}/{model}/` with:
- `model.pth` - best model checkpoint
- `val_accs.csv` - validation accuracies per epoch
- `best_val_accuracy.txt` - best validation accuracy
- `test_accuracy.txt` - final test accuracy
- `test_accuracies.json` - summary with mean/std across 5 seeds
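To inspect a finished run, the summary file can be loaded like this (a small sketch; the path is an example, and the JSON keys are whatever the training scripts write):

```python
import json
from pathlib import Path

# Example path; substitute the dataset/model combination you trained.
path = Path("results/baselines/speech_commands/mobilenetv2/test_accuracies.json")

with path.open() as f:
    summary = json.load(f)

# Prints the stored summary, e.g. the mean/std test accuracy across 5 seeds.
print(json.dumps(summary, indent=2))
```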
To reproduce our results, you can execute the following steps:
To run the OptModel experiment, use the following command:
```bash
python run_experiment.py --experiment 1 --dataset [speech_commands, vocal_sound, spoken100]
```

To run the OptPre experiment, use the following command:

```bash
python run_experiment.py --experiment 2 --dataset [speech_commands, vocal_sound, spoken100] --model [mobilenetv2, mobilenetv3small, mobilenetv3large]
```

To run the OptBoth experiment, use the following command:

```bash
python run_experiment.py --experiment 3 --dataset [speech_commands, vocal_sound, spoken100]
```
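The experiment numbers map onto what is optimized: experiment 1 searches the architecture (OptModel), experiment 2 searches the preprocessing for the fixed model passed via `--model` (OptPre), and experiment 3 searches both jointly (OptBoth). Since the search is driven by NNI, the OptPre space can be pictured as a standard NNI search space; the sketch below is purely illustrative (the parameter names and ranges are assumptions, and the real space is defined in run_experiment.py):

```python
# Hypothetical NNI-style search space over the preprocessing knobs.
# Illustrative only; the actual space is defined in run_experiment.py.
search_space = {
    "n_fft_ms": {"_type": "choice", "_value": [10, 25, 40]},
    "hop_length_ms": {"_type": "choice", "_value": [5, 10, 20]},
    "n_mels": {"_type": "choice", "_value": [32, 64, 128]},
}
```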