CRNN-based audio tagging for environmental sound classification.
This repository extends the original ksanjeevan/crnn-audio-classification with:
- Support for ESC-50 dataset (Piczak, 2015) in addition to the original UrbanSound8K dataset
- Unified codebase for both ESC-50 and UrbanSound8K (dynamic num_classes)
- Full k-fold cross-validation pipeline (5-fold ESC-50, 10-fold UrbanSound8K)
- Reproducible multi-seed training via SLURM array jobs
- Automated hyperparameter optimization via Optuna TPE sampler
- Batch evaluation, post-processing, and statistical comparison scripts
This repository accompanies the master's thesis "Comparative Analysis of Deep Learning Architectures for Audio Tagging and Sound Event Detection" by Yu Chia Kuo, McGill University, 2026.
CRNN: 3-layer CNN feature extractor → 2-layer LSTM → Linear classifier
```
AudioCRNN(
  (spec): MelspectrogramStretch(num_bands=128, fft_len=2048, norm=spec_whiten)
  (net):
    (convs): Conv2d(1→32→64→64) + BN + ELU + MaxPool + Dropout(0.1) × 3
    (recur): LSTM(128, 64, num_layers=2)
    (dense): Dropout(0.3) + BN1d + Linear(64 → num_classes)
)
```
Trainable parameters: ~256K (10-class) / ~261K (50-class)
Architecture defined in crnn.cfg via torchparse. The final Linear layer output size is dynamically set to num_classes at runtime.
```
crnn-at/
├── configs/
│   ├── esc_folds/
│   │   ├── baseline/          config_fold{1-5}.json
│   │   ├── baseline_with_es/  config_fold{1-5}.json
│   │   └── optuna_top1/       fold{1-5}_TOP1.json
│   └── urban_folds/
│       ├── baseline/          config_fold{1-10}.json
│       ├── baseline_with_es/  config_fold{1-10}.json
│       └── optuna_top1/       fold{1-10}_TOP1.json
├── crnn-audio-classification/
│   ├── data/                      CSVDataManager, transforms
│   ├── eval/                      ClassificationEvaluator
│   ├── net/                       AudioCRNN, loss, metrics
│   ├── train/                     Trainer
│   ├── utils/
│   ├── optuna/                    HPO scripts
│   ├── run.py                     Main entry point (train / eval)
│   ├── batch_eval.py              Batch evaluation across folds
│   ├── convert_all_recursive.py   Checkpoint conversion
│   ├── crnn_post_local.py         Post-processing (summary, confusion matrix)
│   ├── compare_baseline_tuned.py  Statistical comparison
│   └── crnn.cfg                   Network architecture definition
├── data/                  Symlinks to dataset directories
├── torchaudio-contrib/    Audio processing utilities
├── run_array.sh           SLURM array job script
├── run_optuna.sh          SLURM Optuna HPO job script
└── requirements.txt
```
```bash
python3.11 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# torchaudio-contrib (from included subdir)
cd torchaudio-contrib && pip install -e . && cd ..

# torchparse (architecture parser)
pip install git+https://github.com/ksanjeevan/torchparse.git
```

Download and extract the datasets, then create symlinks:

```bash
mkdir -p data
ln -s /path/to/ESC-50-master data/ESC-50-master
ln -s /path/to/UrbanSound8K data/UrbanSound8K
```

Update the configs/ JSON files if your dataset paths differ from ../data/.
Create an env.sh file (not tracked by git):

```bash
cat << 'EOF' > env.sh
export VENV_PATH=/path/to/venv/bin/activate
export PROJECT_ROOT=/path/to/crnn-at
EOF
```

Source it before submitting SLURM jobs:

```bash
source env.sh
```

To train a single fold locally:

```bash
cd crnn-audio-classification

# ESC-50 fold 1, baseline
python run.py train -c ../configs/esc_folds/baseline/config_fold1.json --cfg crnn.cfg

# ESC-50 fold 1, Optuna TOP1, seed=0
python run.py train -c ../configs/esc_folds/optuna_top1/fold1_TOP1.json --cfg crnn.cfg --seed 0
```
To launch full cross-validation as SLURM array jobs:

```bash
cd /path/to/crnn-at
source env.sh

# Baseline — all folds
DATASET=esc sbatch --array=1-5 run_array.sh
DATASET=urban sbatch --array=1-10 run_array.sh

# Optuna TOP1 — 3 seeds
for seed in 0 1 2; do
  DATASET=esc CONFIG_TYPE=optuna_top1 SEED=$seed sbatch --array=1-5 run_array.sh
  DATASET=urban CONFIG_TYPE=optuna_top1 SEED=$seed sbatch --array=1-10 run_array.sh
done
```
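run_array.sh maps the SLURM array task ID to a per-fold config file. The sketch below illustrates that mapping based on the configs/ layout shown earlier; the helper name and the exact logic inside run_array.sh are assumptions, not the script's actual code.

```python
def config_path(dataset: str, config_type: str, fold: int) -> str:
    """Return the JSON config for one fold (fold = SLURM_ARRAY_TASK_ID).

    dataset: "esc" or "urban"; config_type: "baseline", "baseline_with_es",
    or "optuna_top1" (which uses a different filename pattern).
    """
    if config_type == "optuna_top1":
        name = f"fold{fold}_TOP1.json"
    else:
        name = f"config_fold{fold}.json"
    return f"configs/{dataset}_folds/{config_type}/{name}"

print(config_path("esc", "baseline", 3))
print(config_path("urban", "optuna_top1", 7))
```

The DATASET, CONFIG_TYPE, and SEED environment variables in the sbatch calls above select the corresponding arguments.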
After training, run the evaluation pipeline:

```bash
cd crnn-audio-classification

# 1. Convert checkpoints
python convert_all_recursive.py

# 2. Batch evaluation
python batch_eval.py --base_dir saved_cv/esc/baseline --out eval_results/esc_baseline

# 3. Post-processing (summary CSV, pooled confusion matrix)
python crnn_post_local.py --dataset esc50 --eval_dir eval_results/esc_baseline

# 4. Statistical comparison (baseline vs tuned)
python compare_baseline_tuned.py
```

To evaluate a single converted checkpoint:

```bash
python run.py eval -r path/to/model_clean.pth
```

All experiments use 3 random seeds (0, 1, 2) per fold. Macro-F1 is reported per seed and as a 3-seed average. Optuna hyperparameters were tuned on ESC-50 fold 1 only (30 TPE trials), then transferred unchanged to all folds and to UrbanSound8K without dataset-specific retuning.
Summary (3-seed average Macro-F1 across folds):

| Dataset | Baseline F1 | Tuned F1 | Δ F1 |
|---|---|---|---|
| ESC-50 (5-fold) | 0.692 ± 0.036 | 0.705 ± 0.034 | +0.012 |
| UrbanSound8K (10-fold) | 0.742 ± 0.034 | 0.765 ± 0.050 | +0.023 |
ESC-50 baseline, per-fold Macro-F1:

| Fold | Seed 0 | Seed 1 | Seed 2 | 3-seed avg |
|---|---|---|---|---|
| 1 | 0.6923 | 0.6618 | 0.6792 | 0.6778 |
| 2 | 0.6918 | 0.6895 | 0.6771 | 0.6861 |
| 3 | 0.7022 | 0.6820 | 0.6937 | 0.6926 |
| 4 | 0.7541 | 0.7448 | 0.7557 | 0.7515 |
| 5 | 0.6403 | 0.6515 | 0.6710 | 0.6543 |
| Mean ± Std | 0.696 ± 0.041 | 0.686 ± 0.036 | 0.695 ± 0.035 | 0.692 ± 0.036 |
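The Mean ± Std row can be reproduced from the per-fold 3-seed averages using a sample (n−1) standard deviation, e.g. for the last column of the table above:

```python
from statistics import mean, stdev

# 3-seed average Macro-F1 per ESC-50 baseline fold (last column above)
fold_avgs = [0.6778, 0.6861, 0.6926, 0.7515, 0.6543]

m = mean(fold_avgs)          # mean over the 5 folds
s = stdev(fold_avgs)         # sample std (n-1), matching the table
print(f"{m:.3f} ± {s:.3f}")  # 0.692 ± 0.036
```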
ESC-50 Optuna TOP1 (tuned), per-fold Macro-F1:

| Fold | Seed 0 | Seed 1 | Seed 2 | 3-seed avg |
|---|---|---|---|---|
| 1 | 0.707 | 0.677 | 0.693 | 0.692 |
| 2 | 0.703 | 0.682 | 0.695 | 0.693 |
| 3 | 0.705 | 0.699 | 0.711 | 0.705 |
| 4 | 0.769 | 0.762 | 0.751 | 0.761 |
| 5 | 0.674 | 0.675 | 0.663 | 0.671 |
| Mean ± Std | 0.712 ± 0.035 | 0.699 ± 0.036 | 0.703 ± 0.032 | 0.705 ± 0.034 |
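compare_baseline_tuned.py performs the actual statistical comparison; purely as an illustration of the paired, fold-wise view it takes, the ESC-50 Δ F1 in the summary can be recovered from the two per-fold tables above:

```python
from statistics import mean

# 3-seed average Macro-F1 per ESC-50 fold (from the tables above)
baseline = [0.6778, 0.6861, 0.6926, 0.7515, 0.6543]
tuned    = [0.692, 0.693, 0.705, 0.761, 0.671]

# Paired per-fold improvements; their mean is the reported Δ F1
deltas = [t - b for b, t in zip(baseline, tuned)]
print(round(mean(deltas), 3))  # 0.012
```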
UrbanSound8K baseline, per-fold Macro-F1:

| Fold | Seed 0 | Seed 1 | Seed 2 | 3-seed avg |
|---|---|---|---|---|
| 1 | 0.7337 | 0.7243 | 0.7072 | 0.7217 |
| 2 | 0.7113 | 0.7356 | 0.7399 | 0.7289 |
| 3 | 0.6707 | 0.6886 | 0.6677 | 0.6757 |
| 4 | 0.6955 | 0.7018 | 0.7233 | 0.7069 |
| 5 | 0.7825 | 0.7625 | 0.7560 | 0.7670 |
| 6 | 0.7792 | 0.7417 | 0.7864 | 0.7691 |
| 7 | 0.7506 | 0.7623 | 0.7338 | 0.7489 |
| 8 | 0.7693 | 0.7530 | 0.7572 | 0.7598 |
| 9 | 0.7370 | 0.7588 | 0.7565 | 0.7508 |
| 10 | 0.7893 | 0.8004 | 0.7865 | 0.7921 |
| Mean ± Std | 0.742 ± 0.040 | 0.743 ± 0.033 | 0.741 ± 0.036 | 0.742 ± 0.034 |
UrbanSound8K Optuna TOP1 (tuned), per-fold Macro-F1:

| Fold | Seed 0 | Seed 1 | Seed 2 | 3-seed avg |
|---|---|---|---|---|
| 1 | 0.711 | 0.729 | 0.722 | 0.721 |
| 2 | 0.727 | 0.748 | 0.749 | 0.741 |
| 3 | 0.658 | 0.680 | 0.692 | 0.677 |
| 4 | 0.745 | 0.719 | 0.723 | 0.729 |
| 5 | 0.837 | 0.842 | 0.867 | 0.849 |
| 6 | 0.765 | 0.802 | 0.761 | 0.776 |
| 7 | 0.791 | 0.782 | 0.762 | 0.778 |
| 8 | 0.783 | 0.756 | 0.780 | 0.773 |
| 9 | 0.776 | 0.792 | 0.791 | 0.786 |
| 10 | 0.823 | 0.808 | 0.820 | 0.817 |
| Mean ± Std | 0.762 ± 0.054 | 0.766 ± 0.048 | 0.767 ± 0.051 | 0.765 ± 0.050 |
Optuna search space and best values (tuned on ESC-50 fold 1):

| Parameter | Search Range | Best Value |
|---|---|---|
| lr | [5e-4, 5e-3] log | 9.680e-4 |
| weight_decay | [1e-6, 1e-3] log | 9.302e-4 |
| step_size | [6, 12] int | 10 |
| gamma | [0.5, 0.9] | 0.699 |
| epochs | fixed | 50 |
| early_stop | fixed (disabled) | — |
HPO: Optuna TPE sampler (Akiba et al., 2019), seed=42, 30 trials, maximize eval macro-F1.
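The lr, step_size, and gamma values combine into a step-decay learning-rate schedule. A quick sketch, assuming PyTorch StepLR semantics (multiply lr by gamma every step_size epochs) and the best values from the table above:

```python
# Best Optuna values from the table above
LR0, STEP_SIZE, GAMMA = 9.680e-4, 10, 0.699

def lr_at(epoch: int) -> float:
    """Learning rate under a StepLR schedule: lr0 * gamma**(epoch // step_size)."""
    return LR0 * GAMma ** (epoch // STEP_SIZE) if False else LR0 * GAMMA ** (epoch // STEP_SIZE)

# lr stays at 9.680e-4 for epochs 0-9, then decays by 0.699 every 10 epochs
for e in (0, 10, 20, 49):
    print(e, f"{lr_at(e):.3e}")
```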
- Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. (2019). Optuna: A Next-Generation Hyperparameter Optimization Framework. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2623–2631.
- Bergstra, J., & Bengio, Y. (2012). Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(2), 281–305.
- Kingma, D. P., & Ba, J. (2017). Adam: A Method for Stochastic Optimization. https://arxiv.org/abs/1412.6980
- Piczak, K. J. (2015). ESC: Dataset for environmental sound classification. Proceedings of the 23rd ACM International Conference on Multimedia, 1015–1018.
- Salamon, J., Jacoby, C., & Bello, J. P. (2014). A dataset and taxonomy for urban sound research. Proceedings of the 22nd ACM International Conference on Multimedia, 1041–1044.
- Original codebase: ksanjeevan/crnn-audio-classification (MIT License)
This project extends ksanjeevan/crnn-audio-classification, originally released under the MIT License. See crnn-audio-classification/LICENSE for details.