crnn-at: CRNN for Environmental Sound Classification

CRNN-based audio tagging for environmental sound classification.

This repository extends the original ksanjeevan/crnn-audio-classification with:

  • Support for the ESC-50 dataset (Piczak, 2015) in addition to the original UrbanSound8K dataset (Salamon et al., 2014)
  • Unified codebase for both ESC-50 and UrbanSound8K (dynamic num_classes)
  • Full k-fold cross-validation pipeline (5-fold ESC-50, 10-fold UrbanSound8K)
  • Reproducible multi-seed training via SLURM array jobs
  • Automated hyperparameter optimization via Optuna TPE sampler
  • Batch evaluation, post-processing, and statistical comparison scripts

This repository accompanies the master's thesis "Comparative Analysis of Deep Learning Architectures for Audio Tagging and Sound Event Detection" by Yu Chia Kuo, McGill University, 2026.


Model Architecture

CRNN: 3-layer CNN feature extractor → 2-layer LSTM → Linear classifier

AudioCRNN(
  (spec): MelspectrogramStretch(num_bands=128, fft_len=2048, norm=spec_whiten)
  (net):
    (convs): Conv2d(1→32→64→64) + BN + ELU + MaxPool + Dropout(0.1) × 3
    (recur): LSTM(128, 64, num_layers=2)
    (dense): Dropout(0.3) + BN1d + Linear(64 → num_classes)
)
Trainable parameters: ~256K (10-class) / ~261K (50-class)

The architecture is defined in crnn.cfg and parsed with torchparse. The output size of the final Linear layer is set dynamically to num_classes at runtime.
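To illustrate what the dynamic head means in parameter terms, the count for the final Linear(64 → num_classes) layer alone can be worked out with standard fully connected layer arithmetic (weights plus biases). This is a minimal sketch, not code from this repository, and the head is only a small part of the overall ~256K/~261K totals:

```python
def linear_params(in_features: int, num_classes: int) -> int:
    """Parameter count of a fully connected layer: a weight matrix
    of shape (num_classes, in_features) plus one bias per class."""
    return in_features * num_classes + num_classes

# Head sizes for the two datasets, with 64 features from the LSTM:
print(linear_params(64, 10))  # 650  (UrbanSound8K, 10 classes)
print(linear_params(64, 50))  # 3250 (ESC-50, 50 classes)
```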


Repository Structure

crnn-at/
├── configs/
│   ├── esc_folds/
│   │   ├── baseline/             config_fold{1-5}.json
│   │   ├── baseline_with_es/     config_fold{1-5}.json
│   │   └── optuna_top1/          fold{1-5}_TOP1.json
│   └── urban_folds/
│       ├── baseline/             config_fold{1-10}.json
│       ├── baseline_with_es/     config_fold{1-10}.json
│       └── optuna_top1/          fold{1-10}_TOP1.json
├── crnn-audio-classification/
│   ├── data/                     CSVDataManager, transforms
│   ├── eval/                     ClassificationEvaluator
│   ├── net/                      AudioCRNN, loss, metrics
│   ├── train/                    Trainer
│   ├── utils/
│   ├── optuna/                   HPO scripts
│   ├── run.py                    Main entry point (train / eval)
│   ├── batch_eval.py             Batch evaluation across folds
│   ├── convert_all_recursive.py  Checkpoint conversion
│   ├── crnn_post_local.py        Post-processing (summary, confusion matrix)
│   ├── compare_baseline_tuned.py Statistical comparison
│   └── crnn.cfg                  Network architecture definition
├── data/                         Symlinks to dataset directories
├── torchaudio-contrib/           Audio processing utilities
├── run_array.sh                  SLURM array job script
├── run_optuna.sh                 SLURM Optuna HPO job script
└── requirements.txt

Setup

1. Create virtual environment

python3.11 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

2. Install additional dependencies

# torchaudio-contrib (from included subdir)
cd torchaudio-contrib && pip install -e . && cd ..

# torchparse (architecture parser)
pip install git+https://github.com/ksanjeevan/torchparse.git

3. Prepare datasets

Download and extract the datasets, then create symlinks:

mkdir -p data
ln -s /path/to/ESC-50-master data/ESC-50-master
ln -s /path/to/UrbanSound8K  data/UrbanSound8K

Update configs/ JSON files if your dataset paths differ from ../data/.

4. (Optional) Configure environment variables for SLURM

Create an env.sh file (not tracked by git):

cat << 'EOF' > env.sh
export VENV_PATH=/path/to/venv/bin/activate
export PROJECT_ROOT=/path/to/crnn-at
EOF

Source it before submitting SLURM jobs:

source env.sh

Usage

Training (single run)

cd crnn-audio-classification

# ESC-50 fold 1, baseline
python run.py train -c ../configs/esc_folds/baseline/config_fold1.json --cfg crnn.cfg

# ESC-50 fold 1, Optuna TOP1, seed=0
python run.py train -c ../configs/esc_folds/optuna_top1/fold1_TOP1.json --cfg crnn.cfg --seed 0

Training (SLURM array jobs)

cd /path/to/crnn-at
source env.sh

# Baseline — all folds
DATASET=esc   sbatch --array=1-5  run_array.sh
DATASET=urban sbatch --array=1-10 run_array.sh

# Optuna TOP1 — 3 seeds
for seed in 0 1 2; do
    DATASET=esc   CONFIG_TYPE=optuna_top1 SEED=$seed sbatch --array=1-5  run_array.sh
    DATASET=urban CONFIG_TYPE=optuna_top1 SEED=$seed sbatch --array=1-10 run_array.sh
done

Evaluation pipeline

cd crnn-audio-classification

# 1. Convert checkpoints
python convert_all_recursive.py

# 2. Batch evaluation
python batch_eval.py --base_dir saved_cv/esc/baseline --out eval_results/esc_baseline

# 3. Post-processing (summary CSV, pooled confusion matrix)
python crnn_post_local.py --dataset esc50 --eval_dir eval_results/esc_baseline

# 4. Statistical comparison (baseline vs tuned)
python compare_baseline_tuned.py

Evaluation (single checkpoint)

python run.py eval -r path/to/model_clean.pth

Results

All experiments use 3 random seeds (0, 1, 2) per fold. Macro-F1 is reported per seed and as the 3-seed average. Optuna hyperparameters were tuned on ESC-50 fold 1 only (30 TPE trials), then transferred unchanged to all other folds and to UrbanSound8K without dataset-specific retuning.
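For reference, macro-F1 is the unweighted mean of per-class F1 scores, so every class counts equally regardless of how many examples it has. A self-contained sketch of the metric (illustrative only, not the repository's metrics code):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes that appear
    in either the labels or the predictions."""
    classes = set(y_true) | set(y_pred)
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)
```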

Summary

Dataset                  Baseline F1      Tuned F1         Δ F1
ESC-50 (5-fold)          0.692 ± 0.036    0.705 ± 0.034    +0.012
UrbanSound8K (10-fold)   0.742 ± 0.034    0.765 ± 0.050    +0.023

ESC-50 — Baseline (Macro-F1, 3-seed)

Fold         Seed 0          Seed 1          Seed 2          3-seed avg
1            0.6923          0.6618          0.6792          0.6778
2            0.6918          0.6895          0.6771          0.6861
3            0.7022          0.6820          0.6937          0.6926
4            0.7541          0.7448          0.7557          0.7515
5            0.6403          0.6515          0.6710          0.6543
Mean ± Std   0.696 ± 0.041   0.686 ± 0.036   0.695 ± 0.035   0.692 ± 0.036
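As a quick sanity check, the "Mean ± Std" column above can be reproduced from the 3-seed fold averages using only the standard library, assuming the ± figure is the sample standard deviation across folds:

```python
from statistics import mean, stdev

# ESC-50 baseline, 3-seed average per fold (folds 1-5, from the table above).
fold_avgs = [0.6778, 0.6861, 0.6926, 0.7515, 0.6543]

print(f"{mean(fold_avgs):.3f} ± {stdev(fold_avgs):.3f}")  # 0.692 ± 0.036
```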

ESC-50 — Tuned (Macro-F1, 3-seed)

Fold         Seed 0          Seed 1          Seed 2          3-seed avg
1            0.707           0.677           0.693           0.692
2            0.703           0.682           0.695           0.693
3            0.705           0.699           0.711           0.705
4            0.769           0.762           0.751           0.761
5            0.674           0.675           0.663           0.671
Mean ± Std   0.712 ± 0.035   0.699 ± 0.036   0.703 ± 0.032   0.705 ± 0.034

UrbanSound8K — Baseline (Macro-F1, 3-seed)

Fold         Seed 0          Seed 1          Seed 2          3-seed avg
1            0.7337          0.7243          0.7072          0.7217
2            0.7113          0.7356          0.7399          0.7289
3            0.6707          0.6886          0.6677          0.6757
4            0.6955          0.7018          0.7233          0.7069
5            0.7825          0.7625          0.7560          0.7670
6            0.7792          0.7417          0.7864          0.7691
7            0.7506          0.7623          0.7338          0.7489
8            0.7693          0.7530          0.7572          0.7598
9            0.7370          0.7588          0.7565          0.7508
10           0.7893          0.8004          0.7865          0.7921
Mean ± Std   0.742 ± 0.040   0.743 ± 0.033   0.741 ± 0.036   0.742 ± 0.034

UrbanSound8K — Tuned (Macro-F1, 3-seed)

Fold         Seed 0          Seed 1          Seed 2          3-seed avg
1            0.711           0.729           0.722           0.721
2            0.727           0.748           0.749           0.741
3            0.658           0.680           0.692           0.677
4            0.745           0.719           0.723           0.729
5            0.837           0.842           0.867           0.849
6            0.765           0.802           0.761           0.776
7            0.791           0.782           0.762           0.778
8            0.783           0.756           0.780           0.773
9            0.776           0.792           0.791           0.786
10           0.823           0.808           0.820           0.817
Mean ± Std   0.762 ± 0.054   0.766 ± 0.048   0.767 ± 0.051   0.765 ± 0.050

Optimized Hyperparameters

Parameter      Search Range         Best Value
lr             [5e-4, 5e-3], log    9.680e-4
weight_decay   [1e-6, 1e-3], log    9.302e-4
step_size      [6, 12], int         10
gamma          [0.5, 0.9]           0.699
epochs         fixed                50
early_stop     fixed                off

HPO: Optuna TPE sampler (Akiba et al., 2019), seed=42, 30 trials, maximize eval macro-F1.
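A "log" search range means values are drawn uniformly in log space, so each decade of the range gets equal probability mass (e.g. learning rates near 5e-4 are as likely as ones near 5e-3). A self-contained sketch of log-uniform sampling, illustrative only and not Optuna's internals:

```python
import math
import random

def sample_loguniform(low: float, high: float, rng: random.Random) -> float:
    """Draw uniformly on [log(low), log(high)], then exponentiate back."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

# Example: sample a learning rate from the lr search range [5e-4, 5e-3].
rng = random.Random(42)
lr = sample_loguniform(5e-4, 5e-3, rng)
```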


References

  • Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. (2019). Optuna: A Next-Generation Hyperparameter Optimization Framework. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2623–2631.
  • Bergstra, J., & Bengio, Y. (2012). Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(2), 281–305.
  • Kingma, D. P., & Ba, J. (2017). Adam: A Method for Stochastic Optimization. https://arxiv.org/abs/1412.6980
  • Piczak, K. J. (2015). ESC: Dataset for environmental sound classification. Proceedings of the 23rd ACM International Conference on Multimedia, 1015–1018.
  • Salamon, J., Jacoby, C., & Bello, J. P. (2014). A dataset and taxonomy for urban sound research. Proceedings of the 22nd ACM International Conference on Multimedia, 1041–1044.
  • Original codebase: ksanjeevan/crnn-audio-classification (MIT License)

License

This project extends ksanjeevan/crnn-audio-classification, originally released under the MIT License. See crnn-audio-classification/LICENSE for details.
