CRNN-based audio tagging for environmental sound classification.
This repository extends the original ksanjeevan/crnn-audio-classification with:
- Support for ESC-50 dataset (Piczak, 2015) in addition to the original UrbanSound8K dataset
- Unified codebase for both ESC-50 and UrbanSound8K (dynamic num_classes)
- Full k-fold cross-validation pipeline (5-fold ESC-50, 10-fold UrbanSound8K)
- Reproducible multi-seed training via SLURM array jobs
- Automated hyperparameter optimization via Optuna TPE sampler
- Batch evaluation, post-processing, and statistical comparison scripts
This repository accompanies the master's thesis "Comparative Analysis of Deep Learning Architectures for Audio Tagging and Sound Event Detection" by Yu Chia Kuo, McGill University, 2026.
CRNN: 3-layer CNN feature extractor → 2-layer LSTM → Linear classifier
```
AudioCRNN(
  (spec): MelspectrogramStretch(num_bands=128, fft_len=2048, norm=spec_whiten)
  (net):
    (convs): Conv2d(1→32→64→64) + BN + ELU + MaxPool + Dropout(0.1) × 3
    (recur): LSTM(128, 64, num_layers=2)
    (dense): Dropout(0.3) + BN1d + Linear(64 → num_classes)
)
```
Trainable parameters: ~256K (10-class) / ~261K (50-class)
Architecture defined in crnn.cfg via torchparse. The final Linear layer output size is dynamically set to num_classes at runtime.
```
crnn-at/
├── configs/
│   ├── esc_folds/
│   │   ├── baseline/          config_fold{1-5}.json
│   │   ├── baseline_with_es/  config_fold{1-5}.json
│   │   └── optuna_top1/       fold{1-5}_TOP1.json
│   └── urban_folds/
│       ├── baseline/          config_fold{1-10}.json
│       ├── baseline_with_es/  config_fold{1-10}.json
│       └── optuna_top1/       fold{1-10}_TOP1.json
├── crnn-audio-classification/
│   ├── data/                      CSVDataManager, transforms
│   ├── eval/                      ClassificationEvaluator
│   ├── net/                       AudioCRNN, loss, metrics
│   ├── train/                     Trainer
│   ├── utils/
│   ├── optuna/                    HPO scripts
│   ├── run.py                     Main entry point (train / eval)
│   ├── batch_eval.py              Batch evaluation across folds
│   ├── convert_all_recursive.py   Checkpoint conversion
│   ├── crnn_post_local.py         Post-processing (summary, confusion matrix)
│   ├── compare_baseline_tuned.py  Statistical comparison
│   └── crnn.cfg                   Network architecture definition
├── data/                  Symlinks to dataset directories
├── torchaudio-contrib/    Audio processing utilities
├── run_array.sh           SLURM array job script
├── run_optuna.sh          SLURM Optuna HPO job script
└── requirements.txt
```
```bash
python3.11 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# torchaudio-contrib (from included subdir)
cd torchaudio-contrib && pip install -e . && cd ..

# torchparse (architecture parser)
pip install git+https://github.com/ksanjeevan/torchparse.git
```

Download and extract the datasets, then create symlinks:

```bash
mkdir -p data
ln -s /path/to/ESC-50-master data/ESC-50-master
ln -s /path/to/UrbanSound8K data/UrbanSound8K
```

Update the configs/ JSON files if your dataset paths differ from ../data/.
Create an env.sh file (not tracked by git):

```bash
cat << 'EOF' > env.sh
export VENV_PATH=/path/to/venv/bin/activate
export PROJECT_ROOT=/path/to/crnn-at
EOF
```

Source it before submitting SLURM jobs:

```bash
source env.sh
```

To train a single fold locally:

```bash
cd crnn-audio-classification

# ESC-50 fold 1, baseline
python run.py train -c ../configs/esc_folds/baseline/config_fold1.json --cfg crnn.cfg

# ESC-50 fold 1, Optuna TOP1, seed=0
python run.py train -c ../configs/esc_folds/optuna_top1/fold1_TOP1.json --cfg crnn.cfg --seed 0
```
To launch full cross-validation as SLURM array jobs:

```bash
cd /path/to/crnn-at
source env.sh

# Baseline — all folds
DATASET=esc sbatch --array=1-5 run_array.sh
DATASET=urban sbatch --array=1-10 run_array.sh

# Optuna TOP1 — 3 seeds
for seed in 0 1 2; do
  DATASET=esc CONFIG_TYPE=optuna_top1 SEED=$seed sbatch --array=1-5 run_array.sh
  DATASET=urban CONFIG_TYPE=optuna_top1 SEED=$seed sbatch --array=1-10 run_array.sh
done
```
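run_array.sh maps the SLURM array task ID to a per-fold config file. The sketch below illustrates that mapping based on the configs/ layout shown earlier; the helper name and the exact logic inside run_array.sh are assumptions, not the script's actual code.

```python
def config_path(dataset: str, config_type: str, fold: int) -> str:
    """Return the JSON config for one fold (fold = SLURM_ARRAY_TASK_ID).

    dataset: "esc" or "urban"; config_type: "baseline", "baseline_with_es",
    or "optuna_top1" (which uses a different filename pattern).
    """
    if config_type == "optuna_top1":
        name = f"fold{fold}_TOP1.json"
    else:
        name = f"config_fold{fold}.json"
    return f"configs/{dataset}_folds/{config_type}/{name}"

print(config_path("esc", "baseline", 3))
print(config_path("urban", "optuna_top1", 7))
```

The DATASET, CONFIG_TYPE, and SEED environment variables in the sbatch calls above select the corresponding arguments.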
After training, run the evaluation pipeline:

```bash
cd crnn-audio-classification

# 1. Convert checkpoints
python convert_all_recursive.py

# 2. Batch evaluation
python batch_eval.py --base_dir saved_cv/esc/baseline --out eval_results/esc_baseline

# 3. Post-processing (summary CSV, pooled confusion matrix)
python crnn_post_local.py --dataset esc50 --eval_dir eval_results/esc_baseline

# 4. Statistical comparison (baseline vs tuned)
python compare_baseline_tuned.py
```

To evaluate a single converted checkpoint:

```bash
python run.py eval -r path/to/model_clean.pth
```

All experiments use 3 random seeds (0, 1, 2) per fold. Macro-F1 is reported per seed and as a 3-seed average. Optuna hyperparameters were tuned on ESC-50 fold 1 only (30 TPE trials), then transferred unchanged to all folds and to UrbanSound8K without dataset-specific retuning.
Summary (3-seed average Macro-F1 across folds):

| Dataset | Baseline F1 | Tuned F1 | Δ F1 |
|---|---|---|---|
| ESC-50 (5-fold) | 0.692 ± 0.036 | 0.705 ± 0.034 | +0.012 |
| UrbanSound8K (10-fold) | 0.742 ± 0.034 | 0.765 ± 0.050 | +0.023 |
ESC-50 baseline, per-fold Macro-F1:

| Fold | Seed 0 | Seed 1 | Seed 2 | 3-seed avg |
|---|---|---|---|---|
| 1 | 0.6923 | 0.6618 | 0.6792 | 0.6778 |
| 2 | 0.6918 | 0.6895 | 0.6771 | 0.6861 |
| 3 | 0.7022 | 0.6820 | 0.6937 | 0.6926 |
| 4 | 0.7541 | 0.7448 | 0.7557 | 0.7515 |
| 5 | 0.6403 | 0.6515 | 0.6710 | 0.6543 |
| Mean ± Std | 0.696 ± 0.041 | 0.686 ± 0.036 | 0.695 ± 0.035 | 0.692 ± 0.036 |
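The Mean ± Std row can be reproduced from the per-fold 3-seed averages using a sample (n−1) standard deviation, e.g. for the last column of the table above:

```python
from statistics import mean, stdev

# 3-seed average Macro-F1 per ESC-50 baseline fold (last column above)
fold_avgs = [0.6778, 0.6861, 0.6926, 0.7515, 0.6543]

m = mean(fold_avgs)          # mean over the 5 folds
s = stdev(fold_avgs)         # sample std (n-1), matching the table
print(f"{m:.3f} ± {s:.3f}")  # 0.692 ± 0.036
```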
ESC-50 Optuna TOP1 (tuned), per-fold Macro-F1:

| Fold | Seed 0 | Seed 1 | Seed 2 | 3-seed avg |
|---|---|---|---|---|
| 1 | 0.707 | 0.677 | 0.693 | 0.692 |
| 2 | 0.703 | 0.682 | 0.695 | 0.693 |
| 3 | 0.705 | 0.699 | 0.711 | 0.705 |
| 4 | 0.769 | 0.762 | 0.751 | 0.761 |
| 5 | 0.674 | 0.675 | 0.663 | 0.671 |
| Mean ± Std | 0.712 ± 0.035 | 0.699 ± 0.036 | 0.703 ± 0.032 | 0.705 ± 0.034 |
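compare_baseline_tuned.py performs the actual statistical comparison; purely as an illustration of the paired, fold-wise view it takes, the ESC-50 Δ F1 in the summary can be recovered from the two per-fold tables above:

```python
from statistics import mean

# 3-seed average Macro-F1 per ESC-50 fold (from the tables above)
baseline = [0.6778, 0.6861, 0.6926, 0.7515, 0.6543]
tuned    = [0.692, 0.693, 0.705, 0.761, 0.671]

# Paired per-fold improvements; their mean is the reported Δ F1
deltas = [t - b for b, t in zip(baseline, tuned)]
print(round(mean(deltas), 3))  # 0.012
```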
UrbanSound8K baseline, per-fold Macro-F1:

| Fold | Seed 0 | Seed 1 | Seed 2 | 3-seed avg |
|---|---|---|---|---|
| 1 | 0.7337 | 0.7243 | 0.7072 | 0.7217 |
| 2 | 0.7113 | 0.7356 | 0.7399 | 0.7289 |
| 3 | 0.6707 | 0.6886 | 0.6677 | 0.6757 |
| 4 | 0.6955 | 0.7018 | 0.7233 | 0.7069 |
| 5 | 0.7825 | 0.7625 | 0.7560 | 0.7670 |
| 6 | 0.7792 | 0.7417 | 0.7864 | 0.7691 |
| 7 | 0.7506 | 0.7623 | 0.7338 | 0.7489 |
| 8 | 0.7693 | 0.7530 | 0.7572 | 0.7598 |
| 9 | 0.7370 | 0.7588 | 0.7565 | 0.7508 |
| 10 | 0.7893 | 0.8004 | 0.7865 | 0.7921 |
| Mean ± Std | 0.742 ± 0.040 | 0.743 ± 0.033 | 0.741 ± 0.036 | 0.742 ± 0.034 |
UrbanSound8K Optuna TOP1 (tuned), per-fold Macro-F1:

| Fold | Seed 0 | Seed 1 | Seed 2 | 3-seed avg |
|---|---|---|---|---|
| 1 | 0.711 | 0.729 | 0.722 | 0.721 |
| 2 | 0.727 | 0.748 | 0.749 | 0.741 |
| 3 | 0.658 | 0.680 | 0.692 | 0.677 |
| 4 | 0.745 | 0.719 | 0.723 | 0.729 |
| 5 | 0.837 | 0.842 | 0.867 | 0.849 |
| 6 | 0.765 | 0.802 | 0.761 | 0.776 |
| 7 | 0.791 | 0.782 | 0.762 | 0.778 |
| 8 | 0.783 | 0.756 | 0.780 | 0.773 |
| 9 | 0.776 | 0.792 | 0.791 | 0.786 |
| 10 | 0.823 | 0.808 | 0.820 | 0.817 |
| Mean ± Std | 0.762 ± 0.054 | 0.766 ± 0.048 | 0.767 ± 0.051 | 0.765 ± 0.050 |
Optuna search space and best values (tuned on ESC-50 fold 1):

| Parameter | Search Range | Best Value |
|---|---|---|
| lr | [5e-4, 5e-3] log | 9.680e-4 |
| weight_decay | [1e-6, 1e-3] log | 9.302e-4 |
| step_size | [6, 12] int | 10 |
| gamma | [0.5, 0.9] | 0.699 |
| epochs | fixed | 50 |
| early_stop | fixed (disabled) | — |
HPO: Optuna TPE sampler (Akiba et al., 2019), seed=42, 30 trials, maximize eval macro-F1.
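The lr, step_size, and gamma values combine into a step-decay learning-rate schedule. A quick sketch, assuming PyTorch StepLR semantics (multiply lr by gamma every step_size epochs) and the best values from the table above:

```python
# Best Optuna values from the table above
LR0, STEP_SIZE, GAMMA = 9.680e-4, 10, 0.699

def lr_at(epoch: int) -> float:
    """Learning rate under a StepLR schedule: lr0 * gamma**(epoch // step_size)."""
    return LR0 * GAMma ** (epoch // STEP_SIZE) if False else LR0 * GAMMA ** (epoch // STEP_SIZE)

# lr stays at 9.680e-4 for epochs 0-9, then decays by 0.699 every 10 epochs
for e in (0, 10, 20, 49):
    print(e, f"{lr_at(e):.3e}")
```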
- Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. (2019). Optuna: A Next-Generation Hyperparameter Optimization Framework. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2623–2631.
- Bergstra, J., & Bengio, Y. (2012). Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(2), 281–305.
- Kingma, D. P., & Ba, J. (2017). Adam: A Method for Stochastic Optimization. https://arxiv.org/abs/1412.6980
- Piczak, K. J. (2015). ESC: Dataset for environmental sound classification. Proceedings of the 23rd ACM International Conference on Multimedia, 1015–1018.
- Salamon, J., Jacoby, C., & Bello, J. P. (2014). A dataset and taxonomy for urban sound research. Proceedings of the 22nd ACM International Conference on Multimedia, 1041–1044.
- Original codebase: ksanjeevan/crnn-audio-classification (MIT License)
This project extends ksanjeevan/crnn-audio-classification, originally released under the MIT License. See crnn-audio-classification/LICENSE for details.