Author: Mark Slipenkyi
Date: 2025
Task: Multi-label audio classification and time-frequency event detection.
This project implements a Convolutional Recurrent Neural Network (CRNN) to identify and localize bird species in audio recordings. The model processes log-mel spectrograms to provide:
- Species Identification: Probability scores across 28 bird species.
- Temporal Localization: Identification of when a bird is singing.
- Frequency Localization: Identification of the frequency range (Hertz) of the vocalization.
The data is sourced from the Fully-annotated soundscape recordings from the Southwestern Amazon Basin.
- Audio: 21 hours of 48 kHz recordings (resampled to 32 kHz).
- Labels: 14,798 manually annotated time-frequency bounding boxes.
- Scope: Focused on the 28 most frequent species (those with >150 occurrences) to mitigate class imbalance.
ββββsrc
β β data_review.ipynb # Exploratory Data Analysis
β β train_iso.ipynb # Isolated training experiments
β ββββclasses
β β BirdDataset.py # Custom PyTorch Dataset (Log-Mel conversion)
β β BirdModels.py # CRNN Architecture (CNN + GRU)
β ββββcommon
β β metrics.py # Macro-Averaged F1, Dice, and Precision/Recall
β β plotting.py # Result visualization
β ββββscripts
β train_config.json # Hyperparameters
β train_script.py # Training pipeline in script form
ββββstable_logs # Saved metrics and training plots
The model utilizes a CNN backbone for spatial feature extraction from spectrograms, followed by Gated Recurrent Units (GRU) to capture the temporal rhythm of bird songs.
- Primary Metric: Macro-Averaged F1 Score.
- Loss Function:
BCEWithLogitsLosswith positive weights for sparse labels.
The model demonstrates relative precision in identifying most recurring species in various environments with multiple overlapping vocalizations.
To improve the signal-to-noise ratio, the training dataset was augmented by removing background noise in Adobe Audition.
Because bird vocalizations are sparse, all classes were boosted during training using pos_weight in the BCEWithLogitsLoss function.
Below are examples of the model's ability to detect species like the Thrush-like Wren and Amazonian Grosbeak.

