This project implements a deep learning pipeline for speech and music dereverberation using a U-Net / GAN architecture. It is designed to enhance audio quality by removing reverberation while maintaining perceptual fidelity.
Key features include:
- DSP preprocessing: High-Pass Filter (HPF) and Weighted Prediction Error (WPE) dereverberation
- Spectrogram conversion for deep learning models
- Generator & Discriminator (GAN) training with adversarial + L1 loss
- Evaluation metrics: PESQ (speech quality) and SDR (Signal-to-Distortion Ratio, for music fidelity)
- Model complexity analysis: GMACs calculation for efficiency
Install the required Python packages for audio processing, model training, and evaluation:
- PyTorch
- Librosa
- Pyroomacoustics
- Pesq & Pystoi
- Ptflops
- Torch-AudioAugmentations
Ensure your environment supports GPU acceleration for faster training.
- Installations: Set up dependencies and libraries.
- Imports: Load core libraries like PyTorch, NumPy, Pandas, Librosa, and Pyroomacoustics.
- Config Class: Defines dataset paths, sample rate, FFT parameters, WPE settings, and model hyperparameters.
- DSP Front-End: Implements high-pass filtering, WPE dereverberation, and full DSP preprocessing.
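The DSP front-end described above might be sketched as follows: a zero-phase Butterworth high-pass via SciPy, with WPE delegated to the `nara_wpe` package. The function names `highpass` and `wpe_dereverb` and the cutoff/taps defaults are illustrative choices, not the notebook's actual API.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def highpass(wav, sr, cutoff=80.0, order=4):
    """Zero-phase Butterworth high-pass to strip low-frequency rumble."""
    sos = butter(order, cutoff, btype="highpass", fs=sr, output="sos")
    return sosfiltfilt(sos, wav)

def wpe_dereverb(stft_frames, taps=10, delay=3, iterations=3):
    """WPE dereverberation on STFT frames of shape (freq, channels, time).
    Delegates to the nara_wpe package (an optional project dependency)."""
    from nara_wpe.wpe import wpe  # lazy import so the HPF works without it
    return wpe(stft_frames, taps=taps, delay=delay, iterations=iterations)
```

Running the high-pass before WPE keeps DC drift and rumble from biasing the linear prediction filters.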
- Models:
  - TinyUNet for lightweight evaluation
  - GeneratorUNet + PatchGANDiscriminator for adversarial training
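A minimal sketch of what the adversarial pair could look like; layer counts and widths here are placeholders, far shallower than a production U-Net:

```python
import torch
import torch.nn as nn

class GeneratorUNet(nn.Module):
    """Tiny encoder-decoder with one skip connection over magnitude spectrograms."""
    def __init__(self, ch=16):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(1, ch, 4, 2, 1), nn.LeakyReLU(0.2))
        self.enc2 = nn.Sequential(nn.Conv2d(ch, ch * 2, 4, 2, 1),
                                  nn.BatchNorm2d(ch * 2), nn.LeakyReLU(0.2))
        self.dec1 = nn.Sequential(nn.ConvTranspose2d(ch * 2, ch, 4, 2, 1),
                                  nn.BatchNorm2d(ch), nn.ReLU())
        self.dec2 = nn.ConvTranspose2d(ch * 2, 1, 4, 2, 1)  # skip-concat doubles channels

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        d1 = self.dec1(e2)
        return self.dec2(torch.cat([d1, e1], dim=1))

class PatchGANDiscriminator(nn.Module):
    """Scores overlapping patches of the (input, estimate) pair, not the whole image."""
    def __init__(self, ch=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, ch, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(ch, ch * 2, 4, 2, 1), nn.BatchNorm2d(ch * 2), nn.LeakyReLU(0.2),
            nn.Conv2d(ch * 2, 1, 4, 1, 1),  # one logit per patch
        )

    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=1))
```

Conditioning the discriminator on the reverberant input (the two-channel concat) follows the pix2pix recipe, so it judges input-output consistency rather than output realism alone.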
- Dataset Class: Loads paired reverberant and clean audio, computes STFT, and returns spectrograms & raw waveforms.
- GAN Training Loop:
  - Trains the generator and discriminator
  - Includes validation at each epoch
  - Saves final weights for inference
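One step of that loop might look like the sketch below. It assumes an LSGAN-style MSE adversarial loss and the pix2pix convention of `lambda_l1 = 100`; both are common defaults, not necessarily the notebook's exact settings.

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_g, opt_d, reverb_spec, clean_spec, lambda_l1=100.0):
    """One combined generator/discriminator update with adversarial + L1 loss."""
    fake = G(reverb_spec)

    # --- discriminator: push real pairs toward 1, fake pairs toward 0 ---
    opt_d.zero_grad()
    d_real = D(reverb_spec, clean_spec)
    d_fake = D(reverb_spec, fake.detach())  # detach: no generator gradients here
    loss_d = 0.5 * (F.mse_loss(d_real, torch.ones_like(d_real))
                    + F.mse_loss(d_fake, torch.zeros_like(d_fake)))
    loss_d.backward()
    opt_d.step()

    # --- generator: fool D while staying close to the clean target ---
    opt_g.zero_grad()
    d_fake = D(reverb_spec, fake)
    loss_g = (F.mse_loss(d_fake, torch.ones_like(d_fake))
              + lambda_l1 * F.l1_loss(fake, clean_spec))
    loss_g.backward()
    opt_g.step()
    return loss_g.item(), loss_d.item()
```

The L1 term does most of the dereverberation work; the adversarial term mainly sharpens spectral detail that L1 alone tends to blur.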
- Audio Playback & Visualization: Converts predicted spectrograms back to audio, plays audio samples, and visualizes spectrograms.
- Model Complexity (GMACs): Calculates GMACs and parameter count for 1-second audio input.
- Evaluation Metrics: Computes PESQ for speech quality and SDR for music quality.
- Provide a CSV file listing paths to reverberant and clean WAV files.
- Ensure folder structure is correct and paths are updated in the configuration.
- Run the GAN training pipeline to optimize both generator and discriminator networks.
- The process includes epoch-wise validation and saving final weights.
- Convert predicted spectrograms to waveform audio.
- Play original reverberant, predicted dereverberated, and clean ground truth audio.
- Plot spectrogram comparisons.
- Evaluate GMACs and parameters to ensure the model fits within the computational budget.
- PESQ: Objective measure of speech quality
- SDR: Signal-to-Distortion Ratio for music quality
Metrics are automatically computed for validation datasets.
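The two metrics might be computed as follows; the plain energy-ratio SDR shown here is a simple variant (toolkits such as `mir_eval` use a projection-based definition), and `speech_pesq` wraps the `pesq` package:

```python
import numpy as np

def sdr_db(clean, estimate, eps=1e-10):
    """Plain Signal-to-Distortion Ratio in dB (higher is better)."""
    noise = clean - estimate
    return 10 * np.log10((np.sum(clean**2) + eps) / (np.sum(noise**2) + eps))

def speech_pesq(clean, estimate, sr=16000):
    """Wideband PESQ (roughly -0.5 to 4.5) via the `pesq` package."""
    from pesq import pesq  # lazy import: project dependency
    return pesq(sr, clean, estimate, "wb")
```

PESQ only supports 8 kHz (narrowband) and 16 kHz (wideband) input, so resample first if the pipeline runs at another rate.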
- GMACs: Model efficiency (budget < 50 GMAC/s)
- PESQ: Perceptual evaluation of speech quality
- SDR: Signal-to-Distortion Ratio for music fidelity
- Original Reverberant Audio: Input audio with room reverberation
- Predicted Dereverberated Audio: Model output
- Clean Ground Truth Audio: Reference audio
- Spectrogram Plots: Input, predicted, and target comparison
- Dataset path verification
- GMACs check (<50)
- Training and validation loss monitoring
- PESQ and SDR evaluation for performance benchmarking