A demo of the model's performance is available on the docs site.
- Model: DenoiseNet, a U-Net mask estimator trained on log-magnitude STFT features with multi-objective losses (BCE on IBM, L1 on linear and mel magnitudes, waveform L1).
- Training corpus: ~2.5 hours of English speech mixed with babble noise at controlled SNR.
- Inference: streaming-ready; runs in (near) real time on a standard laptop CPU.
- Results: consistent SNR improvements on the test set, with subjective quality gains; near state-of-the-art performance among lightweight denoising models.
- Training script: src/training/train.py.
- Inference script: src/inference/inference.py.
- Configuration centralised in src/utils/constants.py.
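The multi-objective loss can be sketched as a weighted sum of its four terms. The function name, signature, and weight names below are illustrative only; the actual normalization, and how `LAMBDA`, `GAMMA`, `OMEGA`, `ZETA`, and `ALPHA` map onto the terms, live in src/training/train.py and src/utils/constants.py.

```python
import torch
import torch.nn.functional as F

def combined_loss(mask_pred, ibm, mag_pred, mag_clean,
                  mel_pred, mel_clean, wav_pred, wav_clean,
                  lam=1.0, gamma=1.0, omega=1.0, zeta=1.0):
    """Illustrative multi-objective loss: BCE on the ideal binary mask,
    L1 on linear and mel magnitudes, and L1 on the waveform."""
    l_bce = F.binary_cross_entropy(mask_pred, ibm)   # mask vs. IBM target
    l_lin = F.l1_loss(mag_pred, mag_clean)           # linear-magnitude L1
    l_mel = F.l1_loss(mel_pred, mel_clean)           # mel-magnitude L1
    l_wav = F.l1_loss(wav_pred, wav_clean)           # waveform L1
    return lam * l_bce + gamma * l_lin + omega * l_mel + zeta * l_wav
```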
- Data: data/train (speech and noise `.pt` tensors), data/test (raw `.wav`, converted speech `.pt`, enhanced outputs); pre-trained weights in data/models.
- Source: src with subpackages `training`, `inference`, `models`, `utils`.
- Experiments: experiments/logs (CSV logs) and experiments/checkpoints (epoch checkpoints).
- Scripts: scripts/convert_wav.py for test-time WAV→PT conversion.
- Docs site (static demo): docs.
- Recommended: Python 3.10+ with `torch`, `torchaudio`, `speechbrain`, `numpy`, `scipy`, `tqdm`, and `matplotlib` (optional, for debugging).
- Example (from the repo root): `python -m venv .venv && source .venv/bin/activate && pip install torch torchaudio speechbrain numpy scipy tqdm matplotlib`
- Imports use package-style paths (e.g., `from utils.constants import *` in src/training/train.py). Running as a module from inside `src` ensures Python resolves these packages without manual `PYTHONPATH` edits.
- Commands (run from the `src` directory):
  - Training: `python -m training.train`
  - Inference: `python -m inference.inference`
- If you prefer running from the repo root, set `PYTHONPATH=src` (e.g., `PYTHONPATH=src python -m training.train`).
- Training expects int16 tensors: `data/train/speech/*.pt` and `data/train/noise/*.pt`. Files provided in the repository follow this format.
- Test audio conversion: place `.wav` files in data/test/raw and run `python -m scripts.convert_wav` from the repo root. Converted `.pt` files are written to data/test/speech.
- Mel filterbank: precomputed at src/training/mel_fb_512_80_16000.pt. If you change the FFT or mel settings, regenerate it with src/utils/create_filterbank.py.
- Edit src/utils/constants.py to change:
  - Data paths (`ROOT`, `CLEAN_DIR`, `NOISE_DIR`, test directories).
  - STFT params (`N_FFT`, `HOP_LENGTH`, `WIN_LENGTH`, `N_MELS`).
  - Training params (`EPOCHS`, `BATCH_SIZE`, `LEARNING_RATE`, loss weights `LAMBDA`, `GAMMA`, `OMEGA`, `ZETA`, mel weight `ALPHA`).
  - Phase reconstruction (`PHASE_MODE` in {raw, GL, vocoder} and `GL_ITERS`).
  - Logging/output toggles (`SAVE_DENOISED`, `SAVE_NOISY`) and model selection (`MODEL_NAME`).
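For orientation, a fragment of what such a constants module might look like. `N_FFT = 512`, `N_MELS = 80`, and the 16 kHz rate are implied by the precomputed filterbank filename; the other values here are hypothetical, so check src/utils/constants.py for the real defaults.

```python
# Illustrative constants fragment -- values are examples, not the repo's defaults.
SAMPLE_RATE = 16000  # implied by mel_fb_512_80_16000.pt
N_FFT = 512          # FFT size (matches the precomputed filterbank)
HOP_LENGTH = 128     # hypothetical
WIN_LENGTH = 512     # hypothetical
N_MELS = 80          # matches the precomputed filterbank
PHASE_MODE = "raw"   # one of {"raw", "GL", "vocoder"}
GL_ITERS = 32        # hypothetical Griffin-Lim iteration count
```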
- Working directory: `src` (module mode).
- Command: `python -m training.train`
- The script prompts for `session_name` (used to namespace logs and checkpoints).
- Internals (see src/training/train.py):
  - Dataset: SpeechNoiseDataset mixes clean speech and noise at `TARGET_SNR`, computing log-magnitude features, IBM labels, and phases.
  - Dataloaders: random 85/15 train/val split with seed 42; padding via utils/pad_collate.py.
  - Model: DenoiseNet, implemented in models/DenoiseUNet.py, predicting time-frequency masks.
  - Loss: normalized multi-term loss combining BCE, linear L1, mel L1, and waveform L1.
  - Checkpoints: saved each epoch to experiments/checkpoints/<session_name>. Final weights: data/models/<session_name>.pth.
  - Logs: per-epoch CSV at experiments/logs/<session_name>/training_log.csv.
- To resume or continue with different hyperparameters, edit `constants.py`; keep the same `session_name` to append to the existing logs, or choose a new one to avoid overwriting them.
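Mixing at a target SNR and deriving IBM labels can be sketched as follows; the function names and signatures are illustrative stand-ins for what SpeechNoiseDataset does internally in src/training/train.py:

```python
import torch

def mix_at_snr(speech: torch.Tensor, noise: torch.Tensor, snr_db: float):
    """Scale the noise so the mixture reaches a target SNR in dB
    (as the dataset does with TARGET_SNR). Inputs are float waveforms
    of equal length."""
    p_s = speech.pow(2).mean()
    p_n = noise.pow(2).mean().clamp_min(1e-12)   # avoid division by zero
    scale = torch.sqrt(p_s / (p_n * 10 ** (snr_db / 10)))
    return speech + scale * noise

def ideal_binary_mask(mag_clean: torch.Tensor, mag_noise: torch.Tensor):
    """IBM label: 1 where speech dominates a time-frequency bin."""
    return (mag_clean > mag_noise).float()
```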
- Working directory: `src` (module mode).
- Ensure `MODEL_NAME` in src/utils/constants.py points to a weight file in data/models (e.g., `waveform-3.pth`).
- Ensure test `.pt` speech files exist in data/test/speech; use the conversion script if starting from `.wav`.
- Command: `python -m inference.inference`
- Internals (see src/inference/inference.py):
  - Loads `SpeechNoiseDataset` in `test` mode (adds filenames), batch size 1 with padding.
  - Predicts the mask and reconstructs the magnitude; phase via `PHASE_MODE` (`raw` uses the mixture phase and `torch.istft`, `GL` uses Griffin-Lim, `vocoder` is a placeholder and not implemented).
  - Saves enhanced (and optionally noisy) audio to data/test/enhanced and logs per-file SNR to experiments/logs/<MODEL_NAME>/inference_snr_log.csv.
  - Reports per-file and average inference time.
- Randomness: the validation split uses a fixed seed (42); other loaders follow PyTorch defaults (set `torch.manual_seed` externally if stricter determinism is required).
- Data: training uses the provided `.pt` tensors; ensure any new data follows the same int16 tensor convention and sample rate (`SAMPLE_RATE` in constants).
- Hyperparameters and architecture: fully specified in src/utils/constants.py and src/models/DenoiseUNet.py.
- Artifacts: checkpoints and logs are versioned by session/model names; retain these along with the exact `constants.py` snapshot to reproduce results.
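A seeded split matching the description above can be sketched with `torch.utils.data.random_split`; the toy dataset here is a stand-in for SpeechNoiseDataset, and the global seed value is arbitrary:

```python
import torch
from torch.utils.data import TensorDataset, random_split

torch.manual_seed(0)  # global seed for model init, shuffling, etc. (any value)

# Stand-in dataset; the repository uses SpeechNoiseDataset instead.
dataset = TensorDataset(torch.randn(100, 8))
n_train = int(0.85 * len(dataset))

# Fixed-generator split reproduces the 85/15 validation split with seed 42.
train_set, val_set = random_split(
    dataset, [n_train, len(dataset) - n_train],
    generator=torch.Generator().manual_seed(42))
```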
- Import errors (e.g., `No module named utils`): run commands from `src` or set `PYTHONPATH=src`.
- Missing data: verify `.pt` files in data/train/speech and data/train/noise; for test, populate data/test/raw and reconvert.
- Phase mode errors: `PHASE_MODE='vocoder'` is not implemented; use `raw` or `GL`.
If you build on this work, please cite the repository and describe DenoiseNet as “a U-Net mask-based speech denoising model trained with combined BCE, linear/mel L1, and waveform losses.”