wkzng/iSincNet

iSincNet (Lightweight Sincnet Spectrogram Vocoder)

[Blog] [Original SincNet Paper (M. Ravanelli, Y. Bengio)]

iSincNet is a fast and lightweight SincNet spectrogram vocoder: a neural network trained to reconstruct audio waveforms from their SincNet spectrogram (a real, signed 2D representation). We used the GTZAN dataset, the most widely used public dataset for evaluation in machine listening research on music genre recognition (MGR). Its files were collected in 2000-2001 from a variety of sources, including personal CDs, radio, and microphone recordings, in order to represent a variety of recording conditions (http://marsyas.info/downloads/datasets.html).

Datasets used during development:

Example Spectrogram

The first 5 seconds of the audio file audio/invertibility/15033000.mp3

| | Non-causal Encoder | Causal Encoder |
|---|---|---|
| signed values | [image: non-causal 15033000] | [image: causal 15033000] |
| abs values | [image: non-causal 15033000] | [image: causal 15033000] |

Effect of applying sincnet envelope

As discussed in Section 2.1, SincNet can be recast as a standard wavelet transform whose envelope is defined by a sinc function that depends explicitly on the bandwidth: envelope(x, B) = sinc(B x / 2). As a consequence, the original cosine and sine components of the filter are modulated (see the example below, which shows causal filters).

| Kernel | index=10 | index=104 |
|---|---|---|
| Without Sinc Envelope | [image] | [image] |
| With Sinc Envelope | [image] | [image] |

At lower frequencies (low kernel indices), the effect of the sinc envelope is negligible, whereas at higher frequencies it forces the filter to be more localised.
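The envelope formula above can be sketched numerically. The snippet below is a minimal illustration, not the repository's implementation: it assumes the unnormalised convention sinc(u) = sin(u) / u, angular frequencies in rad/s, and hypothetical centre frequencies and bandwidths chosen only to contrast a narrow-band (low index) kernel with a wide-band (high index) one. The function names are made up for this example.

```python
import numpy as np

def sinc_envelope(x, bandwidth):
    """envelope(x, B) = sinc(B * x / 2), with sinc(u) = sin(u) / u (assumed convention)."""
    u = bandwidth * np.asarray(x, dtype=float) / 2.0
    # np.sinc(t) is the normalised sinc sin(pi t) / (pi t), hence the division by pi.
    return np.sinc(u / np.pi)

def modulated_pair(x, center, bandwidth):
    """Cosine and sine components multiplied by the sinc envelope."""
    x = np.asarray(x, dtype=float)
    env = sinc_envelope(x, bandwidth)
    return env * np.cos(center * x), env * np.sin(center * x)

def main_lobe_half_width(bandwidth):
    """First zero of sinc(B x / 2): B x / 2 = pi, so x = 2 pi / B."""
    return 2.0 * np.pi / bandwidth

# Hypothetical values for illustration only (not taken from the trained filters).
x = np.linspace(-0.01, 0.01, 2001)  # 20 ms kernel support, in seconds
low_cos, low_sin = modulated_pair(x, center=2 * np.pi * 200, bandwidth=2 * np.pi * 50)
high_cos, high_sin = modulated_pair(x, center=2 * np.pi * 4000, bandwidth=2 * np.pi * 800)

# Narrow bandwidth: the main lobe (about 0.02 s) is wider than the window,
# so the envelope barely changes the kernel.
print(main_lobe_half_width(2 * np.pi * 50))
# Wide bandwidth: the main lobe (about 0.00125 s) sits well inside the window,
# so the kernel becomes localised around x = 0.
print(main_lobe_half_width(2 * np.pi * 800))
```

Under these assumptions, the first zero of the envelope sits at x = 2π / B, so a larger bandwidth gives a shorter main lobe. That is one way to read the localisation visible in the higher-index kernels.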

🎧 Pretrained Models

The following table summarizes the key characteristics and access points for the available pretrained models. All models are open-source and stored in the pretrained/ folder.

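| Sample Rate | FPS | #Bins | Weights | Corpus | Causal Encoder | Scale | Sinc Envelope | Open-Source |
|---|---|---|---|---|---|---|---|---|
| 16000 | 128 | 128 | 📦 | GTZAN | | Linear | | |
| 16000 | 128 | 128 | 📦 | GTZAN | | Linear | | |
| 16000 | 128 | 128 | 📦 | GTZAN | | Mel | | |
| 16000 | 128 | 256 | 📦 | GTZAN | | Mel | | |
| 16000 | 128 | 512 | 📦 | GTZAN | | Mel | | |
| 16000 | 128 | 128 | 📦 | GTZAN | | Mel | | |
| 16000 | 128 | 128 | 📦 | GTZAN | | Mel | | |
| 44100 | 350 | 128 | 📦 | GTZAN | | Linear | | |
| 44100 | 350 | 256 | 📦 | GTZAN | | Mel | | |
| 44100 | 350 | 128 | 📦 | GTZAN | | Mel | | |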
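To relate the Sample Rate and FPS columns, the short sketch below assumes that FPS means spectrogram frames per second, so that each frame advances by sample_rate / fps waveform samples. The helper names are hypothetical, and the exact framing and padding used by the models may differ.

```python
def samples_per_frame(sample_rate, fps):
    """Waveform samples covered by one frame step (assumes hop = sample_rate / fps)."""
    return sample_rate / fps

def expected_frames(duration_s, fps):
    """Approximate number of frames for a clip, ignoring any edge padding."""
    return int(round(duration_s * fps))

# The two (sample rate, FPS) pairs listed in the table above.
for sample_rate, fps in [(16_000, 128), (44_100, 350)]:
    hop = samples_per_frame(sample_rate, fps)
    print(f"fs={sample_rate} Hz, fps={fps}: {hop} samples per frame")

# A 5-second clip at 128 frames per second.
print(expected_frames(5, 128))
```

Under this assumption both configurations use a similar step of roughly 125 samples per frame, and a 5-second clip at 128 FPS yields about 640 frames.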

Quick Start

pip install -r requirements.txt

Please refer to the demo notebook, which shows how to load and use the model.

import numpy as np
import librosa
import torch
from sincnet.model import SincNet
from datasets.utils.waveform import WaveformLoader 


SAMPLE_RATE = 16_000
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
audio_loader = WaveformLoader(sample_rate=SAMPLE_RATE) 

# load the model
params = {
    "fs": SAMPLE_RATE,
    "fps": 128,
    "scale": "mel",
    "component": "complex",
    "causal": True,
    "q_bits": 8 
}

model : SincNet = (
    SincNet(**params)
    .load_pretrained_weights(weights_folder="pretrained", verbose=False)
    .eval()
    .to(device)
)

# encode and decode an audio waveform
duration = 5
offset = 0
audio_path = ...
# pass the variables defined above instead of repeating the literal values
waveform = audio_loader.load_segment(audio_path, offset=offset, duration=duration, nchannels=1)
loudness = audio_loader.measure_loudness(waveform)
waveform = audio_loader.normalise_loudness(waveform, loudness, target_lufs=-23)

# inference only: disable gradient tracking
with torch.no_grad():
    audio_tensor = torch.from_numpy(waveform).to(device).float()
    spectrogram = model.encode(audio_tensor.unsqueeze(0), quantize=True)
    reconstructed_audio_tensor = model.decode(spectrogram, dequantize=True)
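One simple way to sanity-check a reconstruction is a waveform signal-to-noise ratio. The helper below is a generic sketch and not part of this repository; `snr_db` is a hypothetical name, and the signals are synthetic stand-ins so the snippet runs on its own.

```python
import numpy as np

def snr_db(reference, estimate):
    """Signal-to-noise ratio in dB between a reference waveform and an estimate."""
    reference = np.asarray(reference, dtype=np.float64).ravel()
    estimate = np.asarray(estimate, dtype=np.float64).ravel()
    n = min(reference.size, estimate.size)  # tolerate a small length mismatch
    reference, estimate = reference[:n], estimate[:n]
    noise = reference - estimate
    eps = 1e-12  # avoids division by zero for identical or silent signals
    return 10.0 * np.log10((np.sum(reference**2) + eps) / (np.sum(noise**2) + eps))

# Synthetic stand-ins: a 440 Hz tone and a slightly noisy copy of it.
t = np.arange(16_000) / 16_000
clean = np.sin(2 * np.pi * 440 * t)
rng = np.random.default_rng(0)
noisy = clean + 0.01 * rng.standard_normal(clean.shape)

print(f"{snr_db(clean, noisy):.1f} dB")
```

With the quick-start variables, a call might look like `snr_db(waveform, reconstructed_audio_tensor.squeeze().cpu().numpy())`, assuming the decoder returns a tensor that flattens to the same samples as the input, which the README does not confirm. Plain SNR is also very sensitive to small time shifts or phase differences, so a low value does not necessarily mean the reconstruction sounds poor.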

Reference Papers and Related Topics

  • [1] Mirco Ravanelli, Yoshua Bengio, “Speaker Recognition from raw waveform with SincNet” Arxiv

  • [2] MS-SincResNet: Joint Learning of 1D and 2D Kernels Using Multi-scale SincNet and ResNet for Music Genre Classification Arxiv

  • [3] Curricular SincNet: Towards Robust Deep Speaker Recognition by Emphasizing Hard Samples in Latent Space Arxiv

  • [4] Interpretable SincNet-based Deep Learning for Emotion Recognition from EEG brain activity Arxiv

  • [5] Toward end-to-end interpretable convolutional neural networks for waveform signals Arxiv

  • [6] Filterbank design for end-to-end speech separation Arxiv. This paper decomposes SincNet into a sin * cos product, as implemented in this repo, and bridges the gap with Gabor filterbanks.

  • [7] PF-Net: Personalized Filter for Speaker Recognition from Raw Waveform Arxiv. This paper extends SincNet for more flexibility by allowing shapes other than the rectangular function in the spectral domain.

  • [8] MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis Arxiv

  • [9] iSTFTNet: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform Arxiv

  • [10] iSTFTNet2: Faster and More Lightweight iSTFT-Based Neural Vocoder Using 1D-2D CNN Arxiv

  • [11] Deep Griffin-Lim Iteration Arxiv

  • [12] Mel-Spectrogram Inversion via Alternating Direction Method of Multipliers Arxiv

  • [13] HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis Arxiv

Related discussion about SincNet vs STFT: mravanelli/SincNet#74

Usages and Implementations around SincNet

Roadmap and project status

  • Host weights on GitHub and add auto-download
  • Benchmark of inversion vs Griffin-Lim, iSTFTNet

About

Lightweight inversion of causal and non-causal SincNet spectrogram
