This is an in-progress repository for a music genre classification task.
The aim of this project is to investigate different methods of feature engineering for audio data, as well as different models to perform classification on those feature representations.
As of now, this project is a work in progress and, thus, is not very well documented. Please bear with me.
The project currently implements four different spectrogram features:
- MIDI pitch-based spectrogram: This is a binned version of the power spectrogram where frequencies are binned to their closest MIDI note.
- Chroma-based spectrogram: This extends the previous format by further folding each MIDI note into the corresponding pitch class of the 12-TET system.
- Mel-scale spectrogram: Similar to the MIDI spectrogram, this spectrogram bins the frequency axis according to the Mel scale. For more information on the Mel scale, see https://en.wikipedia.org/wiki/Mel_scale.
- Mel-frequency cepstral coefficients: The MFCC extends the idea of the Mel-scale spectrogram by performing an additional transform on a logarithmic Mel-scale spectrogram, creating a spectrum-of-a-spectrum (hence the name 'cepstrum'). For more information, see https://en.wikipedia.org/wiki/Mel-frequency_cepstrum.
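As a rough sketch of how these four representations can be computed (this is *not* the project's actual implementation; the binning details below are assumptions for illustration), the Mel spectrogram and MFCCs come straight from librosa, while the MIDI and chroma binning use the standard pitch formula m = 69 + 12 * log2(f / 440):

```python
# Illustrative sketch only; the repository's own transforms may differ.
import librosa
import numpy as np

y, sr = librosa.load("example.wav", sr=22050)  # hypothetical input file

# Power spectrogram from the STFT.
S = np.abs(librosa.stft(y)) ** 2

# MIDI pitch-based spectrogram: map each FFT frequency to its nearest
# MIDI note via m = 69 + 12 * log2(f / 440 Hz), then sum bins per note.
freqs = librosa.fft_frequencies(sr=sr)
midi = np.round(69 + 12 * np.log2(np.maximum(freqs, 1e-6) / 440.0))
midi = np.clip(midi, 0, 127).astype(int)
midi_spec = np.zeros((128, S.shape[1]))
np.add.at(midi_spec, midi, S)

# Chroma-based spectrogram: fold the 128 MIDI notes into the 12 pitch
# classes of the 12-TET system.
chroma_spec = np.zeros((12, S.shape[1]))
np.add.at(chroma_spec, np.arange(128) % 12, midi_spec)

# Mel-scale spectrogram, and MFCCs from its logarithm.
mel_spec = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
mfcc = librosa.feature.mfcc(S=librosa.power_to_db(mel_spec), n_mfcc=20)
```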
This project currently uses two datasets:
- The GTZAN dataset. You can unzip the dataset by running the following command:

  ```
  unzip assets/gtzan.zip -d assets
  ```

- The FMA dataset. To download this dataset, please navigate to the FMA repository (https://github.com/mdeff/fma) and follow their instructions. Note that `fma_metadata.zip` MUST be downloaded as well, apart from the audio files.
There is a training script available at `train.py`, which encapsulates the data loading, feature extraction, model building, training, and validation.
To use the train script, install the necessary requirements:

- Python 3.13+
- The Python dependencies:

  ```
  pip install -r requirements.txt
  ```

- An appropriate installation of PyTorch with CUDA; see https://pytorch.org/get-started/locally/.
Then, specify your training configuration in `train_config.yml`.
There are a variety of parameters you can configure; some are required, others are optional. The details of what each parameter means/does are described at the very bottom of the file.
Note that you are not required to use this exact filename: if you have multiple configurations you want to test out, you can put them in separate files with different filenames.
For `gtzan`, you only need to provide the root directory that contains the audio files. However, for `fma`, you must specify the root as two paths: the path to the metadata directory and the path to the audio directory.
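As a purely hypothetical sketch (all parameter names below are made up for illustration; consult the comments at the bottom of `train_config.yml` for the real ones), a configuration could look something like:

```yaml
# Hypothetical example; the real parameter names live in train_config.yml.
dataset: fma                     # or: gtzan
root:
  metadata: assets/fma_metadata  # FMA metadata directory
  audio: assets/fma_small        # FMA audio directory
feature: mfcc                    # e.g. midi, chroma, mel, or mfcc
sample_rate: 22050               # target rate for waveform normalization
```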
You may also use your own dataset, in which case, just specify its root path in `train_config.yml`.
You must ensure that all audio files in your dataset can be read by `librosa.load`.
Note that the implemented spectrogram transforms need to know the sampling rate beforehand, so any deviation from the expected sampling rate will yield incorrect results due to how frequency binning works.
Furthermore, deviations in the sampling rate will also cause the spectrograms to differ in the temporal dimension due to how the STFT works, which will cause errors when collating samples into batches during training.
To deal with this, you must specify a sampling rate in `train_config.yml` to which all loaded waveforms are normalized.
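For reference, this kind of normalization can happen at load time: passing a target rate to `librosa.load` makes it resample the waveform, so every file comes out at the same rate regardless of its native sampling rate (a minimal sketch; the training script's internals may differ):

```python
import librosa

TARGET_SR = 22050  # hypothetical value, read from train_config.yml

# librosa.load decodes the file and resamples it to the requested rate,
# so sr is guaranteed to equal TARGET_SR for any file it can read.
waveform, sr = librosa.load("some_track.mp3", sr=TARGET_SR)
assert sr == TARGET_SR
```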
Finally, run training with:

```
python train.py
```

This will set everything up according to the configuration given in `train_config.yml`.
You may also specify a different YAML configuration file with:

```
python train.py -cf path_to_file
```

All metrics and hyperparameters are logged to TensorBoard.
During training (or even after training), you can view the reported metrics by running:

```
tensorboard --logdir path/to/your/ckpt_dir
```

and then visiting http://localhost:6006. This URL will also be logged to your stdout when running `tensorboard`, so you can also click there to get redirected instead.