This repository contains code for an academic project that extracts audio features (MFCC, STFT, DWT), converts human responses to vectorized labels, and trains neural models (CNN / CNN→Transformer / multimodal attention) to predict listener yes/no song attributes and favorites. The code was written and run as part of a class project and uses a saved tensor dataset (data_tensors.pth) for model training and evaluation.
IMPORTANT: the audio files used to produce data_tensors.pth are not included in this repository due to copyright and file size limits. See the "Preparing audio data" section below for how to supply your own audio files so you can reproduce preprocessing and training.
## Primary tasks implemented
- Convert survey responses CSV (`responses.csv`) into structured JSON (`output.json`) — in `analysis.py`.
- Extract MFCC / STFT / Wavelet (DWT) features from audio and build normalized tensors — in `dataSets.py`.
- Train and evaluate neural models using the saved tensors (`data_tensors.pth`) — in `mfcc_rnn.py`, `multimodal.py`, and related scripts.
- Visualize example features (`visualizeData.py`).
## Example artifacts generated by the pipeline
- `output.json` — JSON array produced from `responses.csv` (survey → labels).
- `data_tensors.pth` — a PyTorch file containing preprocessed tensors for MFCC, STFT, DWT and the train/test labels.
- Figures printed/shown by `visualizeData.py` (MFCC / DWT / STFT examples).
## Repository files

- `analysis.py` — CSV → JSON conversion and vector encoding logic used to create `data.json`/`output.json` from `responses.csv`.
- `dataSets.py` — feature extraction (MFCC, STFT, DWT), normalization, and conversion to PyTorch tensors; saves `data_tensors.pth`.
- `mfcc.py`, `mfcc_rnn.py` — model architectures and training loops that use the MFCC tensors.
- `multimodal.py` — multimodal model combining DWT, STFT, and MFCC branches plus attention, with its training loop.
- `visualizeData.py` — quick plotting script to inspect MFCC / DWT / STFT tensors from `data_tensors.pth`.
- `dwt.py`, `stft.py` — helper/feature code (if present) referenced by the feature extraction functions.
- `data.json` — mapping of inputs (song identifiers) to label vectors (used by `dataSets.py`).
- `dummy_clustered_data.json` — extra data used as augmentation/extra training examples in `dataSets.py`.
- `responses.csv` — raw survey CSV used to build `output.json` (included in the repo).
- `data_tensors.pth` — saved tensors (already present in this repo) so training/evaluation can run without rerunning feature extraction.
## Requirements

- Python 3.8–3.11 recommended (tested on macOS).
- The project uses the following Python packages (approx):
- torch (PyTorch)
- torchaudio (optional; the code uses librosa for audio loading)
- librosa
- numpy
- scipy
- scikit-learn
- matplotlib
- pywt (PyWavelets)
- pandas
You can install the common dependencies with pip (note that the PyPI package for `pywt` is named `PyWavelets`):

```bash
python3 -m pip install torch librosa numpy scipy scikit-learn matplotlib PyWavelets pandas
```

If you have a CUDA-enabled GPU and a compatible PyTorch build, install torch according to PyTorch's instructions for your CUDA version.
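To confirm the installation, a quick import check like the one below can be run; it only imports the packages listed above and prints a couple of version numbers.

```python
# Quick sanity check that the main dependencies import correctly.
import torch
import librosa
import pywt          # import name for the PyWavelets package
import numpy
import scipy
import sklearn
import matplotlib
import pandas

print("torch", torch.__version__)
print("librosa", librosa.__version__)
print("PyWavelets", pywt.__version__)
```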
## Preparing audio data

The feature extraction code in `dataSets.py` expects audio files (MP3) to be available in `../Input Songs/` relative to `dataSets.py`. The functions call `librosa.load('../Input Songs/' + song_path + '.mp3', ...)`, so the project expects a directory structure like:
```text
<project-root>/Code/dataSets.py
<project-root>/Input Songs/<song-id>.mp3
```
- The `song_path` values come from `data.json`/`output.json`. Ensure the `input` value in `data.json` matches the filename (without the `.mp3` extension); a quick way to check this is sketched after this list.
- The audio files used when this project was run are not included here because they are copyrighted and large.
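The sketch below is one way to do that check. It assumes `data.json` is a list of records with an `input` field holding the song identifier; adjust it if your copy of the file is structured differently.

```python
import json
import os

AUDIO_DIR = "../Input Songs"  # path expected by dataSets.py

with open("data.json") as f:
    records = json.load(f)

# Assumes each record stores its song identifier under "input" (no ".mp3" extension).
missing = [r["input"] for r in records
           if not os.path.isfile(os.path.join(AUDIO_DIR, r["input"] + ".mp3"))]

print("Missing audio files:", missing if missing else "none")
```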
If you want to reproduce the preprocessing and training steps:
- Create an `Input Songs` folder at the repository root (or adjust `dataSets.py` to point at your audio path).
- Place MP3 files named exactly as the `input` fields in `data.json` (e.g., `MySong.mp3`).
- Optionally, open `dataSets.py` and confirm that `max_pad_len` and other parameters suit your audio durations (see the sketch after this list for how the padding is used).
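For orientation, the padding works roughly as in the sketch below: each MP3 is loaded with `librosa`, MFCCs are computed, and the feature matrix is padded or truncated to `max_pad_len` frames so every song yields an equally shaped tensor. The parameter values here (`n_mfcc`, `max_pad_len`) are illustrative; check `dataSets.py` for the ones actually used.

```python
import librosa
import numpy as np

def extract_mfcc(song_path, max_pad_len=1000, n_mfcc=40):
    """Illustrative version of the MFCC extraction and padding done in dataSets.py."""
    y, sr = librosa.load("../Input Songs/" + song_path + ".mp3")
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, n_frames)
    if mfcc.shape[1] < max_pad_len:
        # Pad short clips with zeros along the time axis.
        mfcc = np.pad(mfcc, ((0, 0), (0, max_pad_len - mfcc.shape[1])), mode="constant")
    else:
        # Truncate long clips to the fixed length.
        mfcc = mfcc[:, :max_pad_len]
    return mfcc
```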
Note: If you don't have the original audio files, a quick way to exercise the training code is to use the included `data_tensors.pth` (already in the repo). That file contains precomputed tensors and labels, so you can run training and visualization without audio files.
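A minimal way to inspect that file is to load it with `torch.load` and print the stored keys and shapes (this assumes the file is a dictionary of tensors, with key names like those listed under the expected outputs later in this README):

```python
import torch

data = torch.load("./data_tensors.pth")  # dictionary of preprocessed tensors and labels

# List each stored key with its shape to confirm the file loaded correctly.
for key, value in data.items():
    shape = tuple(value.shape) if hasattr(value, "shape") else type(value).__name__
    print(key, shape)
```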
## How to run

- (Optional) Convert the survey CSV to JSON (if you edited `responses.csv`):

```bash
python3 analysis.py
# This runs `csv_to_json('responses.csv', 'output.json')` by default.
```
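For orientation, the conversion is conceptually along these lines: read the survey CSV and encode yes/no answers as 0/1 label vectors. The column names and encoding below are purely illustrative; see `csv_to_json` in `analysis.py` for the real logic.

```python
import csv
import json

def csv_to_json_sketch(csv_path, json_path):
    """Illustrative only: turn yes/no survey answers into 0/1 label vectors."""
    rows = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            # Hypothetical layout: a "song" column identifies the track,
            # every other column is a yes/no survey answer.
            labels = [1 if value.strip().lower() == "yes" else 0
                      for column, value in row.items() if column != "song"]
            rows.append({"input": row["song"], "labels": labels})
    with open(json_path, "w") as f:
        json.dump(rows, f, indent=2)
```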
- (Optional) To regenerate the feature tensors from audio:
  - Add your MP3 files to `../Input Songs/` (see the previous section) and make sure `data.json` lists those inputs.
  - Run feature extraction to create `data_tensors.pth`:

```bash
python3 dataSets.py
# This script will read `data.json` and `dummy_clustered_data.json`, extract MFCC/DWT/STFT, normalize, and save `data_tensors.pth`.
```

- Train or evaluate models (using the pre-saved tensors):
  - Train the MFCC-based model (example):

```bash
python3 mfcc_rnn.py
# Uses ./data_tensors.pth to load X_train_mfcc, X_test_mfcc, y_train, y_test
```
  - Train the multimodal model (DWT + STFT + MFCC); a rough sketch of this kind of model follows the command below:

```bash
python3 multimodal.py
# Uses ./data_tensors.pth to load the multi-modal tensors and trains the MultiModalAttentionModel
```
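The sketch below shows one way a three-branch attention model over MFCC, STFT, and DWT features can be built. The branch design, hidden sizes, and attention mechanism are assumptions made for illustration; the actual `MultiModalAttentionModel` in `multimodal.py` may differ.

```python
import torch
import torch.nn as nn

class MultiModalAttentionSketch(nn.Module):
    """Illustrative three-branch model: one encoder per feature type, attention over branches."""
    def __init__(self, mfcc_dim, stft_dim, dwt_dim, hidden=128, n_labels=10):
        super().__init__()
        # One small encoder per modality, mapping flattened features to a shared hidden size.
        self.mfcc_enc = nn.Sequential(nn.Linear(mfcc_dim, hidden), nn.ReLU())
        self.stft_enc = nn.Sequential(nn.Linear(stft_dim, hidden), nn.ReLU())
        self.dwt_enc = nn.Sequential(nn.Linear(dwt_dim, hidden), nn.ReLU())
        # Attention scores over the three modality embeddings.
        self.attn = nn.Linear(hidden, 1)
        self.head = nn.Linear(hidden, n_labels)  # one logit per yes/no attribute

    def forward(self, mfcc, stft, dwt):
        # Stack the three modality embeddings: (batch, 3, hidden)
        branches = torch.stack(
            [self.mfcc_enc(mfcc), self.stft_enc(stft), self.dwt_enc(dwt)], dim=1)
        weights = torch.softmax(self.attn(branches), dim=1)  # (batch, 3, 1)
        fused = (weights * branches).sum(dim=1)              # weighted sum over modalities
        return self.head(fused)                              # raw logits (use BCEWithLogitsLoss)
```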
  - Run the simpler CNN training example (if present):

```bash
python3 mfcc.py
```

- Visualize example features:
```bash
python3 visualizeData.py
# This loads ./data_tensors.pth and shows example MFCC / DWT / STFT plots.
```

## Expected outputs

- `data_tensors.pth` — after running `dataSets.py`, check that this file exists and contains keys like `X_train_mfcc`, `X_test_mfcc`, `X_train_dwt`, `X_train_stft`, `y_train`, `y_test`, etc.
- Training scripts will print epoch-by-epoch loss and a final Hamming accuracy (or a similar metric); a sketch of how this metric can be computed follows this list. Look for printed lines like:
```text
Epoch [x/y], Loss: 0.xxx
Test Hamming Accuracy: 0.zzzz
```
- `visualizeData.py` will open Matplotlib figures showing MFCC / DWT / STFT examples.
- `analysis.py` will write `output.json` when run on `responses.csv`.
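Hamming accuracy here means the fraction of individual yes/no labels predicted correctly, averaged over all labels and test examples. Below is a minimal sketch of how such a metric can be computed from multi-label logits; the training scripts may compute it slightly differently.

```python
import torch

def hamming_accuracy(logits, targets, threshold=0.5):
    """Fraction of individual yes/no labels predicted correctly."""
    preds = (torch.sigmoid(logits) >= threshold).float()
    return (preds == targets).float().mean().item()

# Example: 2 songs, 3 yes/no attributes each.
logits = torch.tensor([[2.0, -1.0, 0.5], [-0.3, 1.2, -2.0]])
targets = torch.tensor([[1.0, 0.0, 1.0], [1.0, 1.0, 0.0]])
print(f"Test Hamming Accuracy: {hamming_accuracy(logits, targets):.4f}")  # 0.8333
```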
## Notes

- The `dataSets.py` code currently uses `../Input Songs/` + `<song>.mp3`. If your audio files are elsewhere, either move them or edit `dataSets.py` accordingly.
- The code uses `librosa` for audio loading and feature extraction; different `librosa` versions may produce slightly different results. If reproducibility matters, pin `librosa` to the version you used.
- For quick experimentation you can skip extraction and use the included `data_tensors.pth` file.
- The model code is intentionally compact and experimental (educational project). It contains TODOs and places where hyperparameters and regularization can be tuned.
## Troubleshooting

- If you see errors loading `data_tensors.pth`, confirm you are in the repository root and that the file path `./data_tensors.pth` is correct.
- If `librosa` raises an error loading MP3 files, ensure `ffmpeg` or `audioread` backends are available on your system (install `ffmpeg` via Homebrew on macOS: `brew install ffmpeg`).
- If GPU/CPU errors occur when loading/saving tensors, confirm your `torch` version and device availability. The code uses CPU tensors by default but will run on GPU if tensors and models are moved to CUDA (not done by default in these scripts); the usual pattern is sketched below.
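If you do want to try the GPU, the usual PyTorch pattern (not wired into these scripts by default) is to pick a device and move both the model and each batch onto it. The model and variable names in the comments below are placeholders.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

# model = MultiModalAttentionModel(...).to(device)  # move the model once
# x_batch = x_batch.to(device)                      # move each input batch
# y_batch = y_batch.to(device)                      # ...and the labels before computing the loss
```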
## Quick start

- Install the dependencies.
- Place your MP3s in `Input Songs/` (or use the provided `data_tensors.pth`).
- Run `python3 analysis.py` if you changed `responses.csv`.
- Run `python3 dataSets.py` to generate `data_tensors.pth` (skip this if you are using the provided `data_tensors.pth`).
- Train models: `python3 mfcc_rnn.py` or `python3 multimodal.py`.
- Visualize with `python3 visualizeData.py`.
This code is part of a class project. If you have questions about reproducing experiments or the input data, please open an issue or contact the project author (add contact details here).
Note: this README was generated to document the current codebase. The included data_tensors.pth allows running and testing models without the original audio files (which are not included for copyright and size reasons).