This repository contains code for an academic project that extracts audio features (MFCC, STFT, DWT), converts human responses to vectorized labels, and trains neural models (CNN / CNN→Transformer / multimodal attention) to predict listener yes/no song attributes and favorites. The code was written and run as part of a class project and uses a saved tensor dataset (data_tensors.pth) for model training and evaluation.
IMPORTANT: the audio files used to produce data_tensors.pth are not included in this repository due to copyright and file size limits. See the "Preparing audio data" section below for how to supply your own audio files so you can reproduce preprocessing and training.
## Primary tasks implemented
- Convert survey responses CSV (`responses.csv`) into structured JSON (`output.json`) — in `analysis.py`.
- Extract MFCC / STFT / Wavelet (DWT) features from audio and build normalized tensors — in `dataSets.py`.
- Train and evaluate neural models using the saved tensors (`data_tensors.pth`) — in `mfcc_rnn.py`, `multimodal.py`, and related scripts.
- Visualize example features (`visualizeData.py`).
## Example artifacts generated by the pipeline
- `output.json` — JSON array produced from `responses.csv` (survey → labels).
- `data_tensors.pth` — a PyTorch file containing preprocessed tensors for MFCC, STFT, DWT and the train/test labels.
- Figures printed/shown by `visualizeData.py` (MFCC / DWT / STFT examples).
## Repository files

- `analysis.py` — CSV → JSON conversion and vector encoding logic used to create `data.json`/`output.json` from `responses.csv`.
- `dataSets.py` — feature extraction (MFCC, STFT, DWT), normalization, and conversion to PyTorch tensors; saves `data_tensors.pth`.
- `mfcc.py`, `mfcc_rnn.py` — model architectures and training loops that use the MFCC tensors.
- `multimodal.py` — multimodal model combining DWT, STFT, and MFCC branches plus attention, with its training loop.
- `visualizeData.py` — quick plotting script to inspect MFCC / DWT / STFT tensors from `data_tensors.pth`.
- `dwt.py`, `stft.py` — helper/feature code (if present) referenced by the feature extraction functions.
- `data.json` — mapping of inputs (song identifiers) to label vectors (used by `dataSets.py`).
- `dummy_clustered_data.json` — extra data used as augmentation/extra training examples in `dataSets.py`.
- `responses.csv` — raw survey CSV used to build `output.json` (included in the repo).
- `data_tensors.pth` — saved tensors (already present in this repo) so training/evaluation can run without rerunning feature extraction.
## Requirements

- Python 3.8–3.11 recommended (tested on macOS).
- The project uses the following Python packages (approx):
- torch (PyTorch)
- torchaudio (optional; the code uses librosa for audio loading)
- librosa
- numpy
- scipy
- scikit-learn
- matplotlib
- pywt (PyWavelets)
- pandas
You can install the common dependencies with pip (note that the PyPI package for `pywt` is named `PyWavelets`):

```bash
python3 -m pip install torch librosa numpy scipy scikit-learn matplotlib PyWavelets pandas
```

If you have a CUDA-enabled GPU and a compatible PyTorch build, install torch according to PyTorch's instructions for your CUDA version.
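To confirm the installation, a quick import check like the one below can be run; it only imports the packages listed above and prints a couple of version numbers.

```python
# Quick sanity check that the main dependencies import correctly.
import torch
import librosa
import pywt          # import name for the PyWavelets package
import numpy
import scipy
import sklearn
import matplotlib
import pandas

print("torch", torch.__version__)
print("librosa", librosa.__version__)
print("PyWavelets", pywt.__version__)
```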
## Preparing audio data

The feature extraction code in `dataSets.py` expects audio files (MP3) to be available in `../Input Songs/` relative to `dataSets.py`. The functions call `librosa.load('../Input Songs/' + song_path + '.mp3', ...)`, so the project expects a directory structure like:
```text
<project-root>/Code/dataSets.py
<project-root>/Input Songs/<song-id>.mp3
```
- The `song_path` values come from `data.json`/`output.json`. Ensure the `input` value in `data.json` matches the filename (without the `.mp3` extension); a quick way to check this is sketched after this list.
- The audio files used when this project was run are not included here because they are copyrighted and large.
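The sketch below is one way to do that check. It assumes `data.json` is a list of records with an `input` field holding the song identifier; adjust it if your copy of the file is structured differently.

```python
import json
import os

AUDIO_DIR = "../Input Songs"  # path expected by dataSets.py

with open("data.json") as f:
    records = json.load(f)

# Assumes each record stores its song identifier under "input" (no ".mp3" extension).
missing = [r["input"] for r in records
           if not os.path.isfile(os.path.join(AUDIO_DIR, r["input"] + ".mp3"))]

print("Missing audio files:", missing if missing else "none")
```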
If you want to reproduce the preprocessing and training steps:
- Create an `Input Songs` folder at the repository root (or adjust `dataSets.py` to point at your audio path).
- Place MP3 files named exactly as the `input` fields in `data.json` (e.g., `MySong.mp3`).
- Optionally, open `dataSets.py` and confirm that `max_pad_len` and other parameters suit your audio durations (see the sketch after this list for how the padding is used).
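For orientation, the padding works roughly as in the sketch below: each MP3 is loaded with `librosa`, MFCCs are computed, and the feature matrix is padded or truncated to `max_pad_len` frames so every song yields an equally shaped tensor. The parameter values here (`n_mfcc`, `max_pad_len`) are illustrative; check `dataSets.py` for the ones actually used.

```python
import librosa
import numpy as np

def extract_mfcc(song_path, max_pad_len=1000, n_mfcc=40):
    """Illustrative version of the MFCC extraction and padding done in dataSets.py."""
    y, sr = librosa.load("../Input Songs/" + song_path + ".mp3")
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, n_frames)
    if mfcc.shape[1] < max_pad_len:
        # Pad short clips with zeros along the time axis.
        mfcc = np.pad(mfcc, ((0, 0), (0, max_pad_len - mfcc.shape[1])), mode="constant")
    else:
        # Truncate long clips to the fixed length.
        mfcc = mfcc[:, :max_pad_len]
    return mfcc
```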
Note: If you don't have the original audio files, a quick way to exercise the training code is to use the included `data_tensors.pth` (already in the repo). That file contains precomputed tensors and labels, so you can run training and visualization without audio files.
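A minimal way to inspect that file is to load it with `torch.load` and print the stored keys and shapes (this assumes the file is a dictionary of tensors, with key names like those listed under the expected outputs later in this README):

```python
import torch

data = torch.load("./data_tensors.pth")  # dictionary of preprocessed tensors and labels

# List each stored key with its shape to confirm the file loaded correctly.
for key, value in data.items():
    shape = tuple(value.shape) if hasattr(value, "shape") else type(value).__name__
    print(key, shape)
```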
## How to run

- (Optional) Convert the survey CSV to JSON (if you edited `responses.csv`):

```bash
python3 analysis.py
# This runs `csv_to_json('responses.csv', 'output.json')` by default.
```
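For orientation, the conversion is conceptually along these lines: read the survey CSV and encode yes/no answers as 0/1 label vectors. The column names and encoding below are purely illustrative; see `csv_to_json` in `analysis.py` for the real logic.

```python
import csv
import json

def csv_to_json_sketch(csv_path, json_path):
    """Illustrative only: turn yes/no survey answers into 0/1 label vectors."""
    rows = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            # Hypothetical layout: a "song" column identifies the track,
            # every other column is a yes/no survey answer.
            labels = [1 if value.strip().lower() == "yes" else 0
                      for column, value in row.items() if column != "song"]
            rows.append({"input": row["song"], "labels": labels})
    with open(json_path, "w") as f:
        json.dump(rows, f, indent=2)
```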
- (Optional) To regenerate the feature tensors from audio:
  - Add your MP3 files to `../Input Songs/` (see the previous section) and make sure `data.json` lists those inputs.
  - Run feature extraction to create `data_tensors.pth`:

```bash
python3 dataSets.py
# This script will read `data.json` and `dummy_clustered_data.json`, extract MFCC/DWT/STFT, normalize, and save `data_tensors.pth`.
```

- Train or evaluate models (using the pre-saved tensors):
  - Train the MFCC-based model (example):

```bash
python3 mfcc_rnn.py
# Uses ./data_tensors.pth to load X_train_mfcc, X_test_mfcc, y_train, y_test
```
  - Train the multimodal model (DWT + STFT + MFCC); a rough sketch of this kind of model follows the command below:

```bash
python3 multimodal.py
# Uses ./data_tensors.pth to load the multi-modal tensors and trains the MultiModalAttentionModel
```
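The sketch below shows one way a three-branch attention model over MFCC, STFT, and DWT features can be built. The branch design, hidden sizes, and attention mechanism are assumptions made for illustration; the actual `MultiModalAttentionModel` in `multimodal.py` may differ.

```python
import torch
import torch.nn as nn

class MultiModalAttentionSketch(nn.Module):
    """Illustrative three-branch model: one encoder per feature type, attention over branches."""
    def __init__(self, mfcc_dim, stft_dim, dwt_dim, hidden=128, n_labels=10):
        super().__init__()
        # One small encoder per modality, mapping flattened features to a shared hidden size.
        self.mfcc_enc = nn.Sequential(nn.Linear(mfcc_dim, hidden), nn.ReLU())
        self.stft_enc = nn.Sequential(nn.Linear(stft_dim, hidden), nn.ReLU())
        self.dwt_enc = nn.Sequential(nn.Linear(dwt_dim, hidden), nn.ReLU())
        # Attention scores over the three modality embeddings.
        self.attn = nn.Linear(hidden, 1)
        self.head = nn.Linear(hidden, n_labels)  # one logit per yes/no attribute

    def forward(self, mfcc, stft, dwt):
        # Stack the three modality embeddings: (batch, 3, hidden)
        branches = torch.stack(
            [self.mfcc_enc(mfcc), self.stft_enc(stft), self.dwt_enc(dwt)], dim=1)
        weights = torch.softmax(self.attn(branches), dim=1)  # (batch, 3, 1)
        fused = (weights * branches).sum(dim=1)              # weighted sum over modalities
        return self.head(fused)                              # raw logits (use BCEWithLogitsLoss)
```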
  - Run the simpler CNN training example (if present):

```bash
python3 mfcc.py
```

- Visualize example features:
```bash
python3 visualizeData.py
# This loads ./data_tensors.pth and shows example MFCC / DWT / STFT plots.
```

## Expected outputs

- `data_tensors.pth` — after running `dataSets.py`, check that this file exists and contains keys like `X_train_mfcc`, `X_test_mfcc`, `X_train_dwt`, `X_train_stft`, `y_train`, `y_test`, etc.
- Training scripts will print epoch-by-epoch loss and a final Hamming accuracy (or a similar metric); a sketch of how this metric can be computed follows this list. Look for printed lines like:
```text
Epoch [x/y], Loss: 0.xxx
Test Hamming Accuracy: 0.zzzz
```
- `visualizeData.py` will open Matplotlib figures showing MFCC / DWT / STFT examples.
- `analysis.py` will write `output.json` when run on `responses.csv`.
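Hamming accuracy here means the fraction of individual yes/no labels predicted correctly, averaged over all labels and test examples. Below is a minimal sketch of how such a metric can be computed from multi-label logits; the training scripts may compute it slightly differently.

```python
import torch

def hamming_accuracy(logits, targets, threshold=0.5):
    """Fraction of individual yes/no labels predicted correctly."""
    preds = (torch.sigmoid(logits) >= threshold).float()
    return (preds == targets).float().mean().item()

# Example: 2 songs, 3 yes/no attributes each.
logits = torch.tensor([[2.0, -1.0, 0.5], [-0.3, 1.2, -2.0]])
targets = torch.tensor([[1.0, 0.0, 1.0], [1.0, 1.0, 0.0]])
print(f"Test Hamming Accuracy: {hamming_accuracy(logits, targets):.4f}")  # 0.8333
```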
## Notes

- The `dataSets.py` code currently uses `../Input Songs/` + `<song>.mp3`. If your audio files are elsewhere, either move them or edit `dataSets.py` accordingly.
- The code uses `librosa` for audio loading and feature extraction; different `librosa` versions may produce slightly different results. If reproducibility matters, pin `librosa` to the version you used.
- For quick experimentation you can skip extraction and use the included `data_tensors.pth` file.
- The model code is intentionally compact and experimental (educational project). It contains TODOs and places where hyperparameters and regularization can be tuned.
## Troubleshooting

- If you see errors loading `data_tensors.pth`, confirm you are in the repository root and that the file path `./data_tensors.pth` is correct.
- If `librosa` raises an error loading MP3 files, ensure `ffmpeg` or `audioread` backends are available on your system (install `ffmpeg` via Homebrew on macOS: `brew install ffmpeg`).
- If GPU/CPU errors occur when loading/saving tensors, confirm your `torch` version and device availability. The code uses CPU tensors by default but will run on GPU if tensors and models are moved to CUDA (not done by default in these scripts); the usual pattern is sketched below.
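If you do want to try the GPU, the usual PyTorch pattern (not wired into these scripts by default) is to pick a device and move both the model and each batch onto it. The model and variable names in the comments below are placeholders.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

# model = MultiModalAttentionModel(...).to(device)  # move the model once
# x_batch = x_batch.to(device)                      # move each input batch
# y_batch = y_batch.to(device)                      # ...and the labels before computing the loss
```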
## Quick start

- Install the dependencies.
- Place your MP3s in `Input Songs/` (or use the provided `data_tensors.pth`).
- Run `python3 analysis.py` if you changed `responses.csv`.
- Run `python3 dataSets.py` to generate `data_tensors.pth` (skip this if you are using the provided `data_tensors.pth`).
- Train models: `python3 mfcc_rnn.py` or `python3 multimodal.py`.
- Visualize with `python3 visualizeData.py`.
This code is part of a class project. If you have questions about reproducing experiments or the input data, please open an issue or contact the project author (add contact details here).
Note: this README was generated to document the current codebase. The included data_tensors.pth allows running and testing models without the original audio files (which are not included for copyright and size reasons).