CRAB: Multi-layer contrastive supervision to improve speech emotion recognition under both prompted and natural conditions
Official implementation of the CRAB paper
CRAB is a Speech Emotion Recognition system based on Contrastive Representation and Multimodal Aligned Bottleneck, a framework that leverages contrastive learning and multimodal alignment to build robust emotional representations from speech. It uses a bi-modal cross-modal Transformer architecture on top of WavLM and RoBERTa features, and applies a Multi-Positive Contrastive Learning (MPCL) loss at different layers of the model to improve speech emotion recognition.
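The exact form of the MPCL loss is defined in the paper; as a rough illustration of the idea, a contrastive objective with multiple positives per anchor (samples sharing the same emotion label) can be sketched as below. This is a minimal NumPy sketch under stated assumptions, not the repository's implementation; the function name and temperature value are illustrative.

```python
import numpy as np

def multi_positive_contrastive_loss(embeddings, labels, temperature=0.1):
    """Sketch of a multi-positive contrastive loss: samples sharing an
    emotion label are positives, the rest of the batch are negatives.
    Illustrative only, not the paper's implementation."""
    # L2-normalize so dot products become cosine similarities.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = (z @ z.T) / temperature
    n = len(labels)
    self_mask = np.eye(n, dtype=bool)
    # Positives: same label, excluding the anchor itself.
    pos = (labels[:, None] == labels[None, :]) & ~self_mask

    # Row-wise log-softmax over all non-self pairs.
    logits = np.where(self_mask, -np.inf, sim)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Average -log p over each anchor's positives (anchors with no positive
    # in the batch are skipped).
    has_pos = pos.any(axis=1)
    per_anchor = -np.where(pos, log_prob, 0.0).sum(axis=1)
    return (per_anchor[has_pos] / pos.sum(axis=1)[has_pos]).mean()
```

With well-separated emotion clusters the positives dominate the softmax and the loss is small; mixing up the labels drives it up, which is what pushes same-emotion utterances together in embedding space.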
We provide a setup script that assumes a Conda installation. It will automatically create a new environment named crab and install all dependencies.
```sh
sh make_crab_env.sh
```

Alternatively, you can install the dependencies directly:

```sh
pip install -r requirements.txt
```

Repository layout:

```
crab/
├── bin/            # Training and inference scripts
├── src/            # Main source code
└── recipes/
    └── {dataset}/  # Dataset-specific recipes for training and inference
```

- `bin/`: entry-point scripts for launching training and running inference.
- `src/`: core model architecture, data loaders, and utilities.
- `recipes/`: ready-to-use configurations for supported datasets. Use the provided examples as a starting point to adapt CRAB to your own dataset.
We provide SLURM-ready scripts for HPC environments inside the recipes/ folder.
Navigate to the corresponding recipe folder and submit the job:
```sh
cd recipes/{dataset}
sbatch train_crab.sh
sbatch test_crab.sh
```

Each experiment will automatically create an experiment folder containing all corresponding logging files and checkpoints.
Citation coming soon (paper under review):
```bibtex
@article{ueda2026crab,
  title  = {CRAB: Multi-layer contrastive supervision to improve speech emotion recognition under both prompted and natural conditions},
  year   = {2026},
  author = {Ueda, Lucas H. and Lima, João G. T. and Costa, Paula D. P.},
  note   = {Coming soon}
}
```