# Contrastive Learning for Task-Independent SpeechLLM-Pretraining

Large language models (LLMs) excel at natural language processing, but adapting them to speech processing tasks efficiently is not straightforward: direct task-specific fine-tuning is limited by overfitting risks, data requirements, and computational costs. To address these challenges, we propose a scalable, two-stage training approach: (1) a task-independent speech pretraining stage that uses contrastive learning to align text and speech representations across all layers, followed by (2) a task-specific fine-tuning stage requiring minimal data. This approach outperforms traditional ASR pretraining and enables the model to surpass specialized speech translation and question answering models while being trained on only 10% of the task-specific data.
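The layer-wise contrastive alignment can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the pooling of per-layer states into one vector per utterance, the symmetric InfoNCE form, and the temperature value are all assumptions.

```python
import torch
import torch.nn.functional as F

def layerwise_contrastive_loss(speech_states, text_states, temperature=0.07):
    """InfoNCE-style contrastive loss averaged over layers.

    speech_states, text_states: lists of (batch, dim) tensors, one per layer,
    holding pooled speech and text representations for paired utterances.
    Matching speech/text pairs share a batch index.
    """
    total = 0.0
    for s, t in zip(speech_states, text_states):
        s = F.normalize(s, dim=-1)
        t = F.normalize(t, dim=-1)
        logits = s @ t.T / temperature      # (batch, batch) similarity matrix
        labels = torch.arange(s.size(0))    # positives lie on the diagonal
        # symmetric loss: speech-to-text and text-to-speech directions
        total = total + 0.5 * (F.cross_entropy(logits, labels)
                               + F.cross_entropy(logits.T, labels))
    return total / len(speech_states)
```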

An overview of our approach can be seen below.


This repository contains the code used for this project. The codebase started as a fork of LLaVA; please see here for its main contributors. It was further developed by the meetween team.

## Contents

- [Install](#install)
- [Data](#data)
- [Codebase](#codebase)
- [The model](#the-model)
- [Mixed-Speech-Text Inputs](#mixed-speech-text-inputs)

## Install

1. Clone this repository and navigate to the contr-pretraining folder:

```shell
git clone https://github.com/MaikeZuefle/contr-pretraining.git
cd contr-pretraining
```

2. Install dependencies:

```shell
conda create -n llava-contr python=3.10 -y
conda activate llava-contr
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
pip install flash-attn==2.5.9.post1 --no-build-isolation  # install flash-attn
```

## Data

We use several different datasets in this project.

Our code expects these datasets to be in Parquet format. The data is then loaded with `llava/dataset/custom_dataset`. The path where the data is stored needs to be added to `scripts/data_configs`.

## Codebase

### Training Scripts

Example pretraining and finetuning scripts can be found in `scripts/pretrain` and `scripts/finetune`. They can be launched with:

```shell
bash scripts/pretrain/example_script.sh
```

### Inference Scripts

For inference, run the following command. Example data configs can be found in `llava/eval/data_configs`; adjust `--batch-size` to your needs.

```shell
python llava/eval/eval.py \
    --model-path path-to-your-model-folder \
    --dataset path-to-the-data-config/config.yml \
    --data-dir data-path \
    --results-dir results-directory-path \
    --batch-size 2 \
    --tokenizer-padding-side left \
    --from-yml
```

## The model

Our model combines a HuBERT encoder, a Q-Former projector, and Llama-3.1-8B-Instruct. In this work we train only the projector, and consequently only the projector is saved after training.
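A minimal sketch of this projector-only setup: the encoder and LLM are frozen and only the projector's parameters reach the optimizer. The `QFormerProjector` class here is a hypothetical stand-in, not the codebase's actual module.

```python
import torch.nn as nn

class QFormerProjector(nn.Module):
    """Hypothetical stand-in: maps speech-encoder features to the LLM embedding size."""
    def __init__(self, enc_dim=768, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(enc_dim, llm_dim)

    def forward(self, x):
        return self.proj(x)

def trainable_parameters(encoder, projector, llm):
    """Freeze the speech encoder and the LLM; only the projector is trained."""
    for p in encoder.parameters():
        p.requires_grad = False
    for p in llm.parameters():
        p.requires_grad = False
    return [p for p in projector.parameters() if p.requires_grad]
```

The returned list is what would be handed to the optimizer, so checkpointing only needs to save the projector's state dict.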

The checkpoint folder name must start with `llava-`; otherwise the checkpoint cannot be loaded correctly. This behaviour is inherited from the original LLaVA codebase.
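A tiny helper to catch a mis-named checkpoint folder before loading (illustrative, not part of the codebase):

```python
from pathlib import Path

def check_checkpoint_name(folder):
    """Raise early if the folder name would break LLaVA-style loading."""
    name = Path(folder).name
    if not name.startswith("llava-"):
        raise ValueError(f"checkpoint folder {name!r} must start with 'llava-'")
    return True
```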

## Mixed-Speech-Text Inputs

To obtain mixed speech-text inputs, run the ASR pretraining script with the `--audio_nwp` flag added. This creates the mixed speech-text input data without starting training. Once the data has been created, training can be started with `scripts/pretrain/pretrain-mixed-nwp.sh`.
