Image captioning model composed of 3 modules: 1) a decoder-only language model (OPT) for generating text, 2) a vision-language model (CLIP) for aligned representations of images and text, 3) an embedding mapper that maps CLIP embeddings to k OPT word embeddings.
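To make the role of the mapper concrete, here is a minimal sketch of how one CLIP embedding could be turned into k prefix embeddings of OPT's word-embedding size. The class name `PrefixMapper`, the single linear layer, and the dimensions (512 for CLIP, 768 for OPT) are assumptions for illustration only; the mapper actually used in this repository may be built differently.

```python
import torch
from torch import nn


class PrefixMapper(nn.Module):
    """Hypothetical mapper: one CLIP embedding -> k OPT-sized prefix embeddings."""

    def __init__(self, clip_dim: int = 512, opt_dim: int = 768, k: int = 10):
        super().__init__()
        self.k = k
        self.opt_dim = opt_dim
        # Assumption: a single linear projection; the real mapper may be an MLP or similar.
        self.proj = nn.Linear(clip_dim, k * opt_dim)

    def forward(self, clip_embedding: torch.Tensor) -> torch.Tensor:
        # (batch, clip_dim) -> (batch, k * opt_dim) -> (batch, k, opt_dim)
        return self.proj(clip_embedding).view(-1, self.k, self.opt_dim)


mapper = PrefixMapper()
clip_embeddings = torch.randn(2, 512)  # random stand-in for two CLIP image embeddings
prefix = mapper(clip_embeddings)       # shape: (batch, k, opt_dim)
print(prefix.shape)
```

In a prefix-style setup like this, the resulting tensor would typically be concatenated with the caption token embeddings and passed to the language model as input embeddings, so that the k vectors act as a learned "visual prompt".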
Some examples from the COCO dataset, after training for only 2 epochs while learning a prefix of length 10 (k=10):
- create and activate a python 3.12 env
conda create -n capincho python=3.12
conda activate capincho
- install cuda toolkit 11.8
conda install conda-forge::cudatoolkit
- install pytorch compatible with cuda
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
- install requirements
pip install -r requirements.txt
- install radgraph compatible with python 3.12, for MIMIC evaluation
conda install git
pip install git+https://github.com/aehrc/radgraph.git
Check the following files:
- extractFeatures.py to extract feature vectors from the COCO dataset using CLIP or OpenCLIP.
- trainDecoder.py to train the mapper module and fine-tune OPT or an OPT LoRA model.
- evaluateCaptioning.py to qualitatively evaluate results.

