TL;DR. CLIP often fails to understand negation in text. To address this, we propose data generation pipelines that produce negation-inclusive captions and validate their effectiveness on public benchmarks as well as our newly introduced NegRefCOCOg benchmark. The resulting NegationCLIP models show improved negation awareness and extend to diverse multimodal tasks.
- ✅ Preprint available on arXiv
- ✅ NegationCLIP checkpoints available on Hugging Face
- ✅ Data generation & fine-tuning scripts included
- ✅ NegRefCOCOg benchmark included
```bash
git clone https://github.com/jerryray/NegationCLIP.git
cd NegationCLIP
conda env create -f environment.yml
conda activate negationclip
```
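After activating the environment, a quick import check can confirm the core dependencies are in place. This is only a sanity-check sketch; it assumes the environment installs PyTorch and the OpenAI `clip` package (see `environment.yml` / `requirements.txt` for the authoritative dependency list).

```python
# Environment sanity check (assumes PyTorch and the OpenAI `clip` package
# are installed by environment.yml; adjust if the environment differs).
import torch
import clip

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CLIP architectures:", clip.available_models())
```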
The script generates captions with explicit negation from COCO using LLaMA-3 and LLaVA-v1.6-Mistral-7B.

```bash
python src/data_generation.py \
    --caption_path /path/to/COCO/captions_train2014.json \
    --image_dir /path/to/COCO/train2014 \
    --output_dir ./output
```

**Options**
- `--use_random_object`: randomly select absent objects (instead of contextual ones)
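The repository also ships a pre-generated caption file under `annotations/`. A quick way to eyeball the output is to load it with `json` and print a few entries; the snippet below makes no assumption about the exact schema beyond the file being valid JSON.

```python
import json

# Peek at the negation-inclusive captions shipped with (or generated by) the repo.
# The exact schema is not documented here, so we only print raw entries.
with open("./annotations/negationclip_captions_train2014.json") as f:
    data = json.load(f)

sample = data[:3] if isinstance(data, list) else list(data.items())[:3]
for entry in sample:
    print(entry)
```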
Fine-tune the text encoder of CLIP on negation-inclusive captions:
```bash
python src/clip_finetune.py \
    --json_path ./annotations/negationclip_captions_train2014.json \
    --image_dir /path/to/train2014 \
    --output_dir ./checkpoints \
    --clip_model "ViT-B/32"
```

**Outputs**
- Best model automatically saved when validation loss improves.
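The saved checkpoint can then be loaded back into a standard CLIP model for inference. The sketch below is a hedged example: it assumes the checkpoint is a state dict compatible with the OpenAI `clip` package and uses placeholder file names (`best_model.pt`, `example.jpg`); adapt it to however `clip_finetune.py` actually serializes its checkpoints.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the base architecture, then overwrite weights with the fine-tuned
# checkpoint (assumed to be a plain state dict; file name is a placeholder).
model, preprocess = clip.load("ViT-B/32", device=device)
state = torch.load("./checkpoints/best_model.pt", map_location=device)
model.load_state_dict(state, strict=False)
model.eval()

# Score one image against an affirmative and a negated caption.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
texts = clip.tokenize(["a photo with a dog", "a photo with no dog"]).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, texts)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print(probs)  # probability mass over the two candidate captions
```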
Evaluate NegationCLIP on the NegRefCOCOg benchmark:
```bash
cd NegRefCOCOg
python negrefcocog_eval.py \
    --arch "ViT-B/16" \
    --load_dir /path/to/checkpoint.pt \
    --device "cuda:1" \
    --annotation_file "NegRefCOCOg.json" \
    --image_dir /path/to/coco_images/train2014
```
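For a quick look at the benchmark data itself, `NegRefCOCOg.json` can be inspected directly. The snippet assumes nothing about the annotation schema; see `negrefcocog_eval.py` and `refer.py` for the actual parsing logic.

```python
import json

# Inspect the NegRefCOCOg annotation file; schema details are handled by
# negrefcocog_eval.py / refer.py, so we only report size and a raw sample.
with open("NegRefCOCOg/NegRefCOCOg.json") as f:
    anns = json.load(f)

print(type(anns).__name__, len(anns))
```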
Project structure:

```
negationclip/
├── src/
│   ├── clip_finetune.py
│   └── data_generation.py
├── annotations/
│   └── negationclip_captions_train2014.json
├── NegRefCOCOg/
│   ├── negrefcocog_eval.py
│   ├── refer.py
│   ├── NegRefCOCOg.json
│   └── external/
├── requirements.txt
├── environment.yml
├── README.md
└── LICENSE
```
- Hugging Face: jerryray/negationclip
- Model Type: Fine-tuned CLIP (ViT-B/32, ViT-B/16, ViT-L/14, ViT-L/14@336px)
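To try the released checkpoints without fine-tuning locally, one option is to download them from the Hugging Face repository and load them as in the inference sketch above. The file name below is a placeholder; check the `jerryray/negationclip` model card for the actual checkpoint names and formats.

```python
import torch
import clip
from huggingface_hub import hf_hub_download

device = "cuda" if torch.cuda.is_available() else "cpu"

# Download a checkpoint from the Hub. The filename is a placeholder --
# substitute one listed on the jerryray/negationclip model card.
ckpt_path = hf_hub_download(repo_id="jerryray/negationclip",
                            filename="negationclip_vitb32.pt")

model, preprocess = clip.load("ViT-B/32", device=device)
model.load_state_dict(torch.load(ckpt_path, map_location=device), strict=False)
model.eval()
```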