Code for the paper "BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks".
In this paper, we focus on black-box defense for VLMs against jailbreak attacks. Existing black-box defense methods are either unimodal or bimodal. Unimodal methods enhance either the vision or language module of the VLM, while bimodal methods robustify the model through text-image representation realignment. However, these methods suffer from two limitations: 1) they fail to fully exploit the cross-modal information, or 2) they degrade the model performance on benign inputs. To address these limitations, we propose a novel blue-team method, BlueSuffix, that defends target VLMs against jailbreak attacks in a black-box setting without compromising their performance. BlueSuffix includes three key components: 1) a visual purifier against jailbreak images, 2) a textual purifier against jailbreak texts, and 3) a blue-team suffix generator using reinforcement fine-tuning for enhancing cross-modal robustness. We empirically show on four VLMs (LLaVA, MiniGPT-4, InstructionBLIP, and Gemini) and four safety benchmarks (Harmful Instruction, AdvBench, MM-SafetyBench, and RedTeam-2K) that BlueSuffix outperforms the baseline defenses by a significant margin. BlueSuffix opens up a promising direction for defending VLMs against jailbreak attacks.
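The three components above can be read as a simple pipeline around a black-box VLM. The following is a minimal sketch of that flow, not the repository's actual interface: the function `defend_query` and its four callable parameters are hypothetical names, and it assumes the generated suffix is appended to the purified text prompt.

```python
from typing import Any, Callable


def defend_query(
    image: Any,
    prompt: str,
    purify_image: Callable[[Any], Any],    # hypothetical: visual purifier
    purify_text: Callable[[str], str],     # hypothetical: textual purifier
    generate_suffix: Callable[[str], str], # hypothetical: blue-team suffix generator
    query_vlm: Callable[[Any, str], str],  # hypothetical: black-box target VLM
) -> str:
    """Purify both modalities, append a defensive suffix, then query the VLM."""
    clean_image = purify_image(image)
    clean_prompt = purify_text(prompt)
    # Assumption: the suffix is conditioned on the purified prompt and appended to it.
    suffix = generate_suffix(clean_prompt)
    return query_vlm(clean_image, f"{clean_prompt} {suffix}")
```

Because the VLM is only reached through `query_vlm`, the sketch never touches model weights, which is what the black-box setting requires.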
Python 3.11.5
pip install -r requirements.txt
Please follow the official guidelines:
[1] LLaVA: https://github.com/haotian-liu/LLaVA
[2] MiniGPT-4: https://github.com/Vision-CAIR/MiniGPT-4
cd Image_Purifier
python image_purifier.py --config configs/diffpure.yml --log_dir path/to/your/fold
The Image Purifier code is based on DiffPure.
python text_purifier.py
The Text Purifier needs to call the GPT API.
Before fine-tuning, the target VLM must be deployed.
python ppo_blue_team.py
The fine-tuned Suffix Generator can be found in this Google Drive link.
For the current fine-tuned suffix generator, we adopted the default parameters to demonstrate the general applicability of our approach. To further improve the performance and stability of the suffix generator, one could systematically explore the reference policy and the hyperparameter β.
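To illustrate why the reference policy and β matter, the sketch below assumes the common KL-regularized PPO formulation, in which the reward for a generated suffix is reduced by β times an estimate of its divergence from the reference policy. The function name and the sample numbers are hypothetical and are not taken from ppo_blue_team.py.

```python
from typing import Sequence


def kl_penalized_reward(
    task_reward: float,
    logprobs_policy: Sequence[float],
    logprobs_ref: Sequence[float],
    beta: float,
) -> float:
    """Return task_reward minus beta times a sampled KL estimate.

    The KL estimate is the sum over generated tokens of
    log pi(token) - log pi_ref(token), a standard single-sample estimator.
    """
    if len(logprobs_policy) != len(logprobs_ref):
        raise ValueError("log-probability sequences must have equal length")
    kl_estimate = sum(lp - lr for lp, lr in zip(logprobs_policy, logprobs_ref))
    return task_reward - beta * kl_estimate


# Hypothetical values: a larger beta keeps the generator closer to the reference policy.
for beta in (0.0, 0.1, 0.5):
    shaped = kl_penalized_reward(1.0, [-1.0, -2.0], [-1.5, -2.5], beta)
    print(f"beta={beta}: shaped reward={shaped:.2f}")
```

Under this assumption, a small β lets the suffix generator drift far from the reference policy in pursuit of reward, while a large β keeps it close at the cost of slower improvement, which is why both choices are worth exploring together.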
🌟 If you find it helpful, please star this repository and cite our research:
@inproceedings{zhao2025bluesuffix,
title={BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks},
author={Yunhan Zhao and Xiang Zheng and Lin Luo and Yige Li and Xingjun Ma and Yu-Gang Jiang},
booktitle={ICLR},
year={2025}
}

In case of any questions, bugs, or suggestions, please feel free to open an issue.

