BlueSuffix

Code for Papar "BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks".

📌 Abstract

In this paper, we focus on black-box defense for VLMs against jailbreak attacks.Existing black-box defense methods are either unimodal or bimodal. Unimodal methods enhance either the vision or language module of the VLM, while bimodal methods robustify the model through text-image representation realignment. However, these methods suffer from two limitations: 1) they fail to fully exploit the cross-modal information, or 2) they degrade the model performance on benign inputs. To address these limitations, we propose a novel blue-team method BlueSuffix that defends target VLMs against jailbreak attacks without compromising its performance under black-box setting. BlueSuffix includes three key components: 1) a visual purifier against jailbreak images, 2) a textual purifier against jailbreak texts, and 3) a blue-team suffix generator using reinforcement fine-tuning for enhancing cross-modal robustness. We empirically show on four VLMs (LLaVA, MiniGPT-4, InstructionBLIP, and Gemini) and four safety benchmarks (Harmful Instruction, AdvBench, MM-SafetyBench, and RedTeam-2K) that BlueSuffix outperforms the baseline defenses by a significant margin. Our BlueSuffix opens up a promising direction for defending VLMs against jailbreak attacks.

Usage

Requirements

Python 3.11.5

pip install -r requirements.txt

Model depolyment

Please following the offcial guidelines.

[1]LLaVA: https://github.com/haotian-liu/LLaVA

[2]MiniGPT: https://github.com/Vision-CAIR/MiniGPT-4

Image Purifier

cd Image_Purifier

python image_purifier.py --config configs/diffpure.yml --log_dir path/to/your/fold

The code of Image Purifier is based on DiffPure.

Text Purifier

python text_purifier.py

The Text Purifier needs to call the GPT API.

Suffix Generator

Before fine-tuning, deploying target VLM is needed.

python ppo_blue_team.py

The fine-tuned Suffix Generator can be found in this Google Drive link.

For the current fine-tuned suffix generator, we adopted the default parameters to demonstrate the general applicability of our approach. To further improve the performance and stability of the suffix generator, one could systematically explore the reference policy and the hyperparameter β.

Citing BlueSuffix

🌟 If you find it helpful, please star this repository and cite our research:

@inproceedings{zhao2025bluesuffix,
title={BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks},
author={Yunhan Zhao and Xiang Zheng and Lin Luo and Yige Li and Xingjun Ma and Yu-Gang Jiang},
booktitle={ICLR},
year={2025}
}

In case of any questions, bugs, or suggestions, please feel free to open an issue.

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
Image_Purifier		Image_Purifier
LLaVA		LLaVA
README.md		README.md
framework.jpg		framework.jpg
generator.jpg		generator.jpg
ppo_blue_team.py		ppo_blue_team.py
requirements.txt		requirements.txt
text_purifier.py		text_purifier.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

BlueSuffix

📌 Abstract

Usage

Requirements

Model depolyment

Image Purifier

Text Purifier

Suffix Generator

Citing BlueSuffix

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

BlueSuffix

📌 Abstract

Usage

Requirements

Model depolyment

Image Purifier

Text Purifier

Suffix Generator

Citing BlueSuffix

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages