THIELD is a lightweight preprocessing framework that protects Large Vision-Language Models (LVLMs) from adversarial multimodal attacks through fine-grained safety classification and adaptive response policies.
Fine-Grained Classification: Analyzes text-image pairs across 45+ safety categories to detect harmful intent concealed in seemingly benign prompts.
Adaptive Response Policies: Three-tier system with explicit actions:
- BLOCK: Hard refusal for dangerous content (self-harm, violence)
- REFRAME: Safe redirection to educational alternatives
- FORWARD: Unmodified processing for benign requests
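The three-tier policy above can be modeled as a simple category-to-action dispatch. A minimal sketch (all names here are hypothetical illustrations, not THIELD's actual API; the real classifier covers 45+ categories):

```python
from enum import Enum

class Action(Enum):
    BLOCK = "block"       # hard refusal for dangerous content
    REFRAME = "reframe"   # redirect to a safe, educational alternative
    FORWARD = "forward"   # pass the request through unmodified

# Hypothetical mapping from fine-grained safety categories to actions.
POLICY = {
    "self_harm": Action.BLOCK,
    "violence": Action.BLOCK,
    "hacking_howto": Action.REFRAME,
    "benign": Action.FORWARD,
}

def dispatch(category: str) -> Action:
    """Choose a response policy; unknown categories fall back to
    REFRAME as a conservative default."""
    return POLICY.get(category, Action.REFRAME)
```

Defaulting unknown categories to REFRAME rather than FORWARD keeps the system fail-safe when a new attack vector introduces a category the policy table has not seen.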
Model-Agnostic Design: Plug-and-play preprocessing that works with any LVLM without retraining; tested across LLaVA, LLaMA Vision, Qwen-VL, and others.
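The plug-and-play pattern can be sketched as a wrapper that runs the safety check before calling whichever LVLM the caller supplies, so the backend model never needs retraining (the classifier stub and function names below are illustrative assumptions, not THIELD's real interface):

```python
from typing import Callable

def classify(text: str, image_path: str) -> str:
    """Stand-in for THIELD's fine-grained safety classifier,
    which analyzes the text-image pair jointly."""
    return "self_harm" if "hurt myself" in text.lower() else "benign"

def safe_generate(model_fn: Callable[[str, str], str],
                  text: str, image_path: str) -> str:
    """Preprocess the prompt, then invoke the supplied LVLM callable."""
    category = classify(text, image_path)
    if category == "benign":
        return model_fn(text, image_path)               # FORWARD
    if category in {"self_harm", "violence"}:
        return "I can't help with that."                # BLOCK
    # REFRAME: redirect toward an educational framing of the topic
    return model_fn("Explain the risks of: " + text, image_path)
```

Because `model_fn` is just a callable taking a prompt and an image path, the same filter can wrap LLaVA, Qwen-VL, or any other backend without touching model weights.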
Across five benchmarks (FigStep, MMSafety, SIUO, AdvBench, FlowChart), THIELD consistently:
- Reduces jailbreak success rates
- Maintains model utility on benign tasks
- Incurs negligible computational overhead (<100ms per request)
- Extends easily to new attack vectors
pip install -r requirements.txt
# Generate responses
python my_scripts/scripts/generate.py generate --model llava-1.5 --dataset dataset/sampled.json
# Evaluate safety (requires OpenAI API key)
export EVAL_LM_MODEL=openai/gpt-5-mini
python my_scripts/scripts/evaluate.py --input results.json --mode threats
- Models: LLaVA, LLaMA Vision, Qwen-VL, GPT
- Evaluation: Elite, StrongReject, Threats modes
- Datasets: FigStep, MMSafety, SIUO, AdvBench, FlowChart
# Basic generation
python my_scripts/scripts/generate.py generate --model llava-1.5 --dataset dataset/sampled.json
# Safety-filtered generation
python my_scripts/scripts/generate.py agentic --model llava-1.5 --dataset dataset/sampled.json
# Evaluation
python my_scripts/scripts/evaluate.py --input results.json --mode threats
# End-to-end pipeline
python my_scripts/scripts/pipeline.py --models llava-1.5 --dataset dataset/sampled.json --mode agentic --eval-mode threats
- Python 3.8+, PyTorch 2.0+
- OpenAI API key for evaluation
- 8GB+ GPU memory