adaren100/THIELD

THIELD: Model-Agnostic Safety Shield for Vision-Language Models

THIELD is a lightweight preprocessing framework that protects Large Vision-Language Models (LVLMs) from adversarial multimodal attacks through fine-grained safety classification and adaptive response policies.

Key Features

Fine-Grained Classification: Analyzes text-image pairs across 45+ safety categories to detect concealed harmful intents in benign prompts.

Adaptive Response Policies: Three-tier system with explicit actions:

  • BLOCK: Hard refusal for dangerous content (self-harm, violence)
  • REFRAME: Safe redirection to educational alternatives
  • FORWARD: Unmodified processing for benign requests
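The three-tier dispatch can be sketched roughly as a lookup from classifier category to action. This is a minimal illustration with hypothetical category names and a hypothetical `route` helper, not THIELD's actual policy table:

```python
from enum import Enum

class Policy(Enum):
    BLOCK = "block"      # hard refusal
    REFRAME = "reframe"  # redirect to a safe, educational framing
    FORWARD = "forward"  # pass through unmodified

# Hypothetical category-to-policy mapping for illustration only;
# THIELD defines its own table over 45+ safety categories.
POLICY_TABLE = {
    "self_harm": Policy.BLOCK,
    "violence": Policy.BLOCK,
    "hacking": Policy.REFRAME,
    "benign": Policy.FORWARD,
}

def route(category: str) -> Policy:
    # Unrecognized categories fall back to the safest action.
    return POLICY_TABLE.get(category, Policy.BLOCK)
```

Defaulting unknown categories to BLOCK keeps the fail-safe direction conservative.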

Model-Agnostic Design: Plug-and-play preprocessing that works with any LVLM without retraining; tested across LLaVA, LLaMA Vision, Qwen-VL, and others.
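Because the shield runs as preprocessing, it can wrap any model's generate function without touching its weights. A minimal sketch, assuming hypothetical `classify` and `generate` callables (the real framework's interfaces may differ):

```python
from typing import Callable

def shielded_generate(
    classify: Callable[[str, bytes], str],  # safety classifier: (text, image) -> category
    generate: Callable[[str, bytes], str],  # any LVLM's generate function
    prompt: str,
    image: bytes,
) -> str:
    # Classify the text-image pair first, then act on the result.
    category = classify(prompt, image)
    if category in {"self_harm", "violence"}:
        # BLOCK: refuse without ever calling the model.
        return "I can't help with that request."
    if category != "benign":
        # REFRAME: rewrite the prompt toward a safe, educational framing.
        prompt = f"From a safety-education perspective, explain: {prompt}"
    # FORWARD (or reframed): delegate to the underlying LVLM unchanged.
    return generate(prompt, image)
```

Since `generate` is an opaque callable, the same wrapper applies to LLaVA, Qwen-VL, or an API-backed model alike.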

Performance

Across five benchmarks (FigStep, MMSafety, SIUO, AdvBench, FlowChart), THIELD consistently:

  • Reduces jailbreak success rates
  • Maintains model utility on benign tasks
  • Incurs negligible computational overhead (<100ms per request)
  • Extends easily to new attack vectors

Quick Start

pip install -r requirements.txt

# Generate responses
python my_scripts/scripts/generate.py generate --model llava-1.5 --dataset dataset/sampled.json

# Evaluate safety (requires OpenAI API key)
export EVAL_LM_MODEL=openai/gpt-5-mini
python my_scripts/scripts/evaluate.py --input results.json --mode threats

Features

  • Models: LLaVA, LLaMA Vision, Qwen-VL, GPT
  • Evaluation: Elite, StrongReject, Threats modes
  • Datasets: FigStep, MMSafety, SIUO, AdvBench, FlowChart

Commands

# Basic generation
python my_scripts/scripts/generate.py generate --model llava-1.5 --dataset dataset/sampled.json

# Safety-filtered generation
python my_scripts/scripts/generate.py agentic --model llava-1.5 --dataset dataset/sampled.json

# Evaluation
python my_scripts/scripts/evaluate.py --input results.json --mode threats

# End-to-end pipeline
python my_scripts/scripts/pipeline.py --models llava-1.5 --dataset dataset/sampled.json --mode agentic --eval-mode threats

Requirements

  • Python 3.8+, PyTorch 2.0+
  • OpenAI API key for evaluation
  • 8GB+ GPU memory
