
IDEATOR

Code for our paper: IDEATOR: Jailbreaking Large Vision-Language Models Using Themselves

The IDEATOR framework leverages large Vision-Language Models (VLMs) as powerful red-team models to autonomously generate malicious multimodal prompts for black-box jailbreak attacks. The core insight of IDEATOR is that VLMs can exploit their own understanding of multimodal inputs to craft adversarial prompts tailored to a specific malicious objective. Concretely, the framework uses a capable VLM to generate targeted jailbreak text prompts, which are then paired with visually aligned jailbreak images produced by a state-of-the-art diffusion model. By combining these multimodal pairs, IDEATOR achieves high effectiveness and transferability across different VLM architectures. This automated process exposes specific vulnerabilities in VLMs under black-box conditions, providing a critical tool for evaluating and improving the safety of these models.
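At a high level, the attack can be viewed as the iterative loop sketched below. The callables attacker_vlm, text_to_image, target_vlm, and judge are hypothetical placeholders for the red-team VLM, the diffusion model, the black-box target, and a success evaluator; this is only an illustration of the workflow described above, not the released implementation.

# Structural sketch only: every model callable here is a hypothetical stand-in.
def ideator_attack(goal, attacker_vlm, text_to_image, target_vlm, judge, max_iters=5):
    """Iteratively generate multimodal jailbreak prompts and refine them from feedback."""
    feedback = None
    for _ in range(max_iters):
        # The red-team VLM proposes a text prompt and an image description for the goal.
        text_prompt, image_description = attacker_vlm(goal, feedback)
        # A diffusion model renders the visually aligned jailbreak image.
        image = text_to_image(image_description)
        # Query the black-box target VLM with the multimodal pair.
        response = target_vlm(image, text_prompt)
        # A judge decides success and returns feedback for the next refinement round.
        success, feedback = judge(goal, response)
        if success:
            return text_prompt, image, response
    return None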


Basic Setup

  1. Prepare the pretrained weights for MiniGPT-4 (Vicuna-13B v0): follow the guide in the MiniGPT-4 repository to obtain the Vicuna weights, then set the path to the Vicuna weights in the model config file here. Next, download the MiniGPT-4 (13B version) checkpoint from here and set the path to this pretrained checkpoint in minigpt4_eval.yaml; a sketch of the relevant fields is shown below.
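For reference, the relevant fields look roughly like the following (a minimal sketch assuming the standard MiniGPT-4 config layout; the paths are placeholders to replace with your own):

# In the MiniGPT-4 model config file (points to the merged Vicuna-13B v0 weights):
llama_model: "/path/to/vicuna-13b-v0/"

# In minigpt4_eval.yaml (points to the pretrained MiniGPT-4 13B checkpoint):
model:
  ckpt: "/path/to/pretrained_minigpt4_13b.pth"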

Use MiniGPT-4 to jailbreak

python ideator_attack_minigpt4.py --cfg-path minigpt4_eval.yaml  --gpu-id 0

Use Gemini to jailbreak (demo)

We found that stronger base models significantly increase jailbreaking success rates. For instance, Gemini, with safety settings disabled, can efficiently jailbreak commercial models. As an example, we used Gemini combined with Stable Diffusion 3.5 Large to achieve a 46% success rate in jailbreaking GPT-4o. We have released a demo showcasing how Gemini generates jailbreak image-text prompts. However, for safety reasons, we have not made the complete codebase publicly available.

python ideator_attack_gemini_demo.py
