Code for our paper: IDEATOR: Jailbreaking Large Vision-Language Models Using Themselves
The IDEATOR framework uses large Vision-Language Models (VLMs) themselves as red-team models to autonomously generate malicious multimodal prompts for black-box jailbreak attacks. The core insight is that a VLM can exploit its own understanding of multimodal inputs to craft adversarial prompts tailored to a specific malicious objective. Concretely, the framework uses a capable VLM to generate targeted jailbreak text prompts, which are then paired with visually aligned jailbreak images produced by a state-of-the-art diffusion model. By combining these multimodal pairs, IDEATOR achieves high attack effectiveness and strong transferability across VLM architectures. This automated process exposes vulnerabilities of VLMs under black-box conditions and provides a practical tool for evaluating and improving their safety.
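For orientation, the sketch below shows a minimal, hypothetical version of the attack loop described above. The interfaces (`AttackerVLM`, `DiffusionModel`, `VictimVLM`, `Judge`) and the iterative refine-and-query structure are placeholders introduced purely for illustration, not the released implementation; the actual prompting strategy and model backends are described in the paper and the attack scripts.

```python
from dataclasses import dataclass
from typing import List, Optional, Protocol, Tuple

class AttackerVLM(Protocol):
    """Red-team VLM that proposes the next multimodal attack attempt (placeholder interface)."""
    def propose(self, goal: str, history: List[Tuple[str, str, str]]) -> Tuple[str, str]:
        """Return (jailbreak_text_prompt, image_description) for the current attempt."""
        ...

class DiffusionModel(Protocol):
    """Text-to-image model that renders the proposed jailbreak image (placeholder interface)."""
    def generate(self, image_description: str): ...

class VictimVLM(Protocol):
    """Black-box target VLM; only its text responses are observable (placeholder interface)."""
    def respond(self, text_prompt: str, image) -> str: ...

class Judge(Protocol):
    """Decides whether a victim response fulfils the attack goal (placeholder interface)."""
    def is_jailbroken(self, goal: str, response: str) -> bool: ...

@dataclass
class AttackResult:
    success: bool
    text_prompt: Optional[str] = None
    image_description: Optional[str] = None
    response: Optional[str] = None

def ideator_attack(goal: str, attacker: AttackerVLM, diffusion: DiffusionModel,
                   victim: VictimVLM, judge: Judge, max_iters: int = 10) -> AttackResult:
    """Iteratively query a black-box victim VLM with attacker-generated multimodal prompts."""
    history: List[Tuple[str, str, str]] = []
    for _ in range(max_iters):
        # 1. The attacker VLM drafts a jailbreak text prompt plus a matching image description.
        text_prompt, image_description = attacker.propose(goal, history)
        # 2. A diffusion model renders the visually aligned jailbreak image.
        image = diffusion.generate(image_description)
        # 3. The black-box victim is queried with the multimodal pair.
        response = victim.respond(text_prompt, image)
        # 4. A judge scores the response; failed attempts are fed back for the next round.
        if judge.is_jailbroken(goal, response):
            return AttackResult(True, text_prompt, image_description, response)
        history.append((text_prompt, image_description, response))
    return AttackResult(False)
```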
- Prepare the pretrained weights for MiniGPT-4 (Vicuna-13B v0): please follow the guide from the MiniGPT-4 repository to obtain the Vicuna weights, then set the path to the Vicuna weights in the model config file here.
- Get the MiniGPT-4 (13B version) checkpoint: download it from here, then set the path to the pretrained checkpoint in minigpt4_eval.yaml.
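For reference, the two path settings typically look like the excerpt below. The keys (`llama_model`, `ckpt`) and the placeholder paths follow the upstream MiniGPT-4 configs and are an illustrative assumption here; verify the exact field names against your local copies of the config files.

```yaml
# Excerpt 1 — model config file: point llama_model at the merged Vicuna-13B v0 weights.
model:
  llama_model: "/path/to/vicuna-13b-v0/"
---
# Excerpt 2 — minigpt4_eval.yaml: point ckpt at the downloaded MiniGPT-4 (13B) checkpoint.
model:
  ckpt: "/path/to/pretrained_minigpt4.pth"
```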
python ideator_attack_minigpt4.py --cfg-path minigpt4_eval.yaml --gpu-id 0
We found that stronger base models significantly increase jailbreak success rates. For instance, with its safety settings disabled, Gemini can efficiently jailbreak commercial models: combining Gemini with Stable Diffusion 3.5 Large, we achieved a 46% success rate in jailbreaking GPT-4o. We have released a demo showcasing how Gemini generates jailbreak image-text prompts; however, for safety reasons, we have not made the complete codebase publicly available.
python ideator_attack_gemini_demo.py