BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks

28 October 2024
Yunhan Zhao, Xiang Zheng, Lin Luo, Yige Li, Xingjun Ma, Yu-Gang Jiang
Topics: VLM, AAML
Abstract

In this paper, we focus on black-box defense for VLMs against jailbreak attacks. Existing black-box defense methods are either unimodal or bimodal. Unimodal methods enhance either the vision or the language module of the VLM, while bimodal methods robustify the model through text-image representation realignment. However, these methods suffer from two limitations: 1) they fail to fully exploit cross-modal information, or 2) they degrade the model's performance on benign inputs. To address these limitations, we propose a novel blue-team method, BlueSuffix, that defends target VLMs against jailbreak attacks in the black-box setting without compromising their performance. BlueSuffix includes three key components: 1) a visual purifier against jailbreak images, 2) a textual purifier against jailbreak texts, and 3) a blue-team suffix generator trained via reinforcement fine-tuning to enhance cross-modal robustness. We empirically show on four VLMs (LLaVA, MiniGPT-4, InstructBLIP, and Gemini) and four safety benchmarks (Harmful Instruction, AdvBench, MM-SafetyBench, and RedTeam-2K) that BlueSuffix outperforms baseline defenses by a significant margin. Our BlueSuffix opens up a promising direction for defending VLMs against jailbreak attacks. Code is available at this https URL.
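The three components described above compose into a simple inference-time pipeline wrapped around the black-box VLM. The following minimal Python sketch shows that flow; the names (visual_purifier, textual_purifier, suffix_generator, vlm) are illustrative assumptions for exposition, not the authors' released API, and the reinforcement fine-tuning of the suffix generator is out of scope here.

# Conceptual sketch of a BlueSuffix-style defense pipeline, per the abstract.
# All names below are hypothetical placeholders, not the authors' released
# code; the target VLM is queried strictly as a black box.

def defend_and_query(vlm, image, prompt,
                     visual_purifier, textual_purifier, suffix_generator):
    """Purify both modalities, append a blue-team suffix, then query the VLM."""
    # 1) Visual purifier: remove potential adversarial perturbations from the
    #    input image (e.g., with a denoising model).
    clean_image = visual_purifier.purify(image)

    # 2) Textual purifier: rewrite the prompt to strip jailbreak phrasing
    #    while preserving the user's benign intent (e.g., with an LLM rewriter).
    clean_prompt = textual_purifier.purify(prompt)

    # 3) Blue-team suffix generator: a generator fine-tuned with reinforcement
    #    learning to emit a suffix that steers the target VLM toward safe
    #    responses, conditioned on both purified modalities.
    suffix = suffix_generator.generate(clean_image, clean_prompt)

    # Black-box query: only the VLM's input/output interface is used; no
    # access to its weights or gradients is assumed.
    return vlm(clean_image, clean_prompt + " " + suffix)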

@article{zhao2025_2410.20971,
  title={BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks},
  author={Yunhan Zhao and Xiang Zheng and Lin Luo and Yige Li and Xingjun Ma and Yu-Gang Jiang},
  journal={arXiv preprint arXiv:2410.20971},
  year={2025}
}