The Great Contradiction Showdown: How Jailbreak and Stealth Wrestle in Vision-Language Models?

2 October 2024

Main: 8 Pages

11 Figures

Bibliography: 4 Pages

14 Tables

Appendix: 10 Pages

Abstract

Vision-Language Models (VLMs) have achieved remarkable performance on a variety of tasks, yet they remain vulnerable to jailbreak attacks that compromise safety and reliability. In this paper, we provide an information-theoretic framework for understanding the fundamental trade-off between the effectiveness of these attacks and their stealthiness. Drawing on Fano's inequality, we demonstrate how an attacker's success probability is intrinsically linked to the stealthiness of generated prompts. Building on this, we propose an efficient algorithm for detecting non-stealthy jailbreak attacks, offering significant improvements in model robustness. Experimental results highlight the tension between strong attacks and their detectability, providing insights into both adversarial strategies and defense mechanisms.
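
For reference, the standard textbook form of Fano's inequality, which the abstract invokes, is sketched below. The symbols here are generic assumptions and not the paper's own notation: $X$ is a random variable over a finite set $\mathcal{X}$, $Y$ is an observation, $\hat{X} = g(Y)$ is any estimate of $X$, and $P_e$ is the estimation error probability. How the paper maps attack success and prompt stealthiness onto these quantities is not stated in the abstract, so this is only the general inequality.

```latex
% Fano's inequality (general form; logarithms in base 2)
% P_e = \Pr(\hat{X} \neq X), H_b is the binary entropy function
\begin{align}
  H(X \mid Y) &\le H_b(P_e) + P_e \log_2\bigl(|\mathcal{X}| - 1\bigr), \\
  H_b(p) &= -p \log_2 p - (1 - p) \log_2 (1 - p).
\end{align}
% Since H_b(P_e) \le 1 and \log_2(|\mathcal{X}| - 1) \le \log_2 |\mathcal{X}|,
% a commonly used weaker consequence is a lower bound on the error probability:
\begin{equation}
  P_e \ge \frac{H(X \mid Y) - 1}{\log_2 |\mathcal{X}|}.
\end{equation}
```

Read informally, the less information the observation $Y$ carries about $X$ (the larger $H(X \mid Y)$), the higher the unavoidable error of any estimator. This is the kind of relationship that lets an error or success probability be tied to an information measure.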
