Effective Black-Box Multi-Faceted Attacks Breach Vision Large Language Model Guardrails

9 February 2025
Yijun Yang
Lichao Wang
Xiao Yang
Lanqing Hong
Jun Zhu
Abstract

Vision Large Language Models (VLLMs) integrate visual data processing, expanding their real-world applications, but also increasing the risk of generating unsafe responses. In response, leading companies have implemented Multi-Layered safety defenses, including alignment training, safety system prompts, and content moderation. However, their effectiveness against sophisticated adversarial attacks remains largely unexplored. In this paper, we propose MultiFaceted Attack, a novel attack framework designed to systematically bypass Multi-Layered Defenses in VLLMs. It comprises three complementary attack facets: Visual Attack that exploits the multimodal nature of VLLMs to inject toxic system prompts through images; Alignment Breaking Attack that manipulates the model's alignment mechanism to prioritize the generation of contrasting responses; and Adversarial Signature that deceives content moderators by strategically placing misleading information at the end of the response. Extensive evaluations on eight commercial VLLMs in a black-box setting demonstrate that MultiFaceted Attack achieves a 61.56% attack success rate, surpassing state-of-the-art methods by at least 42.18%.
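The Visual Attack facet described above relies on a general property of VLLMs: the vision encoder will read and act on text embedded in an image, which lets instructions bypass text-only input filters. The snippet below is a minimal sketch of that general mechanism (rendering instruction text onto an image with Pillow). It is illustrative only, not the paper's actual attack pipeline; the prompt text and output filename are hypothetical placeholders.

# Illustrative sketch: render instruction text into an image so a VLLM's
# vision encoder can read it. This shows the generic mechanism of image-based
# prompt injection, not the MultiFaceted Attack implementation itself.
from PIL import Image, ImageDraw, ImageFont

def text_to_image(prompt_text: str, width: int = 768, height: int = 256) -> Image.Image:
    """Draw plain text onto a white canvas, with simple word wrapping."""
    img = Image.new("RGB", (width, height), color="white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()
    words, lines, line = prompt_text.split(), [], ""
    for word in words:
        candidate = (line + " " + word).strip()
        if draw.textlength(candidate, font=font) > width - 20:
            # Current line is full; start a new one with this word.
            lines.append(line)
            line = word
        else:
            line = candidate
    lines.append(line)
    for i, row in enumerate(lines):
        draw.text((10, 10 + i * 14), row, fill="black", font=font)
    return img

if __name__ == "__main__":
    # Benign placeholder text; an attacker would instead embed a malicious system prompt.
    text_to_image("Example instruction text rendered as an image.").save("injected_prompt.png")

The resulting image would be attached to an otherwise innocuous text query; because safety system prompts and moderation often inspect only the text channel, the embedded instructions can reach the model unfiltered.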

@article{yang2025_2502.05772,
  title={Effective Black-Box Multi-Faceted Attacks Breach Vision Large Language Model Guardrails},
  author={Yijun Yang and Lichao Wang and Xiao Yang and Lanqing Hong and Jun Zhu},
  journal={arXiv preprint arXiv:2502.05772},
  year={2025}
}