
E$^2$AT: Multimodal Jailbreak Defense via Dynamic Joint Optimization for Multimodal Large Language Models

Comments: 12 pages (main text) + 2 pages (bibliography), 4 figures, 12 tables
Abstract

Considerable research effort has been devoted to making Multimodal Large Language Models (MLLMs) robust against jailbreak attacks. However, existing methods for improving MLLMs' robustness still face two critical challenges: \ding{172} how to efficiently tune massive weight parameters, and \ding{173} how to ensure robustness against attacks across both the visual and textual modalities. To this end, we propose an \textbf{E}fficient \textbf{E}nd-to-end \textbf{A}dversarial \textbf{T}raining (E$^2$AT) framework that defends against both visual and textual adversarial attacks. Specifically, on the visual side, E$^2$AT incorporates an efficient projector-based AT module that aligns attack samples at the feature level. For the training objective, we propose a Dynamic Joint Multimodal Optimization (DJMO) strategy that enhances generalization against jailbreak attacks by dynamically adjusting the weights between the normal and adversarial objectives. Extensive experiments are conducted with five major jailbreak attack methods across three mainstream MLLMs. Results demonstrate that E$^2$AT achieves state-of-the-art performance, outperforming existing baselines by an average margin of 34\% across the text and image modalities while maintaining clean-task performance. Furthermore, evaluations on real-world embodied intelligent systems highlight the practical applicability of E$^2$AT, paving the way for more secure and reliable multimodal systems. Our code is available at \href{this https URL}{\textcolor{red}{this https URL\_568}}.
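The DJMO strategy described above combines a normal (clean) objective and an adversarial objective with dynamically adjusted weights. The abstract does not specify the weighting rule, so the sketch below uses a hypothetical linear ramp schedule purely for illustration; the function name `djmo_loss` and the schedule itself are assumptions, not the paper's actual formulation.

```python
def djmo_loss(loss_clean: float, loss_adv: float,
              step: int, total_steps: int) -> float:
    """Blend clean and adversarial objectives with a dynamic weight.

    Illustrative sketch only: the weight alpha ramps linearly from 0 to 1
    over training, shifting emphasis from the clean objective to the
    adversarial one. The real DJMO weighting rule is not given in the
    abstract and is assumed here.
    """
    alpha = step / total_steps  # assumed linear schedule
    return (1.0 - alpha) * loss_clean + alpha * loss_adv
```

For example, early in training (`step=0`) the combined loss equals the clean loss, and at the end (`step=total_steps`) it equals the adversarial loss, with a smooth interpolation in between.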
