Universal Adversarial Attack on Aligned Multimodal LLMs

11 February 2025
Temurbek Rahmatullaev
Polina Druzhinina
Matvey Mikhalchuk
Andrey Kuznetsov
Anton Razzhigaev
Abstract

We propose a universal adversarial attack on multimodal Large Language Models (LLMs) that leverages a single optimized image to override alignment safeguards across diverse queries and even multiple models. By backpropagating through the vision encoder and language head, we craft a synthetic image that forces the model to respond with a targeted phrase (e.g., "Sure, here it is") or otherwise unsafe content, even for harmful prompts. In experiments on the SafeBench benchmark, our method achieves significantly higher attack success rates than existing baselines, including text-only universal prompts (e.g., up to 93% on certain models). We further demonstrate cross-model transferability by training on several multimodal LLMs simultaneously and testing on unseen architectures. Additionally, a multi-answer variant of our approach produces more natural-sounding (yet still malicious) responses. These findings underscore critical vulnerabilities in current multimodal alignment and call for more robust adversarial defenses. We will release code and datasets under the Apache-2.0 license. Warning: some content generated by Multimodal LLMs in this paper may be offensive to some readers.

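The following is a minimal, hypothetical sketch of the kind of optimization loop the abstract describes: a single image is optimized by gradient descent so that, across many harmful prompts, the model assigns high probability to a fixed target prefix such as "Sure, here it is". The wrapper model.forward_with_image, the image resolution, and all hyperparameters are assumptions made for illustration; the paper's actual models, loss, and training setup may differ.

import torch
import torch.nn.functional as F

def universal_image_attack(model, tokenizer, prompts,
                           target="Sure, here it is",
                           steps=1000, lr=1e-2):
    """Optimize one image to elicit `target` for every prompt (illustrative sketch)."""
    device = next(model.parameters()).device
    # Start from random noise and optimize the raw pixels directly.
    image = torch.rand(1, 3, 336, 336, device=device, requires_grad=True)
    target_ids = tokenizer(target, return_tensors="pt").input_ids.to(device)
    opt = torch.optim.Adam([image], lr=lr)

    for _ in range(steps):
        opt.zero_grad()
        loss = 0.0
        for prompt in prompts:
            prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
            # Assumed helper: differentiable forward pass that fuses the image with
            # the text prompt and returns teacher-forced logits over the target tokens.
            logits = model.forward_with_image(image, prompt_ids, target_ids)
            loss = loss + F.cross_entropy(logits.view(-1, logits.size(-1)),
                                          target_ids.view(-1))
        loss = loss / len(prompts)
        loss.backward()  # gradients flow back through the vision encoder to the pixels
        opt.step()
        with torch.no_grad():
            image.clamp_(0.0, 1.0)  # keep the image in a valid pixel range
    return image.detach()

Averaging the loss over a batch of prompts (and, in the transfer setting, over several models) is what makes the resulting image "universal" rather than prompt-specific.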
@article{rahmatullaev2025_2502.07987,
  title={Universal Adversarial Attack on Aligned Multimodal LLMs},
  author={Temurbek Rahmatullaev and Polina Druzhinina and Matvey Mikhalchuk and Andrey Kuznetsov and Anton Razzhigaev},
  journal={arXiv preprint arXiv:2502.07987},
  year={2025}
}