arXiv:2412.13705
Mitigating Adversarial Attacks in LLMs through Defensive Suffix Generation

18 December 2024
Minkyoung Kim, Yunha Kim, Hyeram Seo, Heejung Choi, Jiye Han, Gaeun Kee, Soyoung Ko, Hyoje Jung, Byeolhee Kim, Young-Hak Kim, Sanghyun Park, Tae Joon Jun
Abstract

Large language models (LLMs) have exhibited outstanding performance in natural language processing tasks. However, these models remain susceptible to adversarial attacks in which slight input perturbations can lead to harmful or misleading outputs. A gradient-based defensive suffix generation algorithm is designed to bolster the robustness of LLMs. By appending carefully optimized defensive suffixes to input prompts, the algorithm mitigates adversarial influences while preserving the models' utility. To enhance adversarial understanding, a novel total loss function $L_{\text{total}}$, combining a defensive loss $L_{\text{def}}$ and an adversarial loss $L_{\text{adv}}$, generates defensive suffixes more effectively. Experimental evaluations conducted on open-source LLMs such as Gemma-7B, Mistral-7B, Llama2-7B, and Llama2-13B show that the proposed method reduces the attack success rate (ASR) by an average of 11% compared to models without defensive suffixes. Additionally, the perplexity score of Gemma-7B decreased from 6.57 to 3.93 when applying the defensive suffix generated by OpenELM-270M. Furthermore, TruthfulQA evaluations demonstrate consistent improvements, with Truthfulness scores increasing by up to 10% across tested configurations. This approach significantly enhances the security of LLMs in critical applications without requiring extensive retraining.
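
The abstract does not spell out how $L_{\text{def}}$ and $L_{\text{adv}}$ are combined or how the suffix tokens are searched. The sketch below is one plausible reading, not the authors' implementation: it assumes an additive combination $L_{\text{total}} = L_{\text{def}} - \lambda L_{\text{adv}}$, a GCG/HotFlip-style one-hot gradient over the suffix tokens, and placeholder model and variable names throughout.

```python
# Minimal sketch (not the authors' code) of a gradient-guided defensive-suffix search.
# Assumptions: L_total = L_def - lambda * L_adv, a one-hot relaxation of the suffix
# tokens for gradient computation, and every model/variable name used below.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; swap in the LLM of interest
device = "cuda" if torch.cuda.is_available() else "cpu"

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).to(device).eval()
embed = model.get_input_embeddings()          # (vocab_size, d_model) lookup table
for p in model.parameters():                  # only the suffix needs gradients
    p.requires_grad_(False)

def continuation_loss(logits, target_ids):
    """Cross-entropy of the target continuation given the preceding context."""
    tgt_len = target_ids.shape[1]
    pred = logits[:, -tgt_len - 1:-1, :]      # logits that predict the target tokens
    return F.cross_entropy(pred.reshape(-1, pred.shape[-1]), target_ids.reshape(-1))

def suffix_token_gradients(prompt_ids, suffix_ids, safe_ids, harmful_ids, lam=1.0):
    """Gradient of L_total w.r.t. a one-hot relaxation of the defensive suffix.

    L_def : loss toward a safe/refusal continuation (to be minimized)
    L_adv : loss toward a harmful continuation (to be maximized, hence the minus sign)
    """
    one_hot = F.one_hot(suffix_ids, embed.num_embeddings).to(embed.weight.dtype)
    one_hot.requires_grad_(True)
    suffix_emb = one_hot @ embed.weight       # differentiable suffix embeddings

    def run(target_ids):
        ctx = torch.cat(
            [embed(prompt_ids).detach(), suffix_emb, embed(target_ids).detach()], dim=1
        )
        return continuation_loss(model(inputs_embeds=ctx).logits, target_ids)

    l_total = run(safe_ids) - lam * run(harmful_ids)   # hypothetical combination
    l_total.backward()
    return one_hot.grad                       # shape: (1, suffix_len, vocab_size)

# A greedy search would, at each step, take the most negative gradient entries per
# suffix position as candidate token swaps, re-evaluate L_total for each candidate
# suffix, and keep the best one.
```

The one-hot relaxation is the standard trick for getting token-level gradients from a discrete prompt; whether the paper uses this particular search, and with what value of $\lambda$, is not stated in the abstract.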
