
Human-Readable Adversarial Prompts: An Investigation into LLM Vulnerabilities Using Situational Context

Abstract

Previous studies that uncovered vulnerabilities in large language models (LLMs) frequently employed nonsensical adversarial prompts. However, such prompts can now be readily identified by automated detection techniques. To further strengthen adversarial attacks, we focus on human-readable adversarial prompts, which pose a more realistic and potent threat. Our key contributions are (1) situation-driven attacks that leverage movie scripts as context to create human-readable prompts that successfully deceive LLMs, (2) adversarial suffix conversion, which transforms nonsensical adversarial suffixes into independent, meaningful text, and (3) AdvPrompter with p-nucleus sampling, a method to generate diverse, human-readable adversarial suffixes, improving attack efficacy against models such as GPT-3.5 and Gemma 7B.
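
The third contribution augments AdvPrompter's suffix generation with p-nucleus (top-p) sampling to diversify candidate suffixes. Below is a minimal PyTorch sketch of top-p sampling over a next-token logits vector; the function name, the threshold value, and the assumption of a raw 1-D logits tensor are illustrative and not taken from the paper's implementation.

import torch

def p_nucleus_sample(logits: torch.Tensor, p: float = 0.9) -> int:
    """Sample one token id from the smallest set of tokens whose
    cumulative probability exceeds p (top-p / nucleus sampling)."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep tokens up to and including the first one whose cumulative
    # probability crosses p; drop the low-probability tail.
    cutoff = int(torch.searchsorted(cumulative, p).item()) + 1
    kept_probs = sorted_probs[:cutoff]
    kept_probs = kept_probs / kept_probs.sum()  # renormalize the nucleus
    choice = torch.multinomial(kept_probs, num_samples=1)
    return int(sorted_ids[choice].item())

Sampling repeatedly from the truncated distribution, rather than decoding greedily, is what yields multiple distinct human-readable suffixes for the same prompt.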

@article{das2025_2412.16359,
  title={Human-Readable Adversarial Prompts: An Investigation into LLM Vulnerabilities Using Situational Context},
  author={Nilanjana Das and Edward Raff and Manas Gaur},
  journal={arXiv preprint arXiv:2412.16359},
  year={2025}
}