Semantic Representation Attack against Aligned Large Language Models

Main: 10 pages · Appendix: 22 pages · Bibliography: 5 pages · 3 figures · 15 tables
Abstract
Large Language Models (LLMs) increasingly employ alignment techniques to prevent harmful outputs. Despite these safeguards, attackers can circumvent them by crafting prompts that induce LLMs to generate harmful content.