Semantic Representation Attack against Aligned Large Language Models

Main: 10 pages · Appendix: 22 pages · Bibliography: 5 pages · 3 figures · 15 tables
Abstract
Large Language Models (LLMs) increasingly employ alignment techniques to prevent harmful outputs. Despite these safeguards, attackers can circumvent them by crafting prompts that induce LLMs to generate harmful content.