NeuroStrike: Neuron-Level Attacks on Aligned LLMs

Main: 13 pages, 8 figures, 13 tables; Bibliography: 3 pages; Appendix: 4 pages
Abstract
Safety alignment is critical for the ethical deployment of large language models (LLMs), guiding them to avoid generating harmful or unethical content. Current alignment techniques, such as supervised fine-tuning and reinforcement learning from human feedback, remain fragile and can be bypassed by carefully crafted adversarial prompts. However, such attacks rely on trial and error, lack generalizability across models, and suffer from limited scalability and reliability.
