
NeuroStrike: Neuron-Level Attacks on Aligned LLMs

Main: 13 pages, 8 figures, 13 tables; Bibliography: 3 pages; Appendix: 4 pages
Abstract

Safety alignment is critical for the ethical deployment of large language models (LLMs), guiding them to avoid generating harmful or unethical content. Current alignment techniques, such as supervised fine-tuning and reinforcement learning from human feedback, remain fragile and can be bypassed by carefully crafted adversarial prompts. However, such attacks rely on trial and error, generalize poorly across models, and are limited in scalability and reliability.
