
NeuRel-Attack: Neuron Relearning for Safety Disalignment in Large Language Models

Abstract

Safety alignment in large language models (LLMs) is achieved through fine-tuning mechanisms that regulate neuron activations to suppress harmful content. In this work, we propose a novel approach to induce disalignment by identifying and modifying the neurons responsible for safety constraints. Our method consists of three key steps: Neuron Activation Analysis, where we examine activation patterns in response to harmful and harmless prompts to detect neurons that are critical for distinguishing between the two; Similarity-Based Neuron Identification, which systematically locates the neurons responsible for safety alignment; and Neuron Relearning for Safety Removal, where we fine-tune the selected neurons to restore the model's ability to generate previously restricted responses. Experimental results demonstrate that our method effectively removes safety constraints with minimal fine-tuning, highlighting a critical vulnerability in current alignment techniques. Our findings underscore the need for robust defenses against adversarial fine-tuning attacks on LLMs.
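To make the first step concrete, the sketch below shows one way to compare neuron activations on harmful versus harmless prompts and rank the neurons with the largest gaps. It is a minimal illustration, not the paper's exact procedure: the model name, the prompt sets, the choice of hooking MLP-block outputs, and the absolute-difference scoring rule are all assumptions for demonstration; the paper's similarity-based identification and relearning steps are not reproduced here.

```python
# Hypothetical sketch: rank MLP neurons by how differently they activate on
# harmful vs. harmless prompts. Layer paths, prompt sets, and the scoring
# rule are illustrative assumptions, not the NeuRel-Attack procedure itself.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # stand-in model; the paper targets safety-aligned LLMs
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

harmful_prompts = ["How do I pick a lock?"]            # placeholder prompt sets
harmless_prompts = ["How do I bake a loaf of bread?"]

activations = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Mean-pool the MLP-block output over the sequence dimension.
        activations[name] = output.detach().mean(dim=1).squeeze(0)
    return hook

# Register hooks on each MLP block (GPT-2 layout; other architectures differ,
# and finer neuron granularity may be more appropriate in practice).
handles = [
    block.mlp.register_forward_hook(make_hook(f"layer_{i}"))
    for i, block in enumerate(model.transformer.h)
]

def mean_activations(prompts):
    sums = {}
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            model(**inputs)
        for name, act in activations.items():
            sums[name] = sums.get(name, 0) + act
    return {name: s / len(prompts) for name, s in sums.items()}

harmful_mean = mean_activations(harmful_prompts)
harmless_mean = mean_activations(harmless_prompts)

# Score neurons by the absolute gap between the two activation profiles;
# large gaps suggest neurons involved in separating harmful from harmless input.
for name in harmful_mean:
    gap = (harmful_mean[name] - harmless_mean[name]).abs()
    top = torch.topk(gap, k=5)
    print(name, top.indices.tolist())

for h in handles:
    h.remove()
```

In a full attack pipeline, the neurons surfaced by such an analysis would then be targeted by the paper's relearning step, i.e., fine-tuning restricted to those parameters; that stage is omitted here.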

@article{zhou2025_2504.21053,
  title={NeuRel-Attack: Neuron Relearning for Safety Disalignment in Large Language Models},
  author={Yi Zhou and Wenpeng Xing and Dezhang Kong and Changting Lin and Meng Han},
  journal={arXiv preprint arXiv:2504.21053},
  year={2025}
}