An Embarrassingly Simple Defense Against LLM Abliteration Attacks

25 May 2025

Harethah Shairah

Hasan Hammoud

Bernard Ghanem

G. Turkiyyah

ArXiv (abs)PDF HTML HuggingFace (5 upvotes)

Main:8 Pages

6 Figures

Bibliography:3 Pages

4 Tables

Appendix:1 Pages

Abstract

Large language models (LLMs) are typically aligned to comply with safety guidelines by refusing harmful instructions. A recent attack, termed abliteration, isolates and suppresses the single latent direction most responsible for refusal behavior, enabling the model to generate unethical content. We propose a defense that modifies how models generate refusals. We construct an extended-refusal dataset that contains harmful prompts with a full response that justifies the reason for refusal. We then fine-tune Llama-2-7B-Chat and Qwen2.5-Instruct (1.5B and 3B parameters) on our extended-refusal dataset, and evaluate the resulting systems on a set of harmful prompts. In our experiments, extended-refusal models maintain high refusal rates, dropping at most by 10%, whereas baseline models' refusal rates drop by 70-80% after abliteration. A broad evaluation of safety and utility shows that extended-refusal fine-tuning neutralizes the abliteration attack while preserving general performance.

View on arXiv

Comments on this paper