An Embarrassingly Simple Defense Against LLM Abliteration Attacks

25 May 2025

Papers citing "An Embarrassingly Simple Defense Against LLM Abliteration Attacks"


Turning the Spell Around: Lightweight Alignment Amplification via Rank-One Safety Injection
Authors: Harethah Shairah, Hasan Hammoud, G. Turkiyyah, Bernard Ghanem
Tag: LLMSV
Date: 28 Aug 2025

Towards Safer Pretraining: Analyzing and Filtering Harmful Content in Webscale datasets for Responsible LLMs
Authors: Sai Krishna Mendu, Harish Yenala, Aditi Gulati, Shanu Kumar, Parag Agrawal
Date: 04 May 2025