Collapse of Irrelevant Representations (CIR) Ensures Robust and Non-Disruptive LLM Unlearning

Main: 9 pages, 8 figures, 2 tables; Bibliography: 4 pages; Appendix: 3 pages
Abstract
Current unlearning techniques and safety training consistently fail to remove dangerous knowledge from language models. We analyze the root causes of these failures and propose a highly selective technique that unlearns robustly without disrupting general performance.
