Collapse of Irrelevant Representations (CIR) Ensures Robust and Non-Disruptive LLM Unlearning

Main: 9 pages · Appendix: 3 pages · Bibliography: 4 pages · 8 figures · 2 tables
Abstract

Current unlearning techniques and safety training consistently fail to remove dangerous knowledge from language models. We analyze the root causes of this failure and propose a highly selective technique that unlearns robustly without disrupting general performance.
