Classifier-free guidance in LLMs Safety

8 December 2024

Roman Smirnov


Main: 10 pages, 4 figures, 3 tables. Appendix: 2 pages.

Abstract

This paper describes LLM unlearning without a retain dataset, using the ORPO reinforcement learning method, with inference enhanced by modified classifier-free guidance (CFG). Unlearning improves significantly, without degrading the model, through direct training on synthetic replacement data in a CFG-aware training regime, with classifier-free guidance applied during inference. This article is an extended version of the NeurIPS 2024 LLM-PC submission, which was awarded second prize.
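For readers unfamiliar with classifier-free guidance at inference time, here is a minimal sketch of the standard logit-level formulation for language models. It is not the modified variant the abstract refers to, whose details are not given here. The function names (`log_softmax`, `cfg_scores`, `greedy_token`), the `guidance_scale` parameter, and the toy logits are all illustrative assumptions.

```python
import math


def log_softmax(logits):
    """Convert raw logits into log-probabilities (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cfg_scores(cond_logits, uncond_logits, guidance_scale):
    """Standard CFG mix: uncond + scale * (cond - uncond), in log-prob space.

    Assumption: this is the common textbook form, not the paper's
    modified guidance. scale = 1 recovers the conditional distribution,
    scale = 0 recovers the unconditional one, and scale > 1 pushes the
    scores further toward the conditional prediction.
    """
    cond = log_softmax(cond_logits)
    uncond = log_softmax(uncond_logits)
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


def greedy_token(scores):
    """Return the index of the highest-scoring token."""
    return max(range(len(scores)), key=scores.__getitem__)


# Toy next-token logits over a 3-token vocabulary (hypothetical values).
cond_logits = [2.0, 1.0, 0.0]    # model run with the guiding prompt
uncond_logits = [0.0, 1.0, 2.0]  # model run without it

guided = cfg_scores(cond_logits, uncond_logits, guidance_scale=1.5)
next_token = greedy_token(guided)
```

In practice the two logit vectors would come from two forward passes of the same model at each decoding step, one with and one without the conditioning prompt. The guided scores are unnormalized, so they would need to be renormalized before sampling.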
