196
v1v2 (latest)

Prompt Attacks Reveal Superficial Knowledge Removal in Unlearning Methods

Main:5 Pages
6 Figures
Bibliography:3 Pages
17 Tables
Appendix:11 Pages
Abstract

In this work, we demonstrate that certain machine unlearning methods may fail under straightforward prompt attacks. We systematically evaluate eight unlearning techniques across three model families using output-based, logit-based, and probe analysis to assess the extent to which supposedly unlearned knowledge can be retrieved. While methods like RMU and TAR exhibit robust unlearning, ELM remains vulnerable to specific prompt attacks (e.g., prepending Hindi filler text to the original prompt recovers 57.3% accuracy). Our logit analysis further indicates that unlearned models are unlikely to hide knowledge through changes in answer formatting, given the strong correlation between output and logit accuracy. These findings challenge prevailing assumptions about unlearning effectiveness and highlight the need for evaluation frameworks that can reliably distinguish between genuine knowledge removal and superficial output suppression. To facilitate further research, we publicly release our evaluation framework to easily evaluate prompting techniques to retrieve unlearned knowledge.

View on arXiv
Comments on this paper