Do Unlearning Methods Remove Information from Language Model Weights?

11 October 2024

Papers citing "Do Unlearning Methods Remove Information from Language Model Weights?"

8 / 8 papers shown

Title
$SAEs $\textit{Can}$ Improve Unlearning: Dynamic Sparse Autoencoder Guardrails for Precision Unlearning in LLMs$ SAEs $\textit{Can}$ Improve Unlearning: Dynamic Sparse Autoencoder Guardrails for Precision Unlearning in LLMs Aashiq Muhamed Jacopo Bonato Mona Diab Virginia Smith MU 37 0 0 11 Apr 2025
Not All Data Are Unlearned Equally Aravind Krishnan Siva Reddy Marius Mosbach MU 39 0 0 07 Apr 2025
Exact Unlearning of Finetuning Data via Model Merging at Scale Kevin Kuo Amrith Rajagopal Setlur Kartik Srinivas Aditi Raghunathan Virginia Smith MoMe CLL MU 40 0 0 06 Apr 2025
A General Framework to Enhance Fine-tuning-based LLM Unlearning J. Ren Zhenwei Dai X. Tang Hui Liu Jingying Zeng ... R. Goutam Suhang Wang Yue Xing Qi He Hui Liu MU 91 1 0 25 Feb 2025
Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities Zora Che Stephen Casper Robert Kirk Anirudh Satheesh Stewart Slocum ... Zikui Cai Bilal Chughtai Y. Gal Furong Huang Dylan Hadfield-Menell MU AAML ELM 66 2 0 03 Feb 2025
Mechanistic Unlearning: Robust Knowledge Unlearning and Editing via Mechanistic Localization Phillip Guo Aaquib Syed Abhay Sheshadri Aidan Ewart Gintare Karolina Dziugaite KELM MU 19 5 0 16 Oct 2024
An Adversarial Perspective on Machine Unlearning for AI Safety Jakub Łucki Boyi Wei Yangsibo Huang Peter Henderson F. Tramèr Javier Rando MU AAML 54 31 0 26 Sep 2024
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models Daking Rai Yilun Zhou Shi Feng Abulhair Saparov Ziyu Yao 49 18 0 02 Jul 2024