LEACE: Perfect linear concept erasure in closed form

6 June 2023

Nora Belrose

David Schneider-Joseph

Papers citing "LEACE: Perfect linear concept erasure in closed form"

41 / 91 papers shown

Title
The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning Nathaniel Li Alexander Pan Anjali Gopal Summer Yue Daniel Berrios ... Yan Shoshitaishvili Jimmy Ba K. Esvelt Alexandr Wang Dan Hendrycks ELM 32 130 0 05 Mar 2024
AtP*: An efficient and scalable method for localizing LLM behaviour to components János Kramár Tom Lieberum Rohin Shah Neel Nanda KELM 31 40 0 01 Mar 2024
RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations Jing-ling Huang Zhengxuan Wu Christopher Potts Mor Geva Atticus Geiger 43 24 0 27 Feb 2024
Immunization against harmful fine-tuning attacks Domenic Rosati Jan Wehner Kai Williams Lukasz Bartoszcze Jan Batzner Hassan Sajjad Frank Rudzicz AAML 26 15 0 26 Feb 2024
CausalGym: Benchmarking causal interpretability methods on linguistic tasks Aryaman Arora Daniel Jurafsky Christopher Potts 37 18 0 19 Feb 2024
Representation Surgery: Theory and Practice of Affine Steering Shashwat Singh Shauli Ravfogel Jonathan Herzig Roee Aharoni Ryan Cotterell Ponnurangam Kumaraguru LLMSV 19 12 0 15 Feb 2024
Suppressing Pink Elephants with Direct Principle Feedback Louis Castricato Nathan Lile Suraj Anand Hailey Schoelkopf Siddharth Verma Stella Biderman 50 9 0 12 Feb 2024
Explaining Text Classifiers with Counterfactual Representations Pirmin Lemberger Antoine Saillenfest 29 0 0 01 Feb 2024
A Comprehensive Study of Knowledge Editing for Large Language Models Ningyu Zhang Yunzhi Yao Bo Tian Peng Wang Shumin Deng ... Lei Liang Zhiqiang Zhang Xiao-Jun Zhu Jun Zhou Huajun Chen KELM 13 73 0 02 Jan 2024
Improving Activation Steering in Language Models with Mean-Centring Ole Jorgensen Dylan R. Cope Nandi Schoots Murray Shanahan LLMSV 8 27 0 06 Dec 2023
The Ethics of Automating Legal Actors Josef Valvoda Alec Thompson Ryan Cotterell Simone Teufel AILaw ELM 11 1 0 01 Dec 2023
Fuse to Forget: Bias Reduction and Selective Memorization through Model Fusion Kerem Zaman Leshem Choshen Shashank Srivastava KELM MoMe 11 10 0 13 Nov 2023
Uncovering Intermediate Variables in Transformers using Circuit Probing Michael A. Lepori Thomas Serre Ellie Pavlick 49 7 0 07 Nov 2023
Debiasing Algorithm through Model Adaptation Tomasz Limisiewicz David Marecek Tomáš Musil 11 12 0 29 Oct 2023
Knowledge Editing for Large Language Models: A Survey Song Wang Yaochen Zhu Haochen Liu Zaiyi Zheng Chen Chen Jundong Li KELM 66 118 0 24 Oct 2023
Identifying and Adapting Transformer-Components Responsible for Gender Bias in an English Language Model Abhijith Chintam Rahel Beloch Willem H. Zuidema Michael Hanna Oskar van der Wal 10 16 0 19 Oct 2023
Removing Spurious Concepts from Neural Network Representations via Joint Subspace Estimation Floris Holstege Bram Wouters Noud van Giersbergen C. Diks 13 1 0 18 Oct 2023
Emptying the Ocean with a Spoon: Should We Edit Models? Yuval Pinter Michael Elhadad KELM 12 26 0 18 Oct 2023
The Curious Case of Hallucinatory (Un)answerability: Finding Truths in the Hidden States of Over-Confident Large Language Models Aviv Slobodkin Omer Goldman Avi Caciularu Ido Dagan Shauli Ravfogel HILM LRM 23 20 0 18 Oct 2023
In-Context Unlearning: Language Models as Few Shot Unlearners Martin Pawelczyk Seth Neel Himabindu Lakkaraju MU 13 98 0 11 Oct 2023
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets Samuel Marks Max Tegmark HILM 83 164 0 10 Oct 2023
Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks Vaidehi Patil Peter Hase Mohit Bansal KELM AAML 5 90 0 29 Sep 2023
Large Language Model Alignment: A Survey Tianhao Shen Renren Jin Yufei Huang Chuang Liu Weilong Dong Zishan Guo Xinwei Wu Yan Liu Deyi Xiong LM&MA 6 169 0 26 Sep 2023
Sparse Autoencoders Find Highly Interpretable Features in Language Models Hoagy Cunningham Aidan Ewart Logan Riggs R. Huben Lee Sharkey MILM 12 289 0 15 Sep 2023
Benchmarks for Detecting Measurement Tampering Fabien Roger Ryan Greenblatt Max Nadeau Buck Shlegeris Nate Thomas 11 1 0 29 Aug 2023
A Geometric Notion of Causal Probing Clément Guerner Anej Svete Tianyu Liu Alex Warstadt Ryan Cotterell LLMSV 24 12 0 27 Jul 2023
Stay on topic with Classifier-Free Guidance Guillaume Sanchez Honglu Fan Alexander Spangher Elad Levi Pawan Sasanka Ammanamanchi Stella Biderman 3DV 23 45 0 30 Jun 2023
An Overview of Catastrophic AI Risks Dan Hendrycks Mantas Mazeika Thomas Woodside SILM 8 162 0 21 Jun 2023
Editing Large Language Models: Problems, Methods, and Opportunities Yunzhi Yao Peng Wang Bo Tian Shuyang Cheng Zhoubo Li Shumin Deng Huajun Chen Ningyu Zhang KELM 17 275 0 22 May 2023
Emergent and Predictable Memorization in Large Language Models Stella Biderman USVSN Sai Prashanth Lintang Sutawika Hailey Schoelkopf Quentin G. Anthony Shivanshu Purohit Edward Raf 11 110 0 21 Apr 2023
Computational modeling of semantic change Nina Tahmasebi Haim Dubossarsky 18 5 0 13 Apr 2023
Competence-Based Analysis of Language Models Adam Davies Jize Jiang Chengxiang Zhai ELM 13 4 0 01 Mar 2023
Log-linear Guardedness and its Implications Shauli Ravfogel Yoav Goldberg Ryan Cotterell 15 2 0 18 Oct 2022
Causal Conceptions of Fairness and their Consequences H. Nilforoshan Johann D. Gaebler Ravi Shroff Sharad Goel FaML 124 45 0 12 Jul 2022
Can Transformer be Too Compositional? Analysing Idiom Processing in Neural Machine Translation Verna Dankers Christopher G. Lucas Ivan Titov 25 28 0 30 May 2022
Linear Adversarial Concept Erasure Shauli Ravfogel Michael Twiton Yoav Goldberg Ryan Cotterell KELM 62 56 0 28 Jan 2022
All Bark and No Bite: Rogue Dimensions in Transformer Language Models Obscure Representational Quality William Timkey Marten van Schijndel 213 110 0 09 Sep 2021
The Rediscovery Hypothesis: Language Models Need to Meet Linguistics Vassilina Nikoulina Maxat Tezekbayev Nuradil Kozhakhmet Madina Babazhanova Matthias Gallé Z. Assylbekov 21 7 0 02 Mar 2021
The Pile: An 800GB Dataset of Diverse Text for Language Modeling Leo Gao Stella Biderman Sid Black Laurence Golding Travis Hoppe ... Horace He Anish Thite Noa Nabeshima Shawn Presser Connor Leahy AIMat 236 1,508 0 31 Dec 2020
On the Global Optima of Kernelized Adversarial Representation Learning Bashir Sadeghi Runyi Yu Vishnu Naresh Boddeti AAML 56 29 0 16 Oct 2019
Adversarial Deep Averaging Networks for Cross-Lingual Sentiment Classification Xilun Chen Yu Sun Ben Athiwaratkun Claire Cardie Kilian Q. Weinberger 203 315 0 06 Jun 2016