2306.03819
Cited By
LEACE: Perfect linear concept erasure in closed form
6 June 2023
Nora Belrose
David Schneider-Joseph
Shauli Ravfogel
Ryan Cotterell
Edward Raff
Stella Biderman
KELM
MU
Papers citing
"LEACE: Perfect linear concept erasure in closed form"
50 / 91 papers shown
Title
Quiet Feature Learning in Algorithmic Tasks
Prudhviraj Naidu
Zixian Wang
Leon Bergen
R. Paturi
VLM
39
0
0
06 May 2025
DetoxAI: a Python Toolkit for Debiasing Deep Learning Models in Computer Vision
Ignacy Stepka
Lukasz Sztukiewicz
Michał Wiliński
Jerzy Stefanowski
10
0
0
02 May 2025
Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation
Vaidehi Patil
Yi-Lin Sung
Peter Hase
Jie Peng
Tianlong Chen
Mohit Bansal
AAML
MU
77
3
0
01 May 2025
Probing then Editing Response Personality of Large Language Models
Tianjie Ju
Zhenyu Shao
B. Wang
Y. Chen
Zhuosheng Zhang
Hao Fei
M. Lee
W. Hsu
Sufeng Duan
Gongshen Liu
KELM
38
0
0
14 Apr 2025
Fundamental Limits of Perfect Concept Erasure
Somnath Basu Roy Chowdhury
Avinava Dubey
Ahmad Beirami
Rahul Kidambi
Nicholas Monath
Amr Ahmed
Snigdha Chaturvedi
51
0
0
25 Mar 2025
Controlled Model Debiasing through Minimal and Interpretable Updates
Federico Di Gennaro
Thibault Laugel
Vincent Grari
Marcin Detyniecki
FaML
42
0
0
28 Feb 2025
Analyzing the Inner Workings of Transformers in Compositional Generalization
Ryoma Kumon
Hitomi Yanaka
56
0
0
24 Feb 2025
Model Lakes
Koyena Pal
David Bau
Renée J. Miller
60
0
0
24 Feb 2025
The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence
Tom Wollschlager
Jannes Elstner
Simon Geisler
Vincent Cohen-Addad
Stephan Günnemann
Johannes Gasteiger
LLMSV
37
0
0
24 Feb 2025
Analyzing Fine-tuning Representation Shift for Multimodal LLMs Steering alignment
Pegah Khayatan
Mustafa Shukor
Jayneel Parekh
Matthieu Cord
LLMSV
30
1
0
06 Jan 2025
Representation in large language models
Cameron C. Yetman
33
1
0
03 Jan 2025
Concept-ROT: Poisoning Concepts in Large Language Models with Model Editing
Keltin Grimes
Marco Christiani
David Shriver
Marissa Connor
KELM
75
1
0
17 Dec 2024
Controllable Context Sensitivity and the Knob Behind It
Julian Minder
Kevin Du
Niklas Stoehr
Giovanni Monea
Chris Wendler
Robert West
Ryan Cotterell
KELM
31
3
0
11 Nov 2024
RESTOR: Knowledge Recovery through Machine Unlearning
Keivan Rezaei
Khyathi Raghavi Chandu
S. Feizi
Yejin Choi
Faeze Brahman
Abhilasha Ravichander
KELM
CLL
MU
43
0
0
31 Oct 2024
Focus On This, Not That! Steering LLMs With Adaptive Feature Specification
Tom A. Lamb
Adam Davies
Alasdair Paren
Philip H. S. Torr
Francesco Pinto
40
0
0
30 Oct 2024
Debiasing Large Vision-Language Models by Ablating Protected Attribute Representations
Neale Ratzlaff
Matthew Lyle Olson
Musashi Hinck
Shao-Yen Tseng
Vasudev Lal
Phillip Howard
20
0
0
17 Oct 2024
Improving Instruction-Following in Language Models through Activation Steering
Alessandro Stolfo
Vidhisha Balachandran
Safoora Yousefi
Eric Horvitz
Besmira Nushi
LLMSV
37
13
0
15 Oct 2024
LLM Unlearning via Loss Adjustment with Only Forget Data
Yaxuan Wang
Jiaheng Wei
Chris Liu
Jinlong Pang
Q. Liu
A. Shah
Yujia Bao
Yang Liu
Wei Wei
KELM
MU
24
3
0
14 Oct 2024
Robust AI-Generated Text Detection by Restricted Embeddings
Kristian Kuznetsov
Eduard Tulchinskii
Laida Kushnareva
German Magai
Serguei Barannikov
Sergey I. Nikolenko
Irina Piontkovskaya
DeLMO
25
3
0
10 Oct 2024
Unstable Unlearning: The Hidden Risk of Concept Resurgence in Diffusion Models
Vinith M. Suriyakumar
Rohan Alur
Ayush Sekhari
Manish Raghavan
Ashia C. Wilson
34
2
0
10 Oct 2024
OD-Stega: LLM-Based Near-Imperceptible Steganography via Optimized Distributions
Yu-Shin Huang
Peter Just
Krishna Narayanan
Chao Tian
16
2
0
06 Oct 2024
Optimal ablation for interpretability
Maximilian Li
Lucas Janson
FAtt
25
1
0
16 Sep 2024
Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations
Róbert Csordás
Christopher Potts
Christopher D. Manning
Atticus Geiger
GAN
21
10
0
20 Aug 2024
The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability
Aaron Mueller
Jannik Brinkmann
Millicent Li
Samuel Marks
Koyena Pal
...
Arnab Sen Sharma
Jiuding Sun
Eric Todd
David Bau
Yonatan Belinkov
CML
23
18
0
02 Aug 2024
Tamper-Resistant Safeguards for Open-Weight LLMs
Rishub Tamirisa
Bhrugu Bharathi
Long Phan
Andy Zhou
Alice Gatti
...
Andy Zou
Dawn Song
Bo Li
Dan Hendrycks
Mantas Mazeika
AAML
MU
22
36
0
01 Aug 2024
MUSE: Machine Unlearning Six-Way Evaluation for Language Models
Weijia Shi
Jaechan Lee
Yangsibo Huang
Sadhika Malladi
Jieyu Zhao
Ari Holtzman
Daogao Liu
Luke Zettlemoyer
Noah A. Smith
Chiyuan Zhang
MU
ELM
32
36
0
08 Jul 2024
Missed Causes and Ambiguous Effects: Counterfactuals Pose Challenges for Interpreting Neural Networks
Aaron Mueller
CML
21
8
0
05 Jul 2024
Machine Unlearning Fails to Remove Data Poisoning Attacks
Martin Pawelczyk
Jimmy Z. Di
Yiwei Lu
Gautam Kamath
Ayush Sekhari
Seth Neel
AAML
MU
36
7
0
25 Jun 2024
Towards a Science Exocortex
Kevin G. Yager
64
1
0
24 Jun 2024
Preference Tuning For Toxicity Mitigation Generalizes Across Languages
Xiaochen Li
Zheng-Xin Yong
Stephen H. Bach
CLL
23
11
0
23 Jun 2024
Protecting Privacy Through Approximating Optimal Parameters for Sequence Unlearning in Language Models
Dohyun Lee
Daniel Rim
Minseok Choi
Jaegul Choo
PILM
MU
49
3
0
20 Jun 2024
Refusal in Language Models Is Mediated by a Single Direction
Andy Arditi
Oscar Obeso
Aaquib Syed
Daniel Paleka
Nina Panickssery
Wes Gurnee
Neel Nanda
34
130
0
17 Jun 2024
In-Context Editing: Learning Knowledge from Self-Induced Distributions
Siyuan Qi
Bangcheng Yang
Kailin Jiang
Xiaobo Wang
Jiaqi Li
Yifan Zhong
Yaodong Yang
Zilong Zheng
KELM
67
8
0
17 Jun 2024
Exploring Safety-Utility Trade-Offs in Personalized Language Models
Anvesh Rao Vijjini
Somnath Basu Roy Chowdhury
Snigdha Chaturvedi
21
6
0
17 Jun 2024
RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models
Zhuoran Jin
Pengfei Cao
Chenhao Wang
Zhitao He
Hongbang Yuan
Jiachun Li
Yubo Chen
Kang Liu
Jun Zhao
KELM
MU
24
12
0
16 Jun 2024
On the Encoding of Gender in Transformer-based ASR Representations
Aravind Krishnan
Badr M. Abdullah
Dietrich Klakow
33
2
0
14 Jun 2024
Applying Intrinsic Debiasing on Downstream Tasks: Challenges and Considerations for Machine Translation
Bar Iluz
Yanai Elazar
Asaf Yehudai
Gabriel Stanovsky
25
0
0
02 Jun 2024
Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories
Tianlong Wang
Xianfeng Jiao
Yifan He
Zhongzhi Chen
Yinghao Zhu
Xu Chu
Junyi Gao
Yasha Wang
Liantao Ma
LLMSV
26
7
0
26 May 2024
Linearly Controlled Language Generation with Performative Guarantees
Emily Cheng
Marco Baroni
Carmen Amo Alonso
29
2
0
24 May 2024
Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
Aleksandar Makelov
Georg Lange
Neel Nanda
24
11
0
14 May 2024
Automating Thematic Analysis: How LLMs Analyse Controversial Topics
Awais Hameed Khan
H. Kegalle
Rhea D'Silva
Ned Watt
Daniel Whelan-Shamy
Lida Ghahremanlou
Liam Magee
21
5
0
11 May 2024
Utility-Fairness Trade-Offs and How to Find Them
Sepehr Dehdashtian
Bashir Sadeghi
Vishnu Naresh Boddeti
28
6
0
15 Apr 2024
ReFT: Representation Finetuning for Language Models
Zhengxuan Wu
Aryaman Arora
Zheng Wang
Atticus Geiger
Daniel Jurafsky
Christopher D. Manning
Christopher Potts
OffRL
30
55
0
04 Apr 2024
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Samuel Marks
Can Rager
Eric J. Michaud
Yonatan Belinkov
David Bau
Aaron Mueller
30
110
0
28 Mar 2024
Can Large Language Models (or Humans) Disentangle Text?
Nicolas Audinet de Pieuchon
Adel Daoud
Connor Jerzak
Moa Johansson
Richard Johansson
28
0
0
25 Mar 2024
What Happens to a Dataset Transformed by a Projection-based Concept Removal Method?
Richard Johansson
24
0
0
24 Mar 2024
Detoxifying Large Language Models via Knowledge Editing
Meng Wang
Ningyu Zhang
Ziwen Xu
Zekun Xi
Shumin Deng
Yunzhi Yao
Qishen Zhang
Linyi Yang
Jindong Wang
Huajun Chen
KELM
25
48
0
21 Mar 2024
Towards a theory of model distillation
Enric Boix-Adserà
FedML
VLM
31
5
0
14 Mar 2024
Ethos: Rectifying Language Models in Orthogonal Parameter Space
Lei Gao
Yue Niu
Tingting Tang
A. Avestimehr
Murali Annavaram
MU
19
9
0
13 Mar 2024
Guardrail Baselines for Unlearning in LLMs
Pratiksha Thaker
Yash Maurya
Shengyuan Hu
Zhiwei Steven Wu
Virginia Smith
MU
30
33
0
05 Mar 2024