ResearchTrend.AI
LEACE: Perfect linear concept erasure in closed form

6 June 2023
Nora Belrose
David Schneider-Joseph
Shauli Ravfogel
Ryan Cotterell
Edward Raff
Stella Biderman
    KELM
    MU

Papers citing "LEACE: Perfect linear concept erasure in closed form"

41 / 91 papers shown
The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
Nathaniel Li
Alexander Pan
Anjali Gopal
Summer Yue
Daniel Berrios
...
Yan Shoshitaishvili
Jimmy Ba
K. Esvelt
Alexandr Wang
Dan Hendrycks
ELM
32
130
0
05 Mar 2024
AtP*: An efficient and scalable method for localizing LLM behaviour to components
János Kramár
Tom Lieberum
Rohin Shah
Neel Nanda
KELM
31
40
0
01 Mar 2024
RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations
Jing-ling Huang
Zhengxuan Wu
Christopher Potts
Mor Geva
Atticus Geiger
43
24
0
27 Feb 2024
Immunization against harmful fine-tuning attacks
Domenic Rosati
Jan Wehner
Kai Williams
Lukasz Bartoszcze
Jan Batzner
Hassan Sajjad
Frank Rudzicz
AAML
26
15
0
26 Feb 2024
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
Aryaman Arora
Daniel Jurafsky
Christopher Potts
37
18
0
19 Feb 2024
Representation Surgery: Theory and Practice of Affine Steering
Shashwat Singh
Shauli Ravfogel
Jonathan Herzig
Roee Aharoni
Ryan Cotterell
Ponnurangam Kumaraguru
LLMSV
19
12
0
15 Feb 2024
Suppressing Pink Elephants with Direct Principle Feedback
Louis Castricato
Nathan Lile
Suraj Anand
Hailey Schoelkopf
Siddharth Verma
Stella Biderman
50
9
0
12 Feb 2024
Explaining Text Classifiers with Counterfactual Representations
Pirmin Lemberger
Antoine Saillenfest
29
0
0
01 Feb 2024
A Comprehensive Study of Knowledge Editing for Large Language Models
Ningyu Zhang
Yunzhi Yao
Bo Tian
Peng Wang
Shumin Deng
...
Lei Liang
Zhiqiang Zhang
Xiao-Jun Zhu
Jun Zhou
Huajun Chen
KELM
13
73
0
02 Jan 2024
Improving Activation Steering in Language Models with Mean-Centring
Ole Jorgensen
Dylan R. Cope
Nandi Schoots
Murray Shanahan
LLMSV
8
27
0
06 Dec 2023
The Ethics of Automating Legal Actors
Josef Valvoda
Alec Thompson
Ryan Cotterell
Simone Teufel
AILaw
ELM
11
1
0
01 Dec 2023
Fuse to Forget: Bias Reduction and Selective Memorization through Model Fusion
Kerem Zaman
Leshem Choshen
Shashank Srivastava
KELM
MoMe
11
10
0
13 Nov 2023
Uncovering Intermediate Variables in Transformers using Circuit Probing
Michael A. Lepori
Thomas Serre
Ellie Pavlick
49
7
0
07 Nov 2023
Debiasing Algorithm through Model Adaptation
Tomasz Limisiewicz
David Marecek
Tomáš Musil
11
12
0
29 Oct 2023
Knowledge Editing for Large Language Models: A Survey
Song Wang
Yaochen Zhu
Haochen Liu
Zaiyi Zheng
Chen Chen
Jundong Li
KELM
66
118
0
24 Oct 2023
Identifying and Adapting Transformer-Components Responsible for Gender Bias in an English Language Model
Abhijith Chintam
Rahel Beloch
Willem H. Zuidema
Michael Hanna
Oskar van der Wal
10
16
0
19 Oct 2023
Removing Spurious Concepts from Neural Network Representations via Joint Subspace Estimation
Floris Holstege
Bram Wouters
Noud van Giersbergen
C. Diks
13
1
0
18 Oct 2023
Emptying the Ocean with a Spoon: Should We Edit Models?
Yuval Pinter
Michael Elhadad
KELM
12
26
0
18 Oct 2023
The Curious Case of Hallucinatory (Un)answerability: Finding Truths in the Hidden States of Over-Confident Large Language Models
Aviv Slobodkin
Omer Goldman
Avi Caciularu
Ido Dagan
Shauli Ravfogel
HILM
LRM
23
20
0
18 Oct 2023
In-Context Unlearning: Language Models as Few Shot Unlearners
Martin Pawelczyk
Seth Neel
Himabindu Lakkaraju
MU
13
98
0
11 Oct 2023
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
Samuel Marks
Max Tegmark
HILM
83
164
0
10 Oct 2023
Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks
Vaidehi Patil
Peter Hase
Mohit Bansal
KELM
AAML
5
90
0
29 Sep 2023
Large Language Model Alignment: A Survey
Tianhao Shen
Renren Jin
Yufei Huang
Chuang Liu
Weilong Dong
Zishan Guo
Xinwei Wu
Yan Liu
Deyi Xiong
LM&MA
6
169
0
26 Sep 2023
Sparse Autoencoders Find Highly Interpretable Features in Language Models
Hoagy Cunningham
Aidan Ewart
Logan Riggs
R. Huben
Lee Sharkey
MILM
12
289
0
15 Sep 2023
Benchmarks for Detecting Measurement Tampering
Fabien Roger
Ryan Greenblatt
Max Nadeau
Buck Shlegeris
Nate Thomas
11
1
0
29 Aug 2023
A Geometric Notion of Causal Probing
Clément Guerner
Anej Svete
Tianyu Liu
Alex Warstadt
Ryan Cotterell
LLMSV
24
12
0
27 Jul 2023
Stay on topic with Classifier-Free Guidance
Guillaume Sanchez
Honglu Fan
Alexander Spangher
Elad Levi
Pawan Sasanka Ammanamanchi
Stella Biderman
3DV
23
45
0
30 Jun 2023
An Overview of Catastrophic AI Risks
Dan Hendrycks
Mantas Mazeika
Thomas Woodside
SILM
8
162
0
21 Jun 2023
Editing Large Language Models: Problems, Methods, and Opportunities
Yunzhi Yao
Peng Wang
Bo Tian
Shuyang Cheng
Zhoubo Li
Shumin Deng
Huajun Chen
Ningyu Zhang
KELM
17
275
0
22 May 2023
Emergent and Predictable Memorization in Large Language Models
Stella Biderman
USVSN Sai Prashanth
Lintang Sutawika
Hailey Schoelkopf
Quentin G. Anthony
Shivanshu Purohit
Edward Raff
11
110
0
21 Apr 2023
Computational modeling of semantic change
Nina Tahmasebi
Haim Dubossarsky
18
5
0
13 Apr 2023
Competence-Based Analysis of Language Models
Adam Davies
Jize Jiang
Chengxiang Zhai
ELM
13
4
0
01 Mar 2023
Log-linear Guardedness and its Implications
Shauli Ravfogel
Yoav Goldberg
Ryan Cotterell
15
2
0
18 Oct 2022
Causal Conceptions of Fairness and their Consequences
H. Nilforoshan
Johann D. Gaebler
Ravi Shroff
Sharad Goel
FaML
124
45
0
12 Jul 2022
Can Transformer be Too Compositional? Analysing Idiom Processing in Neural Machine Translation
Verna Dankers
Christopher G. Lucas
Ivan Titov
25
28
0
30 May 2022
Linear Adversarial Concept Erasure
Shauli Ravfogel
Michael Twiton
Yoav Goldberg
Ryan Cotterell
KELM
62
56
0
28 Jan 2022
All Bark and No Bite: Rogue Dimensions in Transformer Language Models Obscure Representational Quality
William Timkey
Marten van Schijndel
213
110
0
09 Sep 2021
The Rediscovery Hypothesis: Language Models Need to Meet Linguistics
Vassilina Nikoulina
Maxat Tezekbayev
Nuradil Kozhakhmet
Madina Babazhanova
Matthias Gallé
Z. Assylbekov
21
7
0
02 Mar 2021
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
236
1,508
0
31 Dec 2020
On the Global Optima of Kernelized Adversarial Representation Learning
Bashir Sadeghi
Runyi Yu
Vishnu Naresh Boddeti
AAML
56
29
0
16 Oct 2019
Adversarial Deep Averaging Networks for Cross-Lingual Sentiment Classification
Xilun Chen
Yu Sun
Ben Athiwaratkun
Claire Cardie
Kilian Q. Weinberger
203
315
0
06 Jun 2016