
Kernelized Concept Erasure

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022

28 January 2022


Papers citing "Kernelized Concept Erasure"


Can Prompts Rewind Time for LLMs? Evaluating the Effectiveness of Prompted Knowledge Cutoffs

Sai Ashish Somayajula

Pengtao Xie

KELM MU

185

26 Sep 2025

Memory in Large Language Models: Mechanisms, Evaluation and Evolution

273

23 Sep 2025

Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning

Senthooran Rajamanoharan

Neel Nanda

OODD LLMSV

514

22 Jul 2025

Nonlinear Concept Erasure: a Density Matching Approach

Antoine Saillenfest

Pirmin Lemberger

267

16 Jul 2025

Improving Causal Interventions in Amnesic Probing with Mean Projection or LEACE

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

Alicja Dobrzeniecka

Antske Fokkens

Pia Sommerauer

163

13 Jun 2025

Focus On This, Not That! Steering LLMs with Adaptive Feature Specification

622

30 Oct 2024

Machine Unlearning Fails to Remove Data Poisoning Attacks

627

25 Jun 2024

Exploring Safety-Utility Trade-Offs in Personalized Language Models

Anvesh Rao Vijjini

Somnath Basu Roy Chowdhury

Snigdha Chaturvedi

675

17 Jun 2024

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

828

297

28 Mar 2024

The Ethics of Automating Legal Actors

Transactions of the Association for Computational Linguistics (TACL), 2023

271

01 Dec 2023

Gen-Z: Generative Zero-Shot Text Classification with Contextualized Label Descriptions

International Conference on Learning Representations (ICLR), 2023

286

13 Nov 2023

Removing Spurious Concepts from Neural Network Representations via Joint Subspace Estimation

International Conference on Machine Learning (ICML), 2023

262

18 Oct 2023

LEACE: Perfect linear concept erasure in closed form

Neural Information Processing Systems (NeurIPS), 2023

Nora Belrose

David Schneider-Joseph

950

193

06 Jun 2023

Shielded Representations: Protecting Sensitive Attributes Through Iterative Gradient-Based Projection

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

Shadi Iskander

Kira Radinsky

Yonatan Belinkov

466

17 May 2023

Emergent and Predictable Memorization in Large Language Models

Neural Information Processing Systems (NeurIPS), 2023

370

181

21 Apr 2023

Competence-Based Analysis of Language Models

417

01 Mar 2023

Self-Destructing Models: Increasing the Costs of Harmful Dual Uses of Foundation Models

AAAI/ACM Conference on AI, Ethics, and Society (AIES), 2022

Peter Henderson

E. Mitchell

Christopher D. Manning

Dan Jurafsky

Chelsea Finn

268

27 Nov 2022

Probing Classifiers are Unreliable for Concept Removal and Detection

Neural Information Processing Systems (NeurIPS), 2022

393

08 Jul 2022

Naturalistic Causal Probing for Morpho-Syntax

Transactions of the Association for Computational Linguistics (TACL), 2022

364

14 May 2022

Probing for the Usage of Grammatical Number

Annual Meeting of the Association for Computational Linguistics (ACL), 2022

373

19 Apr 2022