Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
14 May 2024
Aleksandar Makelov, Georg Lange, Neel Nanda
arXiv:2405.08366

Papers citing "Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control"

44 papers
Extracting Interaction-Aware Monosemantic Concepts in Recommender Systems
Dor Arviv, Yehonatan Elisha, Oren Barkan, Noam Koenigstein
22 Nov 2025
Feature-Guided SAE Steering for Refusal-Rate Control using Contrasting Prompts
Samaksh Bhargav, Zining Zhu
26 Oct 2025
Discursive Circuits: How Do Language Models Understand Discourse Relations?
Yisong Miao, Min-Yen Kan
13 Oct 2025
A Unified Probabilistic Framework for Dictionary Learning with Parsimonious Activation
Zihui Zhao, Yuanbo Tang, Jieyu Ren, Xiaoping Zhang, Yang Li
30 Sep 2025
Measuring Sparse Autoencoder Feature Sensitivity
Claire Tian, Katherine Tian, Nathan Hu
28 Sep 2025
Visual CoT Makes VLMs Smarter but More Fragile
Chunxue Xu, Yiwei Wang, Yujun Cai, Bryan Hooi, Songze Li
28 Sep 2025
LLM Interpretability with Identifiable Temporal-Instantaneous Representation
Xiangchen Song, Jiaqi Sun, Zijian Li, Yujia Zheng, Kun Zhang
27 Sep 2025
Layer-Wise Perturbations via Sparse Autoencoders for Adversarial Text Generation
Huizhen Shu, Xuying Li, Qirui Wang, Yuji Kosuga, Mengqiu Tian, Zhuo Li
14 Aug 2025
Resurrecting the Salmon: Rethinking Mechanistic Interpretability with Domain-Specific Sparse Autoencoders
Charles O'Neill, Mudith Jayasekara, Max Kirkby
12 Aug 2025
Mammo-SAE: Interpreting Breast Cancer Concept Learning with Sparse Autoencoders
Krishna Kanth Nakka
21 Jul 2025
Mechanistic Indicators of Understanding in Large Language Models
Pierre Beckmann, Matthieu Queloz
07 Jul 2025
Rethinking Explainability in the Era of Multimodal AI
Chirag Agarwal
16 Jun 2025
InverseScope: Scalable Activation Inversion for Interpreting Large Language Models
Yifan Luo, Zhennan Zhou, Bin Dong
09 Jun 2025
Interpretation Meets Safety: A Survey on Interpretation Methods and Tools for Improving LLM Safety
Seongmin Lee, Aeree Cho, Grace C. Kim, ShengYun Peng, Mansi Phute, Duen Horng Chau
05 Jun 2025
Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models
Patrick Leask, Neel Nanda, Noura Al Moubayed
23 May 2025
Interpretability Illusions with Sparse Autoencoders: Evaluating Robustness of Concept Representations
Aaron Jiaxun Li, Suraj Srinivas, Usha Bhalla, Himabindu Lakkaraju
21 May 2025
Empirical Evaluation of Progressive Coding for Sparse Autoencoders
Hans Peter, Anders Søgaard
30 Apr 2025
Activation Patching for Interpretable Steering in Music Generation
Simone Facchiano, Giorgio Strano, Donato Crisostomi, Irene Tallini, Tommaso Mencattini, Fabio Galasso, Emanuele Rodolà
06 Apr 2025
Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning
Julian Minder, Clement Dumas, Caden Juang, Bilal Chugtai, Neel Nanda
03 Apr 2025
Learning Multi-Level Features with Matryoshka Sparse Autoencoders
Bart Bussmann, Noa Nabeshima, Adam Karvonen, Neel Nanda
21 Mar 2025
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability
Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Isaac Bloom, ..., Matthew Wearden, Arthur Conmy, Samuel Marks, Neel Nanda
12 Mar 2025
Do Sparse Autoencoders Generalize? A Case Study of Answerability
Lovis Heindrich, Juil Sock, Fazl Barez, Veronika Thost
27 Feb 2025
FADE: Why Bad Descriptions Happen to Good Features (ACL 2025)
Bruno Puri, Aakriti Jain, Elena Golimblevskaia, Patrick Kahardipraja, Thomas Wiegand, Wojciech Samek, Sebastian Lapuschkin
24 Feb 2025
Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words (ICLR 2025)
Gouki Minegishi, Hiroki Furuta, Yusuke Iwasawa, Y. Matsuo
09 Jan 2025
Isometry pursuit
Samson Koelle, M. Meilă
27 Nov 2024
Compute Optimal Inference and Provable Amortisation Gap in Sparse Autoencoders
Charles O'Neill, David Klindt
20 Nov 2024
Towards Unifying Interpretability and Control: Evaluation via Intervention
Usha Bhalla, Suraj Srinivas, Asma Ghandeharioun, Himabindu Lakkaraju
07 Nov 2024
Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in Foundation Models (NAACL 2024)
Aashiq Muhamed, Mona Diab, Virginia Smith
01 Nov 2024
Attention Speaks Volumes: Localizing and Mitigating Bias in Language Models (ACL 2024)
Rishabh Adiga, Besmira Nushi, Varun Chandrasekaran
29 Oct 2024
Group-SAE: Efficient Training of Sparse Autoencoders for Large Language Models via Layer Groups
Davide Ghilardi, Federico Belotti, Marco Molinari, Tao Ma, Matteo Palmonari
28 Oct 2024
Analyzing (In)Abilities of SAEs via Formal Languages (NAACL 2024)
Abhinav Menon, Manish Shrivastava, David M. Krueger, Ekdeep Singh Lubana
15 Oct 2024
SAGE: Scalable Ground Truth Evaluations for Large Sparse Autoencoders
Constantin Venhoff, Anisoara Calinescu, Juil Sock, Christian Schroeder de Witt
09 Oct 2024
Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small
Maheep Chaudhary, Atticus Geiger
05 Sep 2024
Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations (BlackboxNLP 2024)
Róbert Csordás, Christopher Potts, Christopher D. Manning, Atticus Geiger
20 Aug 2024
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2 (BlackboxNLP 2024)
Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, Neel Nanda
09 Aug 2024
The Quest for the Right Mediator: Surveying Mechanistic Interpretability Through the Lens of Causal Mediation Analysis (Computational Linguistics, 2024)
Aaron Mueller, Jannik Brinkmann, Millicent Li, Samuel Marks, Koyena Pal, ..., Arnab Sen Sharma, Jiuding Sun, Eric Todd, David Bau, Yonatan Belinkov
02 Aug 2024
Disentangling Dense Embeddings with Sparse Autoencoders
Charles O'Neill, Christine Ye, K. Iyer, John F. Wu
01 Aug 2024
Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models
Adam Karvonen, Benjamin Wright, Can Rager, Rico Angell, Jannik Brinkmann, Logan Smith, C. M. Verdun, David Bau, Samuel Marks
31 Jul 2024
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov, Ziyu Yao
02 Jul 2024
Interpreting Attention Layer Outputs with Sparse Autoencoders
Connor Kissane, Robert Krzyzanowski, Joseph Isaac Bloom, Arthur Conmy, Neel Nanda
25 Jun 2024
Refusal in Language Models Is Mediated by a Single Direction
Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, Neel Nanda
17 Jun 2024
Scaling and evaluating sparse autoencoders
Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, Jeffrey Wu
06 Jun 2024
Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning (NeurIPS 2024)
Dan Braun, Jordan K. Taylor, Nicholas Goldowsky-Dill, Lee D. Sharkey
17 May 2024
LEACE: Perfect linear concept erasure in closed form (NeurIPS 2023)
Nora Belrose, David Schneider-Joseph, Shauli Ravfogel, Robert Bamler, Edward Raff, Stella Biderman
06 Jun 2023