
| Title | Venue |
|---|---|
| FADE: Why Bad Descriptions Happen to Good Features | Annual Meeting of the Association for Computational Linguistics (ACL), 2025 |
| Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words | International Conference on Learning Representations (ICLR), 2025 |
| Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in Foundation Models | North American Chapter of the Association for Computational Linguistics (NAACL), 2024 |
| Attention Speaks Volumes: Localizing and Mitigating Bias in Language Models | Annual Meeting of the Association for Computational Linguistics (ACL), 2024 |
| Analyzing (In)Abilities of SAEs via Formal Languages | North American Chapter of the Association for Computational Linguistics (NAACL), 2024 |
| Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations | Workshop on Analyzing and Interpreting Neural Networks for NLP (BlackboxNLP), 2024 |
| Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2 | Workshop on Analyzing and Interpreting Neural Networks for NLP (BlackboxNLP), 2024 |
| Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning | Neural Information Processing Systems (NeurIPS), 2024 |
| LEACE: Perfect linear concept erasure in closed form | Neural Information Processing Systems (NeurIPS), 2023 |