Sparse Autoencoders Do Not Find Canonical Units of Analysis

7 February 2025

Michael T. Pearce

Joseph Isaac Bloom

Noura Al Moubayed

Papers citing "Sparse Autoencoders Do Not Find Canonical Units of Analysis"

Title
Dense SAE Latents Are Features, Not Bugs Xiaoqing Sun Alessandro Stolfo Joshua Engels Ben Wu Senthooran Rajamanoharan Mrinmaya Sachan Max Tegmark 54 0 0 18 Jun 2025
Decoding Dense Embeddings: Sparse Autoencoders for Interpreting and Discretizing Dense Retrieval Seongwan Park Taeklim Kim Youngjoong Ko 18 0 0 28 May 2025
Beyond Prompt Engineering: Robust Behavior Control in LLMs via Steering Target Atoms Mengru Wang Ziwen Xu Shengyu Mao Shumin Deng Zhaopeng Tu Ningyu Zhang N. Zhang LLMSV 114 0 0 23 May 2025
Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models Patrick Leask Neel Nanda Noura Al Moubayed 87 1 0 23 May 2025
Ensembling Sparse Autoencoders Soham Gadgil Chris Lin Su-In Lee 87 0 0 21 May 2025
Breaking Bad Tokens: Detoxification of LLMs Using Sparse Autoencoders Agam Goyal Vedant Rathi William Yeh Yian Wang Yuen Chen Hari Sundaram 100 0 0 20 May 2025
SplInterp: Improving our Understanding and Training of Sparse Autoencoders Jeremy Budd Javier Ideami Benjamin Macdowall Rynne Keith Duggar Randall Balestriero 99 0 0 17 May 2025
Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability -- The Strange Science Part I.ii Kola Ayonrinde Louis Jaburi XAI 147 1 0 02 May 2025
Representation Learning on a Random Lattice Aryeh Brill OOD FAtt AI4CE 120 0 0 28 Apr 2025
Learning Multi-Level Features with Matryoshka Sparse Autoencoders Bart Bussmann Noa Nabeshima Adam Karvonen Neel Nanda 126 13 0 21 Mar 2025
How LLMs Learn: Tracing Internal Representations with Sparse Autoencoders Tatsuro Inaba Kentaro Inui Yusuke Miyao Yohei Oseki Benjamin Heinzerling Yu Takagi 104 1 0 09 Mar 2025
Projecting Assumptions: The Duality Between Sparse Autoencoders and Concept Geometry Sai Sumedh R. Hindupur Ekdeep Singh Lubana Thomas Fel Demba Ba 105 10 0 03 Mar 2025