Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders

16 May 2025

Adrià Garriga-Alonso

Papers citing "Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders"

OrtSAE: Orthogonal Sparse Autoencoders Uncover Atomic Features. Anton Korznikov, Andrey V. Galichin, Alexey Dontsov, Oleg Y. Rogov, Elena Tutubalina, Ivan Oseledets. 26 Sep 2025.
Towards Atoms of Large Language Models. Chenhui Hu, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao. 25 Sep 2025.
Sparse but Wrong: Incorrect L0 Leads to Incorrect Features in Sparse Autoencoders. David Chanin, Adrià Garriga-Alonso. 22 Aug 2025.
Learning Multi-Level Features with Matryoshka Sparse Autoencoders. Bart Bussmann, Noa Nabeshima, Adam Karvonen, Neel Nanda. 21 Mar 2025.
Are Sparse Autoencoders Useful? A Case Study in Sparse Probing. Subhash Kantamneni, Joshua Engels, Senthooran Rajamanoharan, Max Tegmark, Neel Nanda. 23 Feb 2025.
Decomposing The Dark Matter of Sparse Autoencoders. Joshua Engels, Logan Riggs, Max Tegmark. 18 Oct 2024.
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models. Samuel Marks, Can Rager, Eric J. Michaud, Yonatan Belinkov, David Bau, Aaron Mueller. 28 Mar 2024.