SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability

12 March 2025

Papers citing "SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability"

AlignSAE: Concept-Aligned Sparse Autoencoders

Steven Bethard

Mihai Surdeanu

Liangming Pan

LLMSV

323

01 Dec 2025

Towards Open-Ended Visual Scientific Discovery with Sparse Autoencoders

101

21 Nov 2025

Sparse Autoencoders are Topic Models

Leander Girrbach

Zeynep Akata

119

20 Nov 2025

Visual Exploration of Feature Relationships in Sparse Autoencoders with Curated Concepts

165

08 Nov 2025

Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability

30 Oct 2025

Re-envisioning Euclid Galaxy Morphology: Identifying and Interpreting Features with Sparse Autoencoders

John F. Wu

Michael Walmsley

207

27 Oct 2025

Time-Aware Feature Selection: Adaptive Temporal Masking for Stable Sparse Autoencoder Training

T. Ed Li

Junyu Ren

09 Oct 2025

Memory Retrieval and Consolidation in Large Language Models through Function Tokens

100

09 Oct 2025

Does higher interpretability imply better utility? A Pairwise Analysis on Sparse Autoencoders

208

04 Oct 2025

AbsTopK: Rethinking Sparse Autoencoders For Bidirectional Features

Xudong Zhu

Mohammad Mahdi Khalili

Zhihui Zhu

260

01 Oct 2025

Measuring Sparse Autoencoder Feature Sensitivity

Claire Tian

Katherine Tian

Nathan Hu

210

28 Sep 2025

LLM Interpretability with Identifiable Temporal-Instantaneous Representation

128

27 Sep 2025

Analysis of Variational Sparse Autoencoders

Zachary Baker

Yuxiao Li

DRL

324

26 Sep 2025

OrtSAE: Orthogonal Sparse Autoencoders Uncover Atomic Features

139

26 Sep 2025

ConceptViz: A Visual Analytics Approach for Exploring Concepts in Large Language Models

156

20 Sep 2025

Rethinking Sparse Autoencoders: Select-and-Project for Fairness and Control from Encoder Features Alone

Antonio Bărbălău

Cristian Daniel Păduraru

Teodor Poncu

Alexandru Tifrea

Elena Burceanu

179

13 Sep 2025

Sparse Autoencoder Neural Operators: Model Recovery in Function Spaces

Bahareh Tolooshams

Ailsa Shen

A. Anandkumar

03 Sep 2025

CE-Bench: Towards a Reliable Contrastive Evaluation Benchmark of Interpretability of Sparse Autoencoders

Alex Gulko

Yusen Peng

Sachin Kumar

107

31 Aug 2025

Distribution-Aware Feature Selection for SAEs

29 Aug 2025

AdaptiveK Sparse Autoencoders: Dynamic Sparsity Allocation for Interpretable LLM Representations

Yifei Yao

Mengnan Du

173

24 Aug 2025

Measuring and Guiding Monosemanticity

119

24 Jun 2025

Mitigating Spurious Correlations in LLMs via Causality-Aware Post-Training

Shurui Gui

Shuiwang Ji

LRM

255

11 Jun 2025

Transferring Linear Features Across Language Models With Model Stitching

246

07 Jun 2025

Train One Sparse Autoencoder Across Multiple Sparsity Budgets to Preserve Interpretability and Accuracy

241

30 May 2025

Kronecker Factorization Improves Efficiency and Interpretability of Sparse Autoencoders

249

28 May 2025

Towards Interpretability Without Sacrifice: Faithful Dense Layer Decomposition with Mixture of Decoders

318

27 May 2025

DB-KSVD: Scalable Alternating Optimization for Disentangling High-Dimensional Embedding Spaces

Romeo Valentin

Sydney M. Katz

Vincent Vanhoucke

Mykel J. Kochenderfer

213

24 May 2025

Beyond Prompt Engineering: Robust Behavior Control in LLMs via Steering Target Atoms

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

450

23 May 2025

Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models

Patrick Leask

Neel Nanda

Noura Al Moubayed

300

23 May 2025

Ensembling Sparse Autoencoders

Soham Gadgil

Chris Lin

Su-In Lee

294

21 May 2025

Evaluating Adversarial Robustness of Concept Representations in Sparse Autoencoders

370

21 May 2025

Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders

David Chanin

Tomáš Dulka

Adrià Garriga-Alonso

397

16 May 2025

Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability -- The Strange Science Part I.ii

Kola Ayonrinde

Louis Jaburi

XAI

311

02 May 2025

MIB: A Mechanistic Interpretability Benchmark

...

677

17 Apr 2025

SAEs Can Improve Unlearning: Dynamic Sparse Autoencoder Guardrails for Precision Unlearning in LLMs

419

11 Apr 2025

Evaluating and Designing Sparse Autoencoders by Approximating Quasi-Orthogonality

207

31 Mar 2025

Learning Multi-Level Features with Matryoshka Sparse Autoencoders

313

21 Mar 2025

Revisiting End-To-End Sparse Autoencoder Training: A Short Finetune Is All You Need

Adam Karvonen

283

21 Mar 2025

Automated Interpretability Metrics Do Not Distinguish Trained and Random Transformers

257

29 Jan 2025

Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words

International Conference on Learning Representations (ICLR), 2025

401

09 Jan 2025

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

574

251

28 Mar 2024