SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability

12 March 2025
Adam Karvonen
Can Rager
Johnny Lin
Curt Tigges
Joseph Isaac Bloom
David Chanin
Yeu-Tong Lau
Eoin Farrell
Callum McDougall
Kola Ayonrinde
Matthew Wearden
Arthur Conmy
Samuel Marks
Neel Nanda
    MU

Papers citing "SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability"

36 / 36 papers shown
Towards Open-Ended Visual Scientific Discovery with Sparse Autoencoders
Samuel Stevens
Jacob Beattie
T. Berger-Wolf
Yu-Chuan Su
48
0
0
21 Nov 2025
Sparse Autoencoders are Topic Models
Leander Girrbach
Zeynep Akata
69
0
0
20 Nov 2025
Visual Exploration of Feature Relationships in Sparse Autoencoders with Curated Concepts
Xinyuan Yan
Shusen Liu
Kowshik Thopalli
Bei Wang
84
0
0
08 Nov 2025
Re-envisioning Euclid Galaxy Morphology: Identifying and Interpreting Features with Sparse Autoencoders
John F. Wu
Michael Walmsley
81
0
0
27 Oct 2025
Does higher interpretability imply better utility? A Pairwise Analysis on Sparse Autoencoders
Xu Wang
Yan Hu
Benyou Wang
Difan Zou
LLMSV
176
1
0
04 Oct 2025
AbsTopK: Rethinking Sparse Autoencoders For Bidirectional Features
Xudong Zhu
Mohammad Mahdi Khalili
Zhihui Zhu
196
0
0
01 Oct 2025
Measuring Sparse Autoencoder Feature Sensitivity
Claire Tian
Katherine Tian
Nathan Hu
148
0
0
28 Sep 2025
LLM Interpretability with Identifiable Temporal-Instantaneous Representation
Xiangchen Song
Jiaqi Sun
Zijian Li
Yujia Zheng
Kun Zhang
72
0
0
27 Sep 2025
Analysis of Variational Sparse Autoencoders
Zachary Baker
Yuxiao Li
DRL
231
0
0
26 Sep 2025
OrtSAE: Orthogonal Sparse Autoencoders Uncover Atomic Features
Anton Korznikov
Andrey V. Galichin
Alexey Dontsov
Oleg Y. Rogov
Elena Tutubalina
Ivan Oseledets
100
0
0
26 Sep 2025
ConceptViz: A Visual Analytics Approach for Exploring Concepts in Large Language Models
Xue Yang
Zhen Wen
Qiqi Jiang
Chenxiao Li
Yuwei Wu
Y. Yang
Yiyao Wang
Xiuqi Huang
Minfeng Zhu
Wei Chen
111
0
0
20 Sep 2025
Rethinking Sparse Autoencoders: Select-and-Project for Fairness and Control from Encoder Features Alone
Antonio Bărbălău
Cristian Daniel Păduraru
Teodor Poncu
Alexandru Tifrea
Elena Burceanu
100
0
0
13 Sep 2025
Sparse Autoencoder Neural Operators: Model Recovery in Function Spaces
Bahareh Tolooshams
Ailsa Shen
A. Anandkumar
61
0
0
03 Sep 2025
CE-Bench: Towards a Reliable Contrastive Evaluation Benchmark of Interpretability of Sparse Autoencoders
Alex Gulko
Yusen Peng
Sachin Kumar
84
0
0
31 Aug 2025
Distribution-Aware Feature Selection for SAEs
Narmeen Oozeer
Nirmalendu Prakash
Michael Lan
Alice Rigg
Amirali Abdullah
59
0
0
29 Aug 2025
AdaptiveK Sparse Autoencoders: Dynamic Sparsity Allocation for Interpretable LLM Representations
Yifei Yao
Mengnan Du
111
0
0
24 Aug 2025
Mitigating Spurious Correlations in LLMs via Causality-Aware Post-Training
Shurui Gui
Shuiwang Ji
LRM
206
2
0
11 Jun 2025
Transferring Linear Features Across Language Models With Model Stitching
Alan Chen
Jack Merullo
Alessandro Stolfo
Ellie Pavlick
172
1
0
07 Jun 2025
Train One Sparse Autoencoder Across Multiple Sparsity Budgets to Preserve Interpretability and Accuracy
Nikita Balagansky
Gleb Gerasimov
Yaroslav Aksenov
Daniil Laptev
Vadim Kurochkin
Nikita Koryagin
Daniil Gavrilov
173
0
0
30 May 2025
Train Sparse Autoencoders Efficiently by Utilizing Features Correlation
Daniil Laptev
Gleb Gerasimov
Yaroslav Aksenov
Daniil Gavrilov
Nikita Balagansky
161
0
0
28 May 2025
Towards Interpretability Without Sacrifice: Faithful Dense Layer Decomposition with Mixture of Decoders
James Oldfield
Shawn Im
Yixuan Li
M. Nicolaou
Ioannis Patras
Grigorios G. Chrysos
MoE
252
0
0
27 May 2025
DB-KSVD: Scalable Alternating Optimization for Disentangling High-Dimensional Embedding Spaces
Romeo Valentin
Sydney M. Katz
Vincent Vanhoucke
Mykel J. Kochenderfer
151
0
0
24 May 2025
Beyond Prompt Engineering: Robust Behavior Control in LLMs via Steering Target Atoms
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Mengru Wang
Ziwen Xu
Shengyu Mao
Shumin Deng
Zhaopeng Tu
Ningyu Zhang
LLMSV
370
8
0
23 May 2025
Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models
Patrick Leask
Neel Nanda
Noura Al Moubayed
223
1
0
23 May 2025
Interpretability Illusions with Sparse Autoencoders: Evaluating Robustness of Concept Representations
Aaron Jiaxun Li
Suraj Srinivas
Usha Bhalla
Himabindu Lakkaraju
AAML
275
3
0
21 May 2025
Ensembling Sparse Autoencoders
Soham Gadgil
Chris Lin
Su-In Lee
234
1
0
21 May 2025
Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders
David Chanin
Tomáš Dulka
Adrià Garriga-Alonso
284
4
0
16 May 2025
Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability -- The Strange Science Part I.ii
Kola Ayonrinde
Louis Jaburi
XAI
230
2
0
02 May 2025
MIB: A Mechanistic Interpretability Benchmark
Aaron Mueller
Atticus Geiger
Sarah Wiegreffe
Dana Arad
Iván Arcuschin
...
Alessandro Stolfo
Martin Tutek
Amir Zur
David Bau
Yonatan Belinkov
532
6
0
17 Apr 2025
SAEs $\textit{Can}$ Improve Unlearning: Dynamic Sparse Autoencoder Guardrails for Precision Unlearning in LLMs
Aashiq Muhamed
Jacopo Bonato
Mona Diab
Virginia Smith
MU
308
16
0
11 Apr 2025
Evaluating and Designing Sparse Autoencoders by Approximating Quasi-Orthogonality
Sewoong Lee
Adam Davies
Marc E. Canby
Anjali Narayan-Chen
LLMSV
175
2
0
31 Mar 2025
Revisiting End-To-End Sparse Autoencoder Training: A Short Finetune Is All You Need
Adam Karvonen
225
1
0
21 Mar 2025
Learning Multi-Level Features with Matryoshka Sparse Autoencoders
Bart Bussmann
Noa Nabeshima
Adam Karvonen
Neel Nanda
257
43
0
21 Mar 2025
Sparse Autoencoders Can Interpret Randomly Initialized Transformers
Thomas Heap
Tim Lawson
Lucy Farnik
Laurence Aitchison
185
27
0
29 Jan 2025
Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words
International Conference on Learning Representations (ICLR), 2025
Gouki Minegishi
Hiroki Furuta
Yusuke Iwasawa
Y. Matsuo
327
9
0
09 Jan 2025
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Samuel Marks
Can Rager
Eric J. Michaud
Yonatan Belinkov
David Bau
Aaron Mueller
451
233
0
28 Mar 2024