2503.09532
Cited By
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability
12 March 2025
Adam Karvonen
Can Rager
Johnny Lin
Curt Tigges
Joseph Isaac Bloom
David Chanin
Yeu-Tong Lau
Eoin Farrell
Callum McDougall
Kola Ayonrinde
Matthew Wearden
Arthur Conmy
Samuel Marks
Neel Nanda
MU
Papers citing
"SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability"
7 / 7 papers shown
Title
Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability -- The Strange Science Part I.ii
Kola Ayonrinde
Louis Jaburi
XAI
71
1
0
02 May 2025
MIB: A Mechanistic Interpretability Benchmark
Aaron Mueller
Atticus Geiger
Sarah Wiegreffe
Dana Arad
Iván Arcuschin
...
Alessandro Stolfo
Martin Tutek
Amir Zur
David Bau
Yonatan Belinkov
41
1
0
17 Apr 2025
SAEs Can Improve Unlearning: Dynamic Sparse Autoencoder Guardrails for Precision Unlearning in LLMs
Aashiq Muhamed
Jacopo Bonato
Mona Diab
Virginia Smith
MU
58
0
0
11 Apr 2025
Evaluating and Designing Sparse Autoencoders by Approximating Quasi-Orthogonality
Sewoong Lee
Adam Davies
Marc E. Canby
J. Hockenmaier
LLMSV
65
0
0
31 Mar 2025
Revisiting End-To-End Sparse Autoencoder Training: A Short Finetune Is All You Need
Adam Karvonen
34
0
0
21 Mar 2025
Learning Multi-Level Features with Matryoshka Sparse Autoencoders
Bart Bussmann
Noa Nabeshima
Adam Karvonen
Neel Nanda
54
0
0
21 Mar 2025
Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words
Gouki Minegishi
Hiroki Furuta
Yusuke Iwasawa
Y. Matsuo
49
1
0
09 Jan 2025