2501.17727
Cited By
Sparse Autoencoders Can Interpret Randomly Initialized Transformers
29 January 2025
Thomas Heap
Tim Lawson
Lucy Farnik
Laurence Aitchison
Papers citing "Sparse Autoencoders Can Interpret Randomly Initialized Transformers"
15 / 15 papers shown
Rethinking Explainability in the Era of Multimodal AI
Chirag Agarwal
22
0
0
16 Jun 2025
Incorporating Hierarchical Semantics in Sparse Autoencoder Architectures
Mark Muchane
Sean Richardson
Kiho Park
Victor Veitch
36
0
0
01 Jun 2025
Interpreting Large Text-to-Image Diffusion Models with Dictionary Learning
Stepan Shabalin
Ayush Panda
Dmitrii Kharlapenko
Abdur Raheem Ali
Yixiong Hao
Arthur Conmy
DiffM
47
0
0
30 May 2025
Train Sparse Autoencoders Efficiently by Utilizing Features Correlation
Vadim Kurochkin
Yaroslav Aksenov
Daniil Laptev
Daniil Gavrilov
Nikita Balagansky
60
0
0
28 May 2025
Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models
Patrick Leask
Neel Nanda
Noura Al Moubayed
87
1
0
23 May 2025
Textual Steering Vectors Can Improve Visual Understanding in Multimodal Large Language Models
Woody Haosheng Gan
Deqing Fu
Julian Asilis
Ollie Liu
Dani Yogatama
Vatsal Sharan
Robin Jia
Willie Neiswanger
LLMSV
87
1
0
20 May 2025
Explaining Neural Networks with Reasons
Levin Hornischer
Hannes Leitgeb
FAtt
AAML
MILM
95
0
0
20 May 2025
SplInterp: Improving our Understanding and Training of Sparse Autoencoders
Jeremy Budd
Javier Ideami
Benjamin Macdowall Rynne
Keith Duggar
Randall Balestriero
109
0
0
17 May 2025
Probing the Vulnerability of Large Language Models to Polysemantic Interventions
Bofan Gong
Shiyang Lai
Dawn Song
AAML
MILM
61
1
0
16 May 2025
Are Sparse Autoencoders Useful for Java Function Bug Detection?
Rui Melo
Claudia Mamede
Andre Catarino
Rui Abreu
Henrique Lopes Cardoso
130
0
0
15 May 2025
Disentangling Polysemantic Channels in Convolutional Neural Networks
Robin Hesse
Jonas Fischer
Simone Schaub-Meyer
Stefan Roth
FAtt
MILM
108
0
0
17 Apr 2025
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability
Adam Karvonen
Can Rager
Johnny Lin
Curt Tigges
Joseph Isaac Bloom
...
Matthew Wearden
Arthur Conmy
Samuel Marks
Neel Nanda
MU
164
23
0
12 Mar 2025
Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations
Lucy Farnik
Tim Lawson
Conor Houghton
Laurence Aitchison
107
1
0
25 Feb 2025
FADE: Why Bad Descriptions Happen to Good Features
Bruno Puri
Aakriti Jain
Elena Golimblevskaia
Patrick Kahardipraja
Thomas Wiegand
Wojciech Samek
Sebastian Lapuschkin
270
1
0
24 Feb 2025
Are Sparse Autoencoders Useful? A Case Study in Sparse Probing
Subhash Kantamneni
Joshua Engels
Senthooran Rajamanoharan
Max Tegmark
Neel Nanda
141
17
0
23 Feb 2025