Sparse Autoencoders Can Interpret Randomly Initialized Transformers

29 January 2025

Laurence Aitchison

Papers citing "Sparse Autoencoders Can Interpret Randomly Initialized Transformers"

15 / 15 papers shown

Title
Rethinking Explainability in the Era of Multimodal AI Chirag Agarwal 22 0 0 16 Jun 2025
Incorporating Hierarchical Semantics in Sparse Autoencoder Architectures Mark Muchane Sean Richardson Kiho Park Victor Veitch 36 0 0 01 Jun 2025
Interpreting Large Text-to-Image Diffusion Models with Dictionary Learning Stepan Shabalin Ayush Panda Dmitrii Kharlapenko Abdur Raheem Ali Yixiong Hao Arthur Conmy DiffM 47 0 0 30 May 2025
Train Sparse Autoencoders Efficiently by Utilizing Features Correlation Vadim Kurochkin Yaroslav Aksenov Daniil Laptev Daniil Gavrilov Nikita Balagansky 60 0 0 28 May 2025
Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models Patrick Leask Neel Nanda Noura Al Moubayed 87 1 0 23 May 2025
Textual Steering Vectors Can Improve Visual Understanding in Multimodal Large Language Models Woody Haosheng Gan Deqing Fu Julian Asilis Ollie Liu Dani Yogatama Vatsal Sharan Robin Jia Willie Neiswanger LLMSV 87 1 0 20 May 2025
Explaining Neural Networks with Reasons Levin Hornischer Hannes Leitgeb FAtt AAML MILM 95 0 0 20 May 2025
SplInterp: Improving our Understanding and Training of Sparse Autoencoders Jeremy Budd Javier Ideami Benjamin Macdowall Rynne Keith Duggar Randall Balestriero 109 0 0 17 May 2025
Probing the Vulnerability of Large Language Models to Polysemantic Interventions Bofan Gong Shiyang Lai Dawn Song AAML MILM 61 1 0 16 May 2025
Are Sparse Autoencoders Useful for Java Function Bug Detection? Rui Melo Claudia Mamede Andre Catarino Rui Abreu Henrique Lopes Cardoso 130 0 0 15 May 2025
Disentangling Polysemantic Channels in Convolutional Neural Networks Robin Hesse Jonas Fischer Simone Schaub-Meyer Stefan Roth FAtt MILM 108 0 0 17 Apr 2025
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability Adam Karvonen Can Rager Johnny Lin Curt Tigges Joseph Isaac Bloom ... Matthew Wearden Arthur Conmy Samuel Marks Neel Nanda MU 164 23 0 12 Mar 2025
Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations Lucy Farnik Tim Lawson Conor Houghton Laurence Aitchison 107 1 0 25 Feb 2025
FADE: Why Bad Descriptions Happen to Good Features Bruno Puri Aakriti Jain Elena Golimblevskaia Patrick Kahardipraja Thomas Wiegand Wojciech Samek Sebastian Lapuschkin 270 1 0 24 Feb 2025
Are Sparse Autoencoders Useful? A Case Study in Sparse Probing Subhash Kantamneni Joshua Engels Senthooran Rajamanoharan Max Tegmark Neel Nanda 141 17 0 23 Feb 2025