Title
On Linear Representations and Pretraining Data Frequency in Language Models Jack Merullo Noah A. Smith Sarah Wiegreffe Yanai Elazar 35 0 0 16 Apr 2025
Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders Luke Marks Alasdair Paren David M. Krueger Fazl Barez AAML 27 4 0 02 Nov 2024
SAGE: Scalable Ground Truth Evaluations for Large Sparse Autoencoders Constantin Venhoff Anisoara Calinescu Philip H. S. Torr Christian Schroeder de Witt 33 0 0 09 Oct 2024
Characterizing stable regions in the residual stream of LLMs Jett Janiak Jacek Karwowski Chatrik Singh Mangat Giorgi Giglemiani Nora Petrova Stefan Heimersheim 44 1 0 25 Sep 2024
TracrBench: Generating Interpretability Testbeds with Large Language Models Hannes Thurnherr Jérémy Scheurer 46 3 0 07 Sep 2024
The Mechanics of Conceptual Interpretation in GPT Models: Interpretative Insights Nura Aljaafari Danilo S. Carvalho André Freitas KELM 32 0 0 05 Aug 2024
Analyzing the Generalization and Reliability of Steering Vectors Daniel Tan David Chanin Aengus Lynch Dimitrios Kanoulas Brooks Paige Adrià Garriga-Alonso Robert Kirk LLMSV 84 16 0 17 Jul 2024
Functional Faithfulness in the Wild: Circuit Discovery with Differentiable Computation Graph Pruning Lei Yu Jingcheng Niu Zining Zhu Gerald Penn 36 5 0 04 Jul 2024
Sharing Matters: Analysing Neurons Across Languages and Tasks in LLMs Weixuan Wang Barry Haddow Wei Peng Alexandra Birch MILM 35 9 0 13 Jun 2024
Weight-based Decomposition: A Case for Bilinear MLPs Michael T. Pearce Thomas Dooms Alice Rigg 42 1 0 06 Jun 2024
Mechanistic Interpretability for AI Safety -- A Review Leonard Bereska E. Gavves AI4CE 40 111 0 22 Apr 2024
A singular Riemannian Geometry Approach to Deep Neural Networks III. Piecewise Differentiable Layers and Random Walks on $n$-dimensional Classes A. Benfenati A. Marta 24 1 0 09 Apr 2024
Defining Neural Network Architecture through Polytope Structures of Dataset Sangmin Lee Abbas Mammadov Jong Chul Ye 56 0 0 04 Feb 2024
Explainable Artificial Intelligence (XAI) 2.0: A Manifesto of Open Challenges and Interdisciplinary Research Directions Luca Longo Mario Brcic Federico Cabitza Jaesik Choi Roberto Confalonieri ... Andrés Páez Wojciech Samek Johannes Schneider Timo Speith Simone Stumpf 29 189 0 30 Oct 2023
Neural Polytopes Koji Hashimoto T. Naito Hisashi Naito 22 1 0 03 Jul 2023
Finding Neurons in a Haystack: Case Studies with Sparse Probing Wes Gurnee Neel Nanda Matthew Pauly Katherine Harvey Dmitrii Troitskii Dimitris Bertsimas MILM 155 186 0 02 May 2023
Disentangling Neuron Representations with Concept Vectors Laura O'Mahony Vincent Andrearczyk Henning Müller Mara Graziani MILM 25 14 0 19 Apr 2023
Break It Down: Evidence for Structural Compositionality in Neural Networks Michael A. Lepori Thomas Serre Ellie Pavlick 33 29 0 26 Jan 2023
Toy Models of Superposition Nelson Elhage Tristan Hume Catherine Olsson Nicholas Schiefer T. Henighan ... Sam McCandlish Jared Kaplan Dario Amodei Martin Wattenberg C. Olah AAML MILM 120 317 0 21 Sep 2022
Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks Tilman Räuker A. Ho Stephen Casper Dylan Hadfield-Menell AAML AI4CE 18 124 0 27 Jul 2022