The Clock and the Pizza: Two Stories in Mechanistic Explanation of
Neural Networks

The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks

30 June 2023

Papers citing "The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks"

19 / 19 papers shown

Title
Low Rank and Sparse Fourier Structure in Recurrent Networks Trained on Modular Addition Akshay Rangamani 35 0 0 28 Mar 2025
(How) Do Language Models Track State? Belinda Z. Li Zifan Carl Guo Jacob Andreas LRM 44 0 0 04 Mar 2025
Feature Importance Depends on Properties of the Data: Towards Choosing the Correct Explanations for Your Data and Decision Trees based Models Célia Wafa Ayad Thomas Bonnier Benjamin Bosch Sonali Parbhoo Jesse Read FAtt XAI 87 0 0 11 Feb 2025
Out-of-distribution generalization via composition: a lens through induction heads in Transformers Jiajun Song Zhuoyan Xu Yiqiao Zhong 75 4 0 31 Dec 2024
Interpreting Arithmetic Mechanism in Large Language Models through Comparative Neuron Analysis Zeping Yu Sophia Ananiadou LRM MILM 22 6 0 21 Sep 2024
Representing Rule-based Chatbots with Transformers Dan Friedman Abhishek Panigrahi Danqi Chen 56 1 0 15 Jul 2024
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models Daking Rai Yilun Zhou Shi Feng Abulhair Saparov Ziyu Yao 67 18 0 02 Jul 2024
Does ChatGPT Have a Mind? Simon Goldstein B. Levinstein AI4MH LRM 19 5 0 27 Jun 2024
Survival of the Fittest Representation: A Case Study with Modular Addition Xiaoman Delores Ding Zifan Carl Guo Eric J. Michaud Ziming Liu Max Tegmark 29 3 0 27 May 2024
KAN: Kolmogorov-Arnold Networks Ziming Liu Yixuan Wang Sachin Vaidya Fabian Ruehle James Halverson Marin Soljacic Thomas Y. Hou Max Tegmark 67 444 0 30 Apr 2024
Opening the AI black box: program synthesis via mechanistic interpretability Eric J. Michaud Isaac Liao Vedang Lad Ziming Liu Anish Mudide Chloe Loughridge Zifan Carl Guo Tara Rezaei Kheirkhah Mateja Vukelić Max Tegmark 18 12 0 07 Feb 2024
Black-Box Access is Insufficient for Rigorous AI Audits Stephen Casper Carson Ezell Charlotte Siegmann Noam Kolt Taylor Lynn Curtis ... Michael Gerovitch David Bau Max Tegmark David M. Krueger Dylan Hadfield-Menell AAML 13 75 0 25 Jan 2024
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods Fred Zhang Neel Nanda LLMSV 21 95 0 27 Sep 2023
It Ain't That Bad: Understanding the Mysterious Performance Drop in OOD Generalization for Generative Transformer Models Xingcheng Xu Zihao Pan Haipeng Zhang Yanqing Yang LRM 6 2 0 16 Aug 2023
Finding Neurons in a Haystack: Case Studies with Sparse Probing Wes Gurnee Neel Nanda Matthew Pauly Katherine Harvey Dmitrii Troitskii Dimitris Bertsimas MILM 153 186 0 02 May 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small Kevin Wang Alexandre Variengien Arthur Conmy Buck Shlegeris Jacob Steinhardt 210 486 0 01 Nov 2022
Omnigrok: Grokking Beyond Algorithmic Data Ziming Liu Eric J. Michaud Max Tegmark 54 76 0 03 Oct 2022
In-context Learning and Induction Heads Catherine Olsson Nelson Elhage Neel Nanda Nicholas Joseph Nova Dassarma ... Tom B. Brown Jack Clark Jared Kaplan Sam McCandlish C. Olah 240 453 0 24 Sep 2022
Toy Models of Superposition Nelson Elhage Tristan Hume Catherine Olsson Nicholas Schiefer T. Henighan ... Sam McCandlish Jared Kaplan Dario Amodei Martin Wattenberg C. Olah AAML MILM 120 314 0 21 Sep 2022