InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic
Interpretability Techniques

InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques

19 July 2024

Iván Arcuschin

Adrià Garriga-Alonso

Papers citing "InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques"

12 / 12 papers shown

Title
Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability -- The Strange Science Part I.ii Kola Ayonrinde Louis Jaburi XAI 55 1 0 02 May 2025
The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks Lucius Bushnaq Stefan Heimersheim Nicholas Goldowsky-Dill Dan Braun Jake Mendel Kaarel Hänni Avery Griffin Jörn Stöhler Magdalena Wache Marius Hobbhahn FAtt 22 3 0 17 May 2024
AtP*: An efficient and scalable method for localizing LLM behaviour to components János Kramár Tom Lieberum Rohin Shah Neel Nanda KELM 41 42 0 01 Mar 2024
RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations Jing-ling Huang Zhengxuan Wu Christopher Potts Mor Geva Atticus Geiger 53 24 0 27 Feb 2024
Look Before You Leap: A Universal Emergent Decomposition of Retrieval Tasks in Language Models Alexandre Variengien Eric Winsor LRM ReLM 72 10 0 13 Dec 2023
Attribution Patching Outperforms Automated Circuit Discovery Aaquib Syed Can Rager Arthur Conmy 55 53 0 16 Oct 2023
How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model Michael Hanna Ollie Liu Alexandre Variengien LRM 178 116 0 30 Apr 2023
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations Atticus Geiger Zhengxuan Wu Christopher Potts Thomas F. Icard Noah D. Goodman CML 73 98 0 05 Mar 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small Kevin Wang Alexandre Variengien Arthur Conmy Buck Shlegeris Jacob Steinhardt 210 486 0 01 Nov 2022
Polysemanticity and Capacity in Neural Networks Adam Scherlis Kshitij Sachan Adam Jermyn Joe Benton Buck Shlegeris MILM 130 25 0 04 Oct 2022
In-context Learning and Induction Heads Catherine Olsson Nelson Elhage Neel Nanda Nicholas Joseph Nova Dassarma ... Tom B. Brown Jack Clark Jared Kaplan Sam McCandlish C. Olah 240 453 0 24 Sep 2022
Toy Models of Superposition Nelson Elhage Tristan Hume Catherine Olsson Nicholas Schiefer T. Henighan ... Sam McCandlish Jared Kaplan Dario Amodei Martin Wattenberg C. Olah AAML MILM 117 314 0 21 Sep 2022