Title
On Language Models' Sensitivity to Suspicious Coincidences Sriram Padmanabhan Kanishka Misra Kyle Mahowald Eunsol Choi ReLM LRM 30 0 0 13 Apr 2025
HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks Jiuding Sun Jing Huang Sidharth Baskaran Karel D'Oosterlinck Christopher Potts Michael Sklar Atticus Geiger AI4CE 55 0 0 13 Mar 2025
Building Bridges, Not Walls -- Advancing Interpretability by Unifying Feature, Data, and Model Component Attribution Shichang Zhang Tessa Han Usha Bhalla Hima Lakkaraju FAtt 143 0 0 17 Feb 2025
Large Language Models Share Representations of Latent Grammatical Concepts Across Typologically Diverse Languages Jannik Brinkmann Chris Wendler Christian Bartelt Aaron Mueller 41 9 0 10 Jan 2025
The Geometry of Concepts: Sparse Autoencoder Feature Structure Yuxiao Li Eric J. Michaud David D. Baek Joshua Engels Xiaoqing Sun Max Tegmark 30 7 0 10 Oct 2024
Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models Adam Karvonen Benjamin Wright Can Rager Rico Angell Jannik Brinkmann Logan Smith C. M. Verdun David Bau Samuel Marks 25 26 0 31 Jul 2024
Talking Heads: Understanding Inter-layer Communication in Transformer Language Models Jack Merullo Carsten Eickhoff Ellie Pavlick 38 2 0 13 Jun 2024
AtP*: An efficient and scalable method for localizing LLM behaviour to components János Kramár Tom Lieberum Rohin Shah Neel Nanda KELM 36 40 0 01 Mar 2024
RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations Jing-ling Huang Zhengxuan Wu Christopher Potts Mor Geva Atticus Geiger 48 24 0 27 Feb 2024
Rethinking Interpretability in the Era of Large Language Models Chandan Singh J. Inala Michel Galley Rich Caruana Jianfeng Gao LRM AI4CE 71 59 0 30 Jan 2024
Uncovering Intermediate Variables in Transformers using Circuit Probing Michael A. Lepori Thomas Serre Ellie Pavlick 49 7 0 07 Nov 2023
Attribution Patching Outperforms Automated Circuit Discovery Aaquib Syed Can Rager Arthur Conmy 50 53 0 16 Oct 2023
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets Samuel Marks Max Tegmark HILM 91 164 0 10 Oct 2023
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations Atticus Geiger Zhengxuan Wu Christopher Potts Thomas F. Icard Noah D. Goodman CML 73 98 0 05 Mar 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small Kevin Wang Alexandre Variengien Arthur Conmy Buck Shlegeris Jacob Steinhardt 207 486 0 01 Nov 2022
In-context Learning and Induction Heads Catherine Olsson Nelson Elhage Neel Nanda Nicholas Joseph Nova DasSarma ... Tom B. Brown Jack Clark Jared Kaplan Sam McCandlish C. Olah 234 453 0 24 Sep 2022
Towards Faithful Model Explanation in NLP: A Survey Qing Lyu Marianna Apidianaki Chris Callison-Burch XAI 101 105 0 22 Sep 2022
Toy Models of Superposition Nelson Elhage Tristan Hume Catherine Olsson Nicholas Schiefer T. Henighan ... Sam McCandlish Jared Kaplan Dario Amodei Martin Wattenberg C. Olah AAML MILM 117 314 0 21 Sep 2022
Naturalistic Causal Probing for Morpho-Syntax Afra Amini Tiago Pimentel Clara Meister Ryan Cotterell MILM 93 18 0 14 May 2022
Can RNNs learn Recursive Nested Subject-Verb Agreements? Yair Lakretz T. Desbordes J. King Benoît Crabbé Maxime Oquab S. Dehaene 155 19 0 06 Jan 2021
Scaling Laws for Neural Language Models Jared Kaplan Sam McCandlish T. Henighan Tom B. Brown B. Chess R. Child Scott Gray Alec Radford Jeff Wu Dario Amodei 220 3,054 0 23 Jan 2020
What you can cram into a single vector: Probing sentence embeddings for linguistic properties Alexis Conneau Germán Kruszewski Guillaume Lample Loïc Barrault Marco Baroni 196 876 0 03 May 2018
Efficient Estimation of Word Representations in Vector Space Tomáš Mikolov Kai Chen G. Corrado J. Dean 3DV 228 29,632 0 16 Jan 2013