Interpretability in the Wild: a Circuit for Indirect Object
Identification in GPT-2 small

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

1 November 2022

Alexandre Variengien

Jacob Steinhardt

Papers citing "Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small"

12 / 12 papers shown

Title
A Mathematical Philosophy of Explanations in Mechanistic Interpretability -- The Strange Science Part I.i Kola Ayonrinde Louis Jaburi MILM 10 0 0 01 May 2025
Model Connectomes: A Generational Approach to Data-Efficient Language Models Klemen Kotar Greta Tuckute 26 0 0 29 Apr 2025
Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition Zhengfu He J. Wang Rui Lin Xuyang Ge Wentao Shu Qiong Tang J. Zhang Xipeng Qiu 34 0 0 29 Apr 2025
Prisma: An Open Source Toolkit for Mechanistic Interpretability in Vision and Video Sonia Joseph Praneet Suresh Lorenz Hufe Edward Stevinson Robert Graham Yash Vadi Danilo Bzdok Sebastian Lapuschkin Lee Sharkey Blake A. Richards 38 34 0 28 Apr 2025
Improving Reasoning Performance in Large Language Models via Representation Engineering Bertram Højer Oliver Jarvis Stefan Heinrich LRM 29 16 0 28 Apr 2025
Studying Small Language Models with Susceptibilities Garrett Baker George Wang Jesse Hoogland Daniel Murfet AAML 58 0 0 25 Apr 2025
Do Large Language Models know who did what to whom? Joseph M. Denning Xiaohan Bryor Snefjella Idan A. Blank 22 91 0 23 Apr 2025
Bigram Subnetworks: Mapping to Next Tokens in Transformer Language Models Tyler A. Chang Benjamin Bergen 21 0 0 21 Apr 2025
Towards Interpreting Visual Information Processing in Vision-Language Models Clement Neo Luke Ong Philip H. S. Torr Mor Geva David M. Krueger Fazl Barez 48 6 0 09 Oct 2024
Racing Thoughts: Explaining Contextualization Errors in Large Language Models Michael A. Lepori Michael Mozer Asma Ghandeharioun LRM 40 1 0 02 Oct 2024
In-context Learning and Induction Heads Catherine Olsson Nelson Elhage Neel Nanda Nicholas Joseph Nova Dassarma ... Tom B. Brown Jack Clark Jared Kaplan Sam McCandlish C. Olah 208 326 0 24 Sep 2022
Natural Language Descriptions of Deep Visual Features Evan Hernandez Sarah Schwettmann David Bau Teona Bagashvili Antonio Torralba Jacob Andreas MILM 162 92 0 26 Jan 2022