Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

1 November 2022

Alexandre Variengien

Jacob Steinhardt

Papers citing "Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small"

Geospatial Mechanistic Interpretability of Large Language Models Stef De Sabbata Stefano Mizzaro Kevin Roitero AI4CE 15 63 0 06 May 2025
Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability -- The Strange Science Part I.ii Kola Ayonrinde Louis Jaburi XAI 21 77 0 02 May 2025
A Mathematical Philosophy of Explanations in Mechanistic Interpretability -- The Strange Science Part I.i Kola Ayonrinde Louis Jaburi MILM 47 124 0 01 May 2025
Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition Zhengfu He J. Wang Rui Lin Xuyang Ge Wentao Shu Qiong Tang J. Zhang Xipeng Qiu 53 29 0 29 Apr 2025
Model Connectomes: A Generational Approach to Data-Efficient Language Models Klemen Kotar Greta Tuckute 33 60 0 29 Apr 2025
Prisma: An Open Source Toolkit for Mechanistic Interpretability in Vision and Video Sonia Joseph Praneet Suresh Lorenz Hufe Edward Stevinson Robert Graham Yash Vadi Danilo Bzdok Sebastian Lapuschkin Lee Sharkey Blake A. Richards 54 34 0 28 Apr 2025
Improving Reasoning Performance in Large Language Models via Representation Engineering Bertram Højer Oliver Jarvis Stefan Heinrich LRM 53 16 0 28 Apr 2025
Studying Small Language Models with Susceptibilities Garrett Baker George Wang Jesse Hoogland Daniel Murfet AAML 65 29 0 25 Apr 2025
Do Large Language Models know who did what to whom? Joseph M. Denning Xiaohan Bryor Snefjella Idan A. Blank 33 91 0 23 Apr 2025
Bigram Subnetworks: Mapping to Next Tokens in Transformer Language Models Tyler A. Chang Benjamin Bergen 30 40 0 21 Apr 2025
Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models Thomas Winninger Boussad Addad Katarzyna Kapusta AAML 46 54 0 08 Mar 2025
Towards Interpreting Visual Information Processing in Vision-Language Models Clement Neo Luke Ong Philip H. S. Torr Mor Geva David M. Krueger Fazl Barez 64 45 0 09 Oct 2024
Racing Thoughts: Explaining Contextualization Errors in Large Language Models Michael A. Lepori Michael Mozer Asma Ghandeharioun LRM 53 38 0 02 Oct 2024
In-context Learning and Induction Heads Catherine Olsson Nelson Elhage Neel Nanda Nicholas Joseph Nova Dassarma ... Tom B. Brown Jack Clark Jared Kaplan Sam McCandlish C. Olah 226 326 0 24 Sep 2022
Natural Language Descriptions of Deep Visual Features Evan Hernandez Sarah Schwettmann David Bau Teona Bagashvili Antonio Torralba Jacob Andreas MILM 174 92 0 26 Jan 2022