Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2211.00593
Cited By
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
1 November 2022
Kevin Wang
Alexandre Variengien
Arthur Conmy
Buck Shlegeris
Jacob Steinhardt
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small"
15 / 15 papers shown
Title
Geospatial Mechanistic Interpretability of Large Language Models
Stef De Sabbata
Stefano Mizzaro
Kevin Roitero
AI4CE
15
63
0
06 May 2025
Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability -- The Strange Science Part I.ii
Kola Ayonrinde
Louis Jaburi
XAI
21
77
0
02 May 2025
A Mathematical Philosophy of Explanations in Mechanistic Interpretability -- The Strange Science Part I.i
Kola Ayonrinde
Louis Jaburi
MILM
47
124
0
01 May 2025
Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition
Zhengfu He
J. Wang
Rui Lin
Xuyang Ge
Wentao Shu
Qiong Tang
J. Zhang
Xipeng Qiu
53
29
0
29 Apr 2025
Model Connectomes: A Generational Approach to Data-Efficient Language Models
Klemen Kotar
Greta Tuckute
33
60
0
29 Apr 2025
Prisma: An Open Source Toolkit for Mechanistic Interpretability in Vision and Video
Sonia Joseph
Praneet Suresh
Lorenz Hufe
Edward Stevinson
Robert Graham
Yash Vadi
Danilo Bzdok
Sebastian Lapuschkin
Lee Sharkey
Blake A. Richards
54
34
0
28 Apr 2025
Improving Reasoning Performance in Large Language Models via Representation Engineering
Bertram Højer
Oliver Jarvis
Stefan Heinrich
LRM
53
16
0
28 Apr 2025
Studying Small Language Models with Susceptibilities
Garrett Baker
George Wang
Jesse Hoogland
Daniel Murfet
AAML
65
29
0
25 Apr 2025
Do Large Language Models know who did what to whom?
Joseph M. Denning
Xiaohan
Bryor Snefjella
Idan A. Blank
33
91
0
23 Apr 2025
Bigram Subnetworks: Mapping to Next Tokens in Transformer Language Models
Tyler A. Chang
Benjamin Bergen
30
40
0
21 Apr 2025
Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models
Thomas Winninger
Boussad Addad
Katarzyna Kapusta
AAML
46
54
0
08 Mar 2025
Towards Interpreting Visual Information Processing in Vision-Language Models
Clement Neo
Luke Ong
Philip H. S. Torr
Mor Geva
David M. Krueger
Fazl Barez
64
45
0
09 Oct 2024
Racing Thoughts: Explaining Contextualization Errors in Large Language Models
Michael A. Lepori
Michael Mozer
Asma Ghandeharioun
LRM
53
38
0
02 Oct 2024
In-context Learning and Induction Heads
Catherine Olsson
Nelson Elhage
Neel Nanda
Nicholas Joseph
Nova Dassarma
...
Tom B. Brown
Jack Clark
Jared Kaplan
Sam McCandlish
C. Olah
226
326
0
24 Sep 2022
Natural Language Descriptions of Deep Visual Features
Evan Hernandez
Sarah Schwettmann
David Bau
Teona Bagashvili
Antonio Torralba
Jacob Andreas
MILM
174
92
0
26 Jan 2022
1