2304.05969
Localizing Model Behavior with Path Patching
12 April 2023
Nicholas W. Goldowsky-Dill
Chris MacLeod
L. Sato
Aryaman Arora
Papers citing
"Localizing Model Behavior with Path Patching"
25 / 75 papers shown
Title
Transformer Mechanisms Mimic Frontostriatal Gating Operations When Trained on Human Working Memory Tasks
Aaron Traylor
Jack Merullo
Michael J. Frank
Ellie Pavlick
32
6
0
13 Feb 2024
Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models
Asma Ghandeharioun
Avi Caciularu
Adam Pearce
Lucas Dixon
Mor Geva
27
87
0
11 Jan 2024
Observable Propagation: Uncovering Feature Vectors in Transformers
Jacob Dunefsky
Arman Cohan
33
2
0
26 Dec 2023
Neuron-Level Knowledge Attribution in Large Language Models
Zeping Yu
Sophia Ananiadou
FAtt
KELM
19
6
0
19 Dec 2023
Successor Heads: Recurring, Interpretable Attention Heads In The Wild
Rhys Gould
Euan Ong
George Ogden
Arthur Conmy
LRM
13
44
0
14 Dec 2023
Look Before You Leap: A Universal Emergent Decomposition of Retrieval Tasks in Language Models
Alexandre Variengien
Eric Winsor
LRM
ReLM
74
10
0
13 Dec 2023
Interpretability Illusions in the Generalization of Simplified Models
Dan Friedman
Andrew Kyle Lampinen
Lucas Dixon
Danqi Chen
Asma Ghandeharioun
17
14
0
06 Dec 2023
Do Localization Methods Actually Localize Memorized Data in LLMs? A Tale of Two Benchmarks
Ting-Yun Chang
Jesse Thomason
Robin Jia
15
14
0
15 Nov 2023
Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models
Michael Lan
Phillip H. S. Torr
Fazl Barez
LRM
30
2
0
07 Nov 2023
Circuit Component Reuse Across Tasks in Transformer Language Models
Jack Merullo
Carsten Eickhoff
Ellie Pavlick
37
62
0
12 Oct 2023
An Adversarial Example for Direct Logit Attribution: Memory Management in gelu-4l
James Dao
Yeu-Tong Lau
Can Rager
Jett Janiak
35
5
0
11 Oct 2023
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
Samuel Marks
Max Tegmark
HILM
102
168
0
10 Oct 2023
Copy Suppression: Comprehensively Understanding an Attention Head
Callum McDougall
Arthur Conmy
Cody Rushing
Thomas McGrath
Neel Nanda
MILM
23
41
0
06 Oct 2023
Discovering Knowledge-Critical Subnetworks in Pretrained Language Models
Deniz Bayazit
Negar Foroutan
Zeming Chen
Gail Weiss
Antoine Bosselut
KELM
24
13
0
04 Oct 2023
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods
Fred Zhang
Neel Nanda
LLMSV
28
97
0
27 Sep 2023
Circuit Breaking: Removing Model Behaviors with Targeted Ablation
Maximilian Li
Xander Davies
Max Nadeau
KELM
MU
16
27
0
12 Sep 2023
Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP
Vedant Palit
Rohan Pandey
Aryaman Arora
Paul Pu Liang
26
20
0
27 Aug 2023
Finding Neurons in a Haystack: Case Studies with Sparse Probing
Wes Gurnee
Neel Nanda
Matthew Pauly
Katherine Harvey
Dmitrii Troitskii
Dimitris Bertsimas
MILM
155
186
0
02 May 2023
How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model
Michael Hanna
Ollie Liu
Alexandre Variengien
LRM
189
119
0
30 Apr 2023
Towards Automated Circuit Discovery for Mechanistic Interpretability
Arthur Conmy
Augustine N. Mavor-Parker
Aengus Lynch
Stefan Heimersheim
Adrià Garriga-Alonso
18
276
0
28 Apr 2023
Computational modeling of semantic change
Nina Tahmasebi
Haim Dubossarsky
26
6
0
13 Apr 2023
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations
Atticus Geiger
Zhengxuan Wu
Christopher Potts
Thomas F. Icard
Noah D. Goodman
CML
73
98
0
05 Mar 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang
Alexandre Variengien
Arthur Conmy
Buck Shlegeris
Jacob Steinhardt
212
494
0
01 Nov 2022
Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers
Yi Tay
Mostafa Dehghani
J. Rao
W. Fedus
Samira Abnar
Hyung Won Chung
Sharan Narang
Dani Yogatama
Ashish Vaswani
Donald Metzler
198
110
0
22 Sep 2021
Shortformer: Better Language Modeling using Shorter Inputs
Ofir Press
Noah A. Smith
M. Lewis
219
89
0
31 Dec 2020