Localizing Model Behavior with Path Patching

12 April 2023

Nicholas W. Goldowsky-Dill

Papers citing "Localizing Model Behavior with Path Patching"

25 / 75 papers shown

Title | Authors | Tags | Counts | Date
Transformer Mechanisms Mimic Frontostriatal Gating Operations When Trained on Human Working Memory Tasks | Aaron Traylor, Jack Merullo, Michael J. Frank, Ellie Pavlick | - | 32 6 0 | 13 Feb 2024
Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models | Asma Ghandeharioun, Avi Caciularu, Adam Pearce, Lucas Dixon, Mor Geva | - | 27 87 0 | 11 Jan 2024
Observable Propagation: Uncovering Feature Vectors in Transformers | Jacob Dunefsky, Arman Cohan | - | 33 2 0 | 26 Dec 2023
Neuron-Level Knowledge Attribution in Large Language Models | Zeping Yu, Sophia Ananiadou | FAtt, KELM | 19 6 0 | 19 Dec 2023
Successor Heads: Recurring, Interpretable Attention Heads In The Wild | Rhys Gould, Euan Ong, George Ogden, Arthur Conmy | LRM | 13 44 0 | 14 Dec 2023
Look Before You Leap: A Universal Emergent Decomposition of Retrieval Tasks in Language Models | Alexandre Variengien, Eric Winsor | LRM, ReLM | 74 10 0 | 13 Dec 2023
Interpretability Illusions in the Generalization of Simplified Models | Dan Friedman, Andrew Kyle Lampinen, Lucas Dixon, Danqi Chen, Asma Ghandeharioun | - | 17 14 0 | 06 Dec 2023
Do Localization Methods Actually Localize Memorized Data in LLMs? A Tale of Two Benchmarks | Ting-Yun Chang, Jesse Thomason, Robin Jia | - | 15 14 0 | 15 Nov 2023
Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models | Michael Lan, Phillip H. S. Torr, Fazl Barez | LRM | 30 2 0 | 07 Nov 2023
Circuit Component Reuse Across Tasks in Transformer Language Models | Jack Merullo, Carsten Eickhoff, Ellie Pavlick | - | 37 62 0 | 12 Oct 2023
An Adversarial Example for Direct Logit Attribution: Memory Management in gelu-4l | James Dao, Yeu-Tong Lau, Can Rager, Jett Janiak | - | 35 5 0 | 11 Oct 2023
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets | Samuel Marks, Max Tegmark | HILM | 102 168 0 | 10 Oct 2023
Copy Suppression: Comprehensively Understanding an Attention Head | Callum McDougall, Arthur Conmy, Cody Rushing, Thomas McGrath, Neel Nanda | MILM | 23 41 0 | 06 Oct 2023
Discovering Knowledge-Critical Subnetworks in Pretrained Language Models | Deniz Bayazit, Negar Foroutan, Zeming Chen, Gail Weiss, Antoine Bosselut | KELM | 24 13 0 | 04 Oct 2023
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods | Fred Zhang, Neel Nanda | LLMSV | 28 97 0 | 27 Sep 2023
Circuit Breaking: Removing Model Behaviors with Targeted Ablation | Maximilian Li, Xander Davies, Max Nadeau | KELM, MU | 16 27 0 | 12 Sep 2023
Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP | Vedant Palit, Rohan Pandey, Aryaman Arora, Paul Pu Liang | - | 26 20 0 | 27 Aug 2023
Finding Neurons in a Haystack: Case Studies with Sparse Probing | Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, Dimitris Bertsimas | MILM | 155 186 0 | 02 May 2023
How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model | Michael Hanna, Ollie Liu, Alexandre Variengien | LRM | 189 119 0 | 30 Apr 2023
Towards Automated Circuit Discovery for Mechanistic Interpretability | Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, Adrià Garriga-Alonso | - | 18 276 0 | 28 Apr 2023
Computational modeling of semantic change | Nina Tahmasebi, Haim Dubossarsky | - | 26 6 0 | 13 Apr 2023
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations | Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas F. Icard, Noah D. Goodman | CML | 73 98 0 | 05 Mar 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small | Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt | - | 212 494 0 | 01 Nov 2022
Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers | Yi Tay, Mostafa Dehghani, J. Rao, W. Fedus, Samira Abnar, Hyung Won Chung, Sharan Narang, Dani Yogatama, Ashish Vaswani, Donald Metzler | - | 198 110 0 | 22 Sep 2021
Shortformer: Better Language Modeling using Shorter Inputs | Ofir Press, Noah A. Smith, M. Lewis | - | 219 89 0 | 31 Dec 2020