Localizing Model Behavior with Path Patching

12 April 2023

Nicholas W. Goldowsky-Dill

Papers citing "Localizing Model Behavior with Path Patching"

50 / 75 papers shown

Title
Rethinking Circuit Completeness in Language Models: AND, OR, and ADDER Gates Hang Chen Jiaying Zhu Xinyu Yang Wenya Wang LRM 9 0 0 15 May 2025
Scaling sparse feature circuit finding for in-context learning Dmitrii Kharlapenko S. Kamath S Fazl Barez Arthur Conmy Neel Nanda 26 0 0 18 Apr 2025
Towards Quantifying Commonsense Reasoning with Mechanistic Insights Abhinav Joshi A. Ahmad Divyaksh Shukla Ashutosh Modi ReLM LRM 34 0 0 14 Apr 2025
Combining Causal Models for More Accurate Abstractions of Neural Networks Theodora-Mara Pîslar Sara Magliacane Atticus Geiger AI4CE 50 0 0 14 Mar 2025
Promote, Suppress, Iterate: How Language Models Answer One-to-Many Factual Queries Tianyi Lorena Yan Robin Jia KELM MU 46 0 0 27 Feb 2025
Neuroplasticity and Corruption in Model Mechanisms: A Case Study Of Indirect Object Identification Vishnu Kabir Chhabra Ding Zhu Mohammad Mahdi Khalili 37 2 0 27 Feb 2025
An explainable transformer circuit for compositional generalization Cheng Tang Brenden Lake Mehrdad Jazayeri LRM 39 0 0 19 Feb 2025
Exploring Translation Mechanism of Large Language Models Hongbin Zhang Kehai Chen Xuefeng Bai Xiucheng Li Yang Xiang Min Zhang 59 1 0 17 Feb 2025
An Attempt to Unraveling Token Prediction Refinement and Identifying Essential Layers of Large Language Models Jaturong Kongmanee 34 1 0 28 Jan 2025
Controllable Context Sensitivity and the Knob Behind It Julian Minder Kevin Du Niklas Stoehr Giovanni Monea Chris Wendler Robert West Ryan Cotterell KELM 44 3 0 11 Nov 2024
Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control Yuxin Xiao Chaoqun Wan Yonggang Zhang Wenxiao Wang Binbin Lin Xiaofei He Xu Shen Jieping Ye 24 0 0 04 Nov 2024
CogSteer: Cognition-Inspired Selective Layer Intervention for Efficient Semantic Steering in Large Language Models Xintong Wang Jingheng Pan Longqin Jiang Liang Ding Xingshan Li Chris Biemann LLMSV 29 0 0 23 Oct 2024
Mechanistic Unlearning: Robust Knowledge Unlearning and Editing via Mechanistic Localization Phillip Guo Aaquib Syed Abhay Sheshadri Aidan Ewart Gintare Karolina Dziugaite KELM MU 31 5 0 16 Oct 2024
Locking Down the Finetuned LLMs Safety Minjun Zhu Linyi Yang Yifan Wei Ningyu Zhang Yue Zhang 34 8 0 14 Oct 2024
The Same But Different: Structural Similarities and Differences in Multilingual Language Modeling Ruochen Zhang Qinan Yu Matianyu Zang Carsten Eickhoff Ellie Pavlick 45 1 0 11 Oct 2024
Jet Expansions of Residual Computation Yihong Chen Xiangxiang Xu Yao Lu Pontus Stenetorp Luca Franceschi 28 2 0 08 Oct 2024
How Language Models Prioritize Contextual Grammatical Cues? Hamidreza Amirzadeh A. Alishahi Hosein Mohebbi 21 0 0 04 Oct 2024
Differentiation and Specialization of Attention Heads via the Refined Local Learning Coefficient George Wang Jesse Hoogland Stan van Wingerden Zach Furman Daniel Murfet OffRL 18 7 0 03 Oct 2024
Sparse Attention Decomposition Applied to Circuit Tracing Gabriel Franco Mark Crovella 31 0 0 01 Oct 2024
Optimal ablation for interpretability Maximilian Li Lucas Janson FAtt 44 2 0 16 Sep 2024
Interpreting and Improving Large Language Models in Arithmetic Calculation Wei Zhang Chaoqun Wan Yonggang Zhang Yiu-ming Cheung Xinmei Tian Xu Shen Jieping Ye LRM 24 18 0 03 Sep 2024
Multimodal Contrastive In-Context Learning Yosuke Miyanishi Minh Le Nguyen 32 2 0 23 Aug 2024
Personality Alignment of Large Language Models Minjun Zhu Linyi Yang Yue Zhang Yue Zhang ALM 57 5 0 21 Aug 2024
The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability Aaron Mueller Jannik Brinkmann Millicent Li Samuel Marks Koyena Pal ... Arnab Sen Sharma Jiuding Sun Eric Todd David Bau Yonatan Belinkov CML 42 18 0 02 Aug 2024
Transformers on Markov Data: Constant Depth Suffices Nived Rajaraman Marco Bondaschi Kannan Ramchandran Michael C. Gastpar Ashok Vardhan Makkuva 37 4 0 25 Jul 2024
Knowledge Mechanisms in Large Language Models: A Survey and Perspective Meng Wang Yunzhi Yao Ziwen Xu Shuofei Qiao Shumin Deng ... Yong-jia Jiang Pengjun Xie Fei Huang Huajun Chen Ningyu Zhang 47 28 0 22 Jul 2024
LLM Circuit Analyses Are Consistent Across Training and Scale Curt Tigges Michael Hanna Qinan Yu Stella Biderman 31 10 0 15 Jul 2024
Transformer Circuit Faithfulness Metrics are not Robust Joseph Miller Bilal Chughtai William Saunders 45 7 0 11 Jul 2024
Missed Causes and Ambiguous Effects: Counterfactuals Pose Challenges for Interpreting Neural Networks Aaron Mueller CML 28 10 0 05 Jul 2024
Functional Faithfulness in the Wild: Circuit Discovery with Differentiable Computation Graph Pruning Lei Yu Jingcheng Niu Zining Zhu Gerald Penn 36 5 0 04 Jul 2024
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models Daking Rai Yilun Zhou Shi Feng Abulhair Saparov Ziyu Yao 75 19 0 02 Jul 2024
Interpreting Attention Layer Outputs with Sparse Autoencoders Connor Kissane Robert Krzyzanowski Joseph Isaac Bloom Arthur Conmy Neel Nanda MILM 26 17 0 25 Jun 2024
Transformer Normalisation Layers and the Independence of Semantic Subspaces S. Menary Samuel Kaski Andre Freitas 44 2 0 25 Jun 2024
Finding Transformer Circuits with Edge Pruning Adithya Bhaskar Alexander Wettig Dan Friedman Danqi Chen 58 16 0 24 Jun 2024
When Parts are Greater Than Sums: Individual LLM Components Can Outperform Full Models Ting-Yun Chang Jesse Thomason Robin Jia 40 4 0 19 Jun 2024
Transcoders Find Interpretable LLM Feature Circuits Jacob Dunefsky Philippe Chlenski Neel Nanda 22 21 0 17 Jun 2024
Talking Heads: Understanding Inter-layer Communication in Transformer Language Models Jack Merullo Carsten Eickhoff Ellie Pavlick 56 13 0 13 Jun 2024
Knowledge Circuits in Pretrained Transformers Yunzhi Yao Ningyu Zhang Zekun Xi Meng Wang Ziwen Xu Shumin Deng Huajun Chen KELM 64 20 0 28 May 2024
InversionView: A General-Purpose Method for Reading Information from Neural Activations Xinting Huang Madhur Panwar Navin Goyal Michael Hahn 26 3 0 27 May 2024
Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models Charles O'Neill Thang Bui 30 5 0 21 May 2024
The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks Lucius Bushnaq Stefan Heimersheim Nicholas Goldowsky-Dill Dan Braun Jake Mendel Kaarel Hänni Avery Griffin Jörn Stöhler Magdalena Wache Marius Hobbhahn FAtt 33 3 0 17 May 2024
How to use and interpret activation patching Stefan Heimersheim Neel Nanda 30 37 0 23 Apr 2024
Automatic Discovery of Visual Circuits Achyuta Rajaram Neil Chowdhury Antonio Torralba Jacob Andreas Sarah Schwettmann GNN 24 3 0 22 Apr 2024
Mechanistic Interpretability for AI Safety -- A Review Leonard Bereska E. Gavves AI4CE 40 111 0 22 Apr 2024
pyvene: A Library for Understanding and Improving PyTorch Models via Interventions Zhengxuan Wu Atticus Geiger Aryaman Arora Jing-ling Huang Zheng Wang Noah D. Goodman Christopher D. Manning Christopher Potts MU 44 25 0 12 Mar 2024
AtP*: An efficient and scalable method for localizing LLM behaviour to components János Kramár Tom Lieberum Rohin Shah Neel Nanda KELM 43 42 0 01 Mar 2024
Cutting Off the Head Ends the Conflict: A Mechanism for Interpreting and Mitigating Knowledge Conflicts in Language Models Zhuoran Jin Pengfei Cao Hongbang Yuan Yubo Chen Jiexin Xu Huaijun Li Xiaojian Jiang Kang Liu Jun Zhao 180 34 0 28 Feb 2024
RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations Jing-ling Huang Zhengxuan Wu Christopher Potts Mor Geva Atticus Geiger 57 26 0 27 Feb 2024
Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking Nikhil Prakash Tamar Rott Shaham Tal Haklay Yonatan Belinkov David Bau 41 52 0 22 Feb 2024
CausalGym: Benchmarking causal interpretability methods on linguistic tasks Aryaman Arora Daniel Jurafsky Christopher Potts 50 21 0 19 Feb 2024