Towards Best Practices of Activation Patching in Language Models: Metrics and Methods

International Conference on Learning Representations (ICLR), 2023

27 September 2023

Fred Zhang

Neel Nanda

LLMSV

Papers citing "Towards Best Practices of Activation Patching in Language Models: Metrics and Methods"

27 / 127 papers shown

Title
No Two Devils Alike: Unveiling Distinct Mechanisms of Fine-tuning Attacks Chak Tou Leong Yi Cheng Kaishuai Xu Jian Wang Hanlin Wang Wenjie Li AAML 332 28 0 25 May 2024
Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization Boshi Wang Xiang Yue Yu-Chuan Su Huan Sun LRM 328 72 0 23 May 2024
Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models Charles O'Neill Thang Bui 178 12 0 21 May 2024
A Philosophical Introduction to Language Models - Part II: The Way Forward Raphael Milliere Cameron Buckner LRM 238 24 0 06 May 2024
How to use and interpret activation patching Stefan Heimersheim Neel Nanda 208 93 0 23 Apr 2024
Mechanistic Interpretability for AI Safety -- A Review Leonard Bereska E. Gavves AI4CE 316 288 0 22 Apr 2024
Decomposing and Editing Predictions by Modeling Model Computation Harshay Shah Andrew Ilyas Aleksander Madry KELM 270 23 0 17 Apr 2024
Finding Visual Task Vectors Alberto Hojel Yutong Bai Trevor Darrell Amir Globerson Amir Bar 228 14 0 08 Apr 2024
Locating and Editing Factual Associations in Mamba Arnab Sen Sharma David Atkinson David Bau KELM 210 37 0 04 Apr 2024
Unveiling LLMs: The Evolution of Latent Representations in a Temporal Knowledge Graph Marco Bronzini Carlo Nicolini Bruno Lepri Jacopo Staiano Baptiste Caramiaux KELM 164 0 0 04 Apr 2024
On Large Language Models' Hallucination with Regard to Known Facts Che Jiang Biqing Qi Xiangyu Hong Dayuan Fu Yang Cheng Fandong Meng Mo Yu Bowen Zhou Jie Zhou HILM LRM 236 42 0 29 Mar 2024
Localizing Paragraph Memorization in Language Models Niklas Stoehr Mitchell Gordon Chiyuan Zhang Owen Lewis MU 179 24 0 28 Mar 2024
Interpreting Key Mechanisms of Factual Recall in Transformer-Based Language Models Ang Lv Yuhan Chen Kaiyi Zhang Yulong Wang Lifeng Liu Ji-Rong Wen Jian Xie Rui Yan KELM 273 23 0 28 Mar 2024
Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms Michael Hanna Sandro Pezzelle Yonatan Belinkov 264 76 0 26 Mar 2024
Monotonic Representation of Numeric Properties in Language Models Benjamin Heinzerling Kentaro Inui KELM MILM 204 12 0 15 Mar 2024
Diffusion Lens: Interpreting Text Encoders in Text-to-Image Pipelines Annual Meeting of the Association for Computational Linguistics (ACL), 2024 Michael Toker Hadas Orgad Mor Ventura Dana Arad Yonatan Belinkov DiffM 253 20 0 09 Mar 2024
The Heuristic Core: Understanding Subnetwork Generalization in Pretrained Language Models Adithya Bhaskar Dan Friedman Danqi Chen 337 9 0 06 Mar 2024
How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning Subhabrata Dutta Joykirat Singh Soumen Chakrabarti Tanmoy Chakraborty LRM 177 47 0 28 Feb 2024
Cutting Off the Head Ends the Conflict: A Mechanism for Interpreting and Mitigating Knowledge Conflicts in Language Models Zhuoran Jin Pengfei Cao Hongbang Yuan Yubo Chen Jiexin Xu Huaijun Li Xiaojian Jiang Kang Liu Jun Zhao 497 68 0 28 Feb 2024
Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic Interpretability: A Case Study on Othello-GPT Zhengfu He Xuyang Ge Qiong Tang Tianxiang Sun Qinyuan Cheng Xipeng Qiu 199 25 0 19 Feb 2024
Learning Interpretable Concepts: Unifying Causal Representation Learning and Foundation Models Goutham Rajendran Simon Buchholz Bryon Aragam Bernhard Schölkopf Pradeep Ravikumar AI4CE 374 29 0 14 Feb 2024
Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models International Conference on Machine Learning (ICML), 2024 Asma Ghandeharioun Avi Caciularu Adam Pearce Lucas Dixon Mor Geva 599 156 0 11 Jan 2024
Neuron-Level Knowledge Attribution in Large Language Models Zeping Yu Sophia Ananiadou FAtt KELM 250 28 0 19 Dec 2023
Forbidden Facts: An Investigation of Competing Objectives in Llama-2 Tony T. Wang Miles Wang Kaivu Hariharan Nir Shavit 139 2 0 14 Dec 2023
An Adversarial Example for Direct Logit Attribution: Memory Management in gelu-4l BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (BlackboxNLP), 2023 James Dao Yeu-Tong Lau Can Rager Jett Janiak 315 5 0 11 Oct 2023
Polysemanticity and Capacity in Neural Networks Adam Scherlis Kshitij Sachan Adam Jermyn Joe Benton Buck Shlegeris MILM 528 48 0 04 Oct 2022
Discovering the Compositional Structure of Vector Representations with Role Learning Networks BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (BlackboxNLP), 2019 Paul Soulos R. Thomas McCoy Tal Linzen P. Smolensky CoGe 331 46 0 21 Oct 2019