How to use and interpret activation patching

23 April 2024

Papers citing "How to use and interpret activation patching"

11 / 11 papers shown

Title
Are We Paying Attention to Her? Investigating Gender Disambiguation and Attention in Machine Translation Chiara Manna Afra Alishahi Frédéric Blain Eva Vanmassenhove 22 0 0 13 May 2025
(How) Do Language Models Track State? Belinda Z. Li Zifan Carl Guo Jacob Andreas LRM 44 0 0 04 Mar 2025
Elucidating Mechanisms of Demographic Bias in LLMs for Healthcare Hiba Ahsan Arnab Sen Sharma Silvio Amir David Bau Byron C. Wallace 80 0 0 20 Feb 2025
Exploring Translation Mechanism of Large Language Models Hongbin Zhang Kehai Chen Xuefeng Bai Xiucheng Li Yang Xiang Min Zhang 57 1 0 17 Feb 2025
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models Javier Ferrando Oscar Obeso Senthooran Rajamanoharan Neel Nanda 75 10 0 21 Nov 2024
Racing Thoughts: Explaining Contextualization Errors in Large Language Models Michael A. Lepori Michael Mozer Asma Ghandeharioun LRM 80 1 0 02 Oct 2024
A Mechanistic Interpretation of Syllogistic Reasoning in Auto-Regressive Language Models Geonhee Kim Marco Valentino André Freitas LRM AI4CE 28 7 0 16 Aug 2024
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models Daking Rai Yilun Zhou Shi Feng Abulhair Saparov Ziyu Yao 75 19 0 02 Jul 2024
How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model Michael Hanna Ollie Liu Alexandre Variengien LRM 189 119 0 30 Apr 2023
Dissecting Recall of Factual Associations in Auto-Regressive Language Models Mor Geva Jasmijn Bastings Katja Filippova Amir Globerson KELM 189 261 0 28 Apr 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small Kevin Wang Alexandre Variengien Arthur Conmy Buck Shlegeris Jacob Steinhardt 210 494 0 01 Nov 2022