Cutting Off the Head Ends the Conflict: A Mechanism for Interpreting and
Mitigating Knowledge Conflicts in Language Models

Cutting Off the Head Ends the Conflict: A Mechanism for Interpreting and Mitigating Knowledge Conflicts in Language Models

28 February 2024

Kang Liu

Jun Zhao

Papers citing "Cutting Off the Head Ends the Conflict: A Mechanism for Interpreting and Mitigating Knowledge Conflicts in Language Models"

10 / 10 papers shown

Title
Correcting Negative Bias in Large Language Models through Negative Attention Score Alignment Sangwon Yu Jongyoon Song Bongkyu Hwang Hoyoung Kang Sooah Cho Junhwa Choi Seongho Joe Taehee Lee Youngjune Gwon Sungroh Yoon 69 4 0 31 Jul 2024
Knowledge Circuits in Pretrained Transformers Yunzhi Yao Ningyu Zhang Zekun Xi Meng Wang Ziwen Xu Shumin Deng Huajun Chen KELM 36 19 0 28 May 2024
Tug-of-War Between Knowledge: Exploring and Resolving Knowledge Conflicts in Retrieval-Augmented Language Models Zhuoran Jin Pengfei Cao Yubo Chen Kang Liu Xiaojian Jiang Jiexin Xu Qiuxia Li Jun Zhao 179 10 0 22 Feb 2024
Attention Lens: A Tool for Mechanistically Interpreting the Attention Head Information Retrieval Mechanism Mansi Sakarvadia Arham Khan Aswathy Ajith Daniel Grzenda Nathaniel Hudson André Bauer Kyle Chard Ian T. Foster 164 6 0 25 Oct 2023
Adaptive Chameleon or Stubborn Sloth: Revealing the Behavior of Large Language Models in Knowledge Conflicts Jian Xie Kai Zhang Jiangjie Chen Renze Lou Yu-Chuan Su RALM 198 75 0 22 May 2023
How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model Michael Hanna Ollie Liu Alexandre Variengien LRM 167 116 0 30 Apr 2023
Dissecting Recall of Factual Associations in Auto-Regressive Language Models Mor Geva Jasmijn Bastings Katja Filippova Amir Globerson KELM 180 152 0 28 Apr 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small Kevin Wang Alexandre Variengien Arthur Conmy Buck Shlegeris Jacob Steinhardt 205 486 0 01 Nov 2022
Entity-Based Knowledge Conflicts in Question Answering Shayne Longpre Kartik Perisetla Anthony Chen Nikhil Ramesh Chris DuBois Sameer Singh HILM 230 177 0 10 Sep 2021
Consistent Accelerated Inference via Confident Adaptive Transformers Tal Schuster Adam Fisch Tommi Jaakkola Regina Barzilay AI4TS 177 60 0 18 Apr 2021