Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic
Interpretability: A Case Study on Othello-GPT

Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic Interpretability: A Case Study on Othello-GPT

19 February 2024

Tianxiang Sun

Xipeng Qiu

Papers citing "Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic Interpretability: A Case Study on Othello-GPT"

8 / 8 papers shown

Title
Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition Zhengfu He J. Wang Rui Lin Xuyang Ge Wentao Shu Qiong Tang J. Zhang Xipeng Qiu 68 0 0 29 Apr 2025
Mechanistic Unveiling of Transformer Circuits: Self-Influence as a Key to Model Reasoning L. Zhang Lijie Hu Di Wang LRM 81 0 0 17 Feb 2025
Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words Gouki Minegishi Hiroki Furuta Yusuke Iwasawa Y. Matsuo 40 1 0 09 Jan 2025
Transformers Use Causal World Models in Maze-Solving Tasks Alex F Spies William Edwards Michael I. Ivanitskiy Adrians Skapars Tilman Rauker Katsumi Inoue A. Russo Murray Shanahan 84 1 0 16 Dec 2024
Knowledge Circuits in Pretrained Transformers Yunzhi Yao Ningyu Zhang Zekun Xi Meng Wang Ziwen Xu Shumin Deng Huajun Chen KELM 45 19 0 28 May 2024
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small Kevin Wang Alexandre Variengien Arthur Conmy Buck Shlegeris Jacob Steinhardt 205 486 0 01 Nov 2022
In-context Learning and Induction Heads Catherine Olsson Nelson Elhage Neel Nanda Nicholas Joseph Nova Dassarma ... Tom B. Brown Jack Clark Jared Kaplan Sam McCandlish C. Olah 234 453 0 24 Sep 2022
Toy Models of Superposition Nelson Elhage Tristan Hume Catherine Olsson Nicholas Schiefer T. Henighan ... Sam McCandlish Jared Kaplan Dario Amodei Martin Wattenberg C. Olah AAML MILM 117 314 0 21 Sep 2022