arXiv:2303.02536
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations
5 March 2023
Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas F. Icard, Noah D. Goodman
CML
Papers citing "Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations" (14 papers shown)

HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks
Jiuding Sun, Jing Huang, Sidharth Baskaran, Karel D'Oosterlinck, Christopher Potts, Michael Sklar, Atticus Geiger
AI4CE · 13 Mar 2025

What is causal about causal models and representations?
Frederik Hytting Jørgensen, Luigi Gresele, S. Weichwald
CML · 31 Jan 2025

ConTrans: Weak-to-Strong Alignment Engineering via Concept Transplantation
Weilong Dong, Xinwei Wu, Renren Jin, Shaoyang Xu, Deyi Xiong
31 Dec 2024

JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit
Zeqing He, Zhibo Wang, Zhixuan Chu, Huiyu Xu, Rui Zheng, Kui Ren, Chun Chen
17 Nov 2024

Mitigating Memorization In Language Models
Mansi Sakarvadia, Aswathy Ajith, Arham Khan, Nathaniel Hudson, Caleb Geniesse, Kyle Chard, Yaoqing Yang, Ian Foster, Michael W. Mahoney
KELM, MU · 03 Oct 2024

A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov, Ziyu Yao
02 Jul 2024

Finding Transformer Circuits with Edge Pruning
Adithya Bhaskar, Alexander Wettig, Dan Friedman, Danqi Chen
24 Jun 2024

Uncovering Intermediate Variables in Transformers using Circuit Probing
Michael A. Lepori, Thomas Serre, Ellie Pavlick
07 Nov 2023

A Geometric Notion of Causal Probing
Clément Guerner, Anej Svete, Tianyu Liu, Alex Warstadt, Ryan Cotterell
LLMSV · 27 Jul 2023

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt
01 Nov 2022

Causal Proxy Models for Concept-Based Model Explanations
Zhengxuan Wu, Karel D'Oosterlinck, Atticus Geiger, Amir Zur, Christopher Potts
MILM · 28 Sep 2022

In-context Learning and Induction Heads
Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova Dassarma, ..., Tom B. Brown, Jack Clark, Jared Kaplan, Sam McCandlish, C. Olah
24 Sep 2022

Towards Faithful Model Explanation in NLP: A Survey
Qing Lyu, Marianna Apidianaki, Chris Callison-Burch
XAI · 22 Sep 2022

Linear Adversarial Concept Erasure
Shauli Ravfogel, Michael Twiton, Yoav Goldberg, Ryan Cotterell
KELM · 28 Jan 2022