Rigorously Assessing Natural Language Explanations of Neurons

Rigorously Assessing Natural Language Explanations of Neurons

19 September 2023

Jing-ling Huang

Karel DÓosterlinck

Christopher Potts

Papers citing "Rigorously Assessing Natural Language Explanations of Neurons"

12 / 12 papers shown

Title
HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks Jiuding Sun Jing Huang Sidharth Baskaran Karel DÓosterlinck Christopher Potts Michael Sklar Atticus Geiger AI4CE 60 0 0 13 Mar 2025
Discovering Influential Neuron Path in Vision Transformers Yifan Wang Yifei Liu Yingdong Shi C. Li Anqi Pang Sibei Yang Jingyi Yu Kan Ren ViT 69 0 0 12 Mar 2025
Mitigating Memorization In Language Models Mansi Sakarvadia Aswathy Ajith Arham Khan Nathaniel Hudson Caleb Geniesse Kyle Chard Yaoqing Yang Ian Foster Michael W. Mahoney KELM MU 50 0 0 03 Oct 2024
Natural Language Processing RELIES on Linguistics Juri Opitz Shira Wein Nathan Schneider AI4CE 44 7 0 09 May 2024
What does the Knowledge Neuron Thesis Have to do with Knowledge? Jingcheng Niu Andrew Liu Zining Zhu Gerald Penn 36 30 0 03 May 2024
A Multimodal Automated Interpretability Agent Tamar Rott Shaham Sarah Schwettmann Franklin Wang Achyuta Rajaram Evan Hernandez Jacob Andreas Antonio Torralba 26 17 0 22 Apr 2024
Dissecting Recall of Factual Associations in Auto-Regressive Language Models Mor Geva Jasmijn Bastings Katja Filippova Amir Globerson KELM 189 261 0 28 Apr 2023
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations Atticus Geiger Zhengxuan Wu Christopher Potts Thomas F. Icard Noah D. Goodman CML 73 98 0 05 Mar 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small Kevin Wang Alexandre Variengien Arthur Conmy Buck Shlegeris Jacob Steinhardt 210 491 0 01 Nov 2022
In-context Learning and Induction Heads Catherine Olsson Nelson Elhage Neel Nanda Nicholas Joseph Nova Dassarma ... Tom B. Brown Jack Clark Jared Kaplan Sam McCandlish C. Olah 240 456 0 24 Sep 2022
Toy Models of Superposition Nelson Elhage Tristan Hume Catherine Olsson Nicholas Schiefer T. Henighan ... Sam McCandlish Jared Kaplan Dario Amodei Martin Wattenberg C. Olah AAML MILM 120 316 0 21 Sep 2022
Probing Classifiers: Promises, Shortcomings, and Advances Yonatan Belinkov 221 402 0 24 Feb 2021