An Adversarial Example for Direct Logit Attribution: Memory Management in gelu-4l

11 October 2023

Papers citing "An Adversarial Example for Direct Logit Attribution: Memory Management in gelu-4l"


Title | Authors | Tag | Counts | Date
Attribution Patching Outperforms Automated Circuit Discovery | Aaquib Syed, Can Rager, Arthur Conmy | | 55 53 0 | 16 Oct 2023
Finding Neurons in a Haystack: Case Studies with Sparse Probing | Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, Dimitris Bertsimas | MILM | 153 186 0 | 02 May 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small | Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt | | 210 486 0 | 01 Nov 2022
Natural Language Descriptions of Deep Visual Features | Evan Hernandez, Sarah Schwettmann, David Bau, Teona Bagashvili, Antonio Torralba, Jacob Andreas | MILM | 194 116 0 | 26 Jan 2022