Title
Towards Uncovering How Large Language Model Works: An Explainability Perspective Haiyan Zhao Fan Yang Bo Shen Himabindu Lakkaraju Mengnan Du 30 10 0 16 Feb 2024
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small Kevin Wang Alexandre Variengien Arthur Conmy Buck Shlegeris Jacob Steinhardt 207 486 0 01 Nov 2022
Omnigrok: Grokking Beyond Algorithmic Data Ziming Liu Eric J. Michaud Max Tegmark 54 76 0 03 Oct 2022
In-context Learning and Induction Heads Catherine Olsson Nelson Elhage Neel Nanda Nicholas Joseph Nova Dassarma ... Tom B. Brown Jack Clark Jared Kaplan Sam McCandlish C. Olah 237 453 0 24 Sep 2022
Toy Models of Superposition Nelson Elhage Tristan Hume Catherine Olsson Nicholas Schiefer T. Henighan ... Sam McCandlish Jared Kaplan Dario Amodei Martin Wattenberg C. Olah AAML MILM 117 314 0 21 Sep 2022
Probing Classifiers: Promises, Shortcomings, and Advances Yonatan Belinkov 221 291 0 24 Feb 2021
Scaling Laws for Neural Language Models Jared Kaplan Sam McCandlish T. Henighan Tom B. Brown B. Chess R. Child Scott Gray Alec Radford Jeff Wu Dario Amodei 220 3,054 0 23 Jan 2020
Towards A Rigorous Science of Interpretable Machine Learning Finale Doshi-Velez Been Kim XAI FaML 219 3,658 0 28 Feb 2017
Demixed principal component analysis of population activity in higher cortical areas reveals independent representation of task parameters D. Kobak Wieland Brendel C. Constantinidis C. Feierstein Adam Kepecs Z. Mainen R. Romo Xue-Lian Qi N. Uchida C. Machens 27 462 0 22 Oct 2014