Mechanistic Interpretability for AI Safety -- A Review

22 April 2024

Papers citing "Mechanistic Interpretability for AI Safety -- A Review"

12 / 62 papers shown

Title
Polysemanticity and Capacity in Neural Networks Adam Scherlis Kshitij Sachan Adam Jermyn Joe Benton Buck Shlegeris MILM 133 25 0 04 Oct 2022
Omnigrok: Grokking Beyond Algorithmic Data Ziming Liu Eric J. Michaud Max Tegmark 54 76 0 03 Oct 2022
Disentanglement with Biological Constraints: A Theory of Functional Cell Types James C. R. Whittington W. Dorrell Surya Ganguli Timothy Edward John Behrens 34 39 0 30 Sep 2022
Causal Proxy Models for Concept-Based Model Explanations Zhengxuan Wu Karel D'Oosterlinck Atticus Geiger Amir Zur Christopher Potts MILM 68 35 0 28 Sep 2022
In-context Learning and Induction Heads Catherine Olsson Nelson Elhage Neel Nanda Nicholas Joseph Nova DasSarma ... Tom B. Brown Jack Clark Jared Kaplan Sam McCandlish C. Olah 240 453 0 24 Sep 2022
Toy Models of Superposition Nelson Elhage Tristan Hume Catherine Olsson Nicholas Schiefer T. Henighan ... Sam McCandlish Jared Kaplan Dario Amodei Martin Wattenberg C. Olah AAML MILM 120 314 0 21 Sep 2022
A Survey of Machine Unlearning Thanh Tam Nguyen T. T. Huynh Phi Le Nguyen Alan Wee-Chung Liew Hongzhi Yin Quoc Viet Hung Nguyen MU 77 216 0 06 Sep 2022
Training language models to follow instructions with human feedback Long Ouyang Jeff Wu Xu Jiang Diogo Almeida Carroll L. Wainwright ... Amanda Askell Peter Welinder Paul Christiano Jan Leike Ryan J. Lowe OSLM ALM 301 11,730 0 04 Mar 2022
Causal Distillation for Language Models Zhengxuan Wu Atticus Geiger J. Rozner Elisa Kreiss Hanson Lu Thomas F. Icard Christopher Potts Noah D. Goodman 81 25 0 05 Dec 2021
Unsolved Problems in ML Safety Dan Hendrycks Nicholas Carlini John Schulman Jacob Steinhardt 164 268 0 28 Sep 2021
Probing Classifiers: Promises, Shortcomings, and Advances Yonatan Belinkov 221 402 0 24 Feb 2021
Towards A Rigorous Science of Interpretable Machine Learning Finale Doshi-Velez Been Kim XAI FaML 225 3,658 0 28 Feb 2017