2404.14082
Cited By
Mechanistic Interpretability for AI Safety -- A Review
22 April 2024
Leonard Bereska
E. Gavves
AI4CE
Papers citing
"Mechanistic Interpretability for AI Safety -- A Review"
12 / 62 papers shown
Title
Polysemanticity and Capacity in Neural Networks
Adam Scherlis
Kshitij Sachan
Adam Jermyn
Joe Benton
Buck Shlegeris
MILM
133
25
0
04 Oct 2022
Omnigrok: Grokking Beyond Algorithmic Data
Ziming Liu
Eric J. Michaud
Max Tegmark
54
76
0
03 Oct 2022
Disentanglement with Biological Constraints: A Theory of Functional Cell Types
James C. R. Whittington
W. Dorrell
Surya Ganguli
Timothy Edward John Behrens
34
39
0
30 Sep 2022
Causal Proxy Models for Concept-Based Model Explanations
Zhengxuan Wu
Karel D'Oosterlinck
Atticus Geiger
Amir Zur
Christopher Potts
MILM
68
35
0
28 Sep 2022
In-context Learning and Induction Heads
Catherine Olsson
Nelson Elhage
Neel Nanda
Nicholas Joseph
Nova DasSarma
...
Tom B. Brown
Jack Clark
Jared Kaplan
Sam McCandlish
C. Olah
240
453
0
24 Sep 2022
Toy Models of Superposition
Nelson Elhage
Tristan Hume
Catherine Olsson
Nicholas Schiefer
T. Henighan
...
Sam McCandlish
Jared Kaplan
Dario Amodei
Martin Wattenberg
C. Olah
AAML
MILM
120
314
0
21 Sep 2022
A Survey of Machine Unlearning
Thanh Tam Nguyen
T. T. Huynh
Phi Le Nguyen
Alan Wee-Chung Liew
Hongzhi Yin
Quoc Viet Hung Nguyen
MU
77
216
0
06 Sep 2022
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
301
11,730
0
04 Mar 2022
Causal Distillation for Language Models
Zhengxuan Wu
Atticus Geiger
J. Rozner
Elisa Kreiss
Hanson Lu
Thomas F. Icard
Christopher Potts
Noah D. Goodman
81
25
0
05 Dec 2021
Unsolved Problems in ML Safety
Dan Hendrycks
Nicholas Carlini
John Schulman
Jacob Steinhardt
164
268
0
28 Sep 2021
Probing Classifiers: Promises, Shortcomings, and Advances
Yonatan Belinkov
221
402
0
24 Feb 2021
Towards A Rigorous Science of Interpretable Machine Learning
Finale Doshi-Velez
Been Kim
XAI
FaML
225
3,658
0
28 Feb 2017