2009.05041
Cited By
Understanding the Role of Individual Units in a Deep Neural Network
Proceedings of the National Academy of Sciences of the United States of America (PNAS), 2020
10 September 2020
David Bau
Jun-Yan Zhu
Hendrik Strobelt
Àgata Lapedriza
Bolei Zhou
Antonio Torralba
GAN
Papers citing
"Understanding the Role of Individual Units in a Deep Neural Network"
50 / 233 papers shown
The Quest for the Right Mediator: Surveying Mechanistic Interpretability Through the Lens of Causal Mediation Analysis
Computational Linguistics (CL), 2024
Aaron Mueller
Jannik Brinkmann
Millicent Li
Samuel Marks
Koyena Pal
...
Arnab Sen Sharma
Jiuding Sun
Eric Todd
David Bau
Yonatan Belinkov
CML
497
34
0
02 Aug 2024
States Hidden in Hidden States: LLMs Emerge Discrete State Representations Implicitly
Junhao Chen
Shengding Hu
Zhiyuan Liu
Maosong Sun
LRM
188
9
0
16 Jul 2024
Unveiling the Unseen: Exploring Whitebox Membership Inference through the Lens of Explainability
Chenxi Li
Abhinav Kumar
Zhen Guo
Jie Hou
R. Tourani
AAML
MIACV
240
4
0
01 Jul 2024
Human-like object concept representations emerge naturally in multimodal large language models
Changde Du
Kaicheng Fu
Bincheng Wen
Yi Sun
Jie Peng
...
Chuncheng Zhang
Jinpeng Li
Shuang Qiu
Le Chang
Huiguang He
466
19
0
01 Jul 2024
AND: Audio Network Dissection for Interpreting Deep Acoustic Models
Tung-Yu Wu
Yu-Xiang Lin
Tsui-Wei Weng
361
3
0
24 Jun 2024
Beyond Individual Facts: Investigating Categorical Knowledge Locality of Taxonomy and Meronomy Concepts in GPT Models
Christopher Burger
Yifan Hu
Thai Le
KELM
184
0
0
22 Jun 2024
LLM-assisted Concept Discovery: Automatically Identifying and Explaining Neuron Functions
N. Hoang-Xuan
Minh Nhat Vu
My T. Thai
218
5
0
12 Jun 2024
Interpreting the Second-Order Effects of Neurons in CLIP
Yossi Gandelsman
Alexei A. Efros
Jacob Steinhardt
MILM
442
32
0
06 Jun 2024
Iteration Head: A Mechanistic Study of Chain-of-Thought
Vivien A. Cabannes
Charles Arnal
Wassim Bouaziz
Alice Yang
Francois Charton
Julia Kempe
LRM
307
27
0
04 Jun 2024
Decomposing and Interpreting Image Representations via Text in ViTs Beyond CLIP
S. Balasubramanian
Samyadeep Basu
Soheil Feizi
CLIP
255
14
0
03 Jun 2024
From Feature Visualization to Visual Circuits: Effect of Adversarial Model Manipulation
Géraldin Nanfack
Michael Eickenberg
Eugene Belilovsky
FAtt
AAML
GNN
300
1
0
03 Jun 2024
Crafting Interpretable Embeddings by Asking LLMs Questions
Vinamra Benara
Chandan Singh
John X. Morris
Richard Antonello
Ion Stoica
Alexander G. Huth
Jianfeng Gao
239
11
0
26 May 2024
Pruning for Robust Concept Erasing in Diffusion Models
Tianyun Yang
Juan Cao
Chang Xu
336
24
0
26 May 2024
Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories
Tianlong Wang
Xianfeng Jiao
Yifan He
Zhongzhi Chen
Yinghao Zhu
Xu Chu
Junyi Gao
Yasha Wang
Liantao Ma
LLMSV
425
51
0
26 May 2024
Error-margin Analysis for Hidden Neuron Activation Labels
International Workshop on Neural-Symbolic Learning and Reasoning (NeSy), 2024
Abhilekha Dalal
R. Rayan
Pascal Hitzler
FAtt
202
1
0
14 May 2024
Linear Explanations for Individual Neurons
Tuomas P. Oikarinen
Tsui-Wei Weng
FAtt
MILM
259
15
0
10 May 2024
Automatic Discovery of Visual Circuits
Achyuta Rajaram
Neil Chowdhury
Antonio Torralba
Jacob Andreas
Sarah Schwettmann
GNN
177
7
0
22 Apr 2024
A Multimodal Automated Interpretability Agent
Tamar Rott Shaham
Sarah Schwettmann
Franklin Wang
Achyuta Rajaram
Evan Hernandez
Jacob Andreas
Antonio Torralba
505
44
0
22 Apr 2024
On the Value of Labeled Data and Symbolic Methods for Hidden Neuron Activation Analysis
Abhilekha Dalal
R. Rayan
Adrita Barua
Eugene Y. Vasserman
Md Kamruzzaman Sarker
Pascal Hitzler
224
11
0
21 Apr 2024
Decomposing and Editing Predictions by Modeling Model Computation
Harshay Shah
Andrew Ilyas
Aleksander Madry
KELM
290
24
0
17 Apr 2024
Faster Diffusion via Temporal Attention Decomposition
Haozhe Liu
Wentian Zhang
Jinheng Xie
Francesco Faccio
Mengmeng Xu
Tao Xiang
Mike Zheng Shou
Juan-Manuel Perez-Rua
Jürgen Schmidhuber
DiffM
503
40
0
03 Apr 2024
HOLMES: HOLonym-MEronym based Semantic inspection for Convolutional Image Classifiers
Francesco Dibitonto
Fabio Garcea
André Panisson
Alan Perotti
Lia Morra
AAML
220
0
0
13 Mar 2024
Language Models Represent Beliefs of Self and Others
Wentao Zhu
Zhining Zhang
Yizhou Wang
MILM
LRM
338
16
0
28 Feb 2024
Understanding the Role of Pathways in a Deep Neural Network
Lei Lyu
Chen Pang
Jihua Wang
197
4
0
28 Feb 2024
Language-Specific Neurons: The Key to Multilingual Capabilities in Large Language Models
Tianyi Tang
Wenyang Luo
Haoyang Huang
Dongdong Zhang
Xiaolei Wang
Xin Zhao
Furu Wei
Ji-Rong Wen
347
92
0
26 Feb 2024
Explorations of Self-Repair in Language Models
Cody Rushing
Neel Nanda
KELM
MILM
LRM
197
18
0
23 Feb 2024
Advancing Explainable AI Toward Human-Like Intelligence: Forging the Path to Artificial Brain
Yongchen Zhou
Richard Jiang
294
5
0
07 Feb 2024
Universal Neurons in GPT2 Language Models
Wes Gurnee
Theo Horsley
Zifan Carl Guo
Tara Rezaei Kheirkhah
Qinyi Sun
Will Hathaway
Neel Nanda
Dimitris Bertsimas
MILM
338
79
0
22 Jan 2024
Edit One for All: Interactive Batch Image Editing
Thao Nguyen
Utkarsh Ojha
Yuheng Li
Haotian Liu
Yong Jae Lee
DiffM
213
5
0
18 Jan 2024
Manipulating Feature Visualizations with Gradient Slingshots
Dilyara Bareeva
Marina M.-C. Höhne
Alexander Warnecke
Lukas Pirch
Klaus-Robert Müller
Konrad Rieck
Sebastian Lapuschkin
Kirill Bykov
AAML
386
6
0
11 Jan 2024
Fast gradient-free activation maximization for neurons in spiking neural networks
N. Pospelov
Andrei Chertkov
Maxim Beketov
Ivan Oseledets
Konstantin Anokhin
191
3
0
28 Dec 2023
Learning from Emergence: A Study on Proactively Inhibiting the Monosemantic Neurons of Artificial Neural Networks
Jiachuan Wang
Hanmo Liu
Lei Chen
Charles Wang Wai Ng
122
6
0
17 Dec 2023
Deeper Understanding of Black-box Predictions via Generalized Influence Functions
Hyeonsu Lyu
Jonggyu Jang
Sehyun Ryu
H. Yang
TDI
AI4CE
294
7
0
09 Dec 2023
Interpretability Illusions in the Generalization of Simplified Models
Dan Friedman
Andrew Kyle Lampinen
Lucas Dixon
Danqi Chen
Asma Ghandeharioun
358
19
0
06 Dec 2023
Data-Centric Digital Agriculture: A Perspective
R. Roscher
Lukas Roth
C. Stachniss
Achim Walter
235
5
0
06 Dec 2023
Conceptualizing the Relationship between AI Explanations and User Agency
Iyadunni Adenuga
Jonathan Dodge
171
4
0
05 Dec 2023
Finding and Editing Multi-Modal Neurons in Pre-Trained Transformers
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Haowen Pan
Yixin Cao
Xiaozhi Wang
Xun Yang
Meng Wang
KELM
300
37
0
13 Nov 2023
Towards a fuller understanding of neurons with Clustered Compositional Explanations
Neural Information Processing Systems (NeurIPS), 2023
Biagio La Rosa
Leilani H. Gilpin
Roberto Capobianco
214
14
0
27 Oct 2023
Codebook Features: Sparse and Discrete Interpretability for Neural Networks
International Conference on Machine Learning (ICML), 2023
Alex Tamkin
Mohammad Taufeeque
Noah D. Goodman
207
40
0
26 Oct 2023
Corrupting Neuron Explanations of Deep Visual Features
IEEE International Conference on Computer Vision (ICCV), 2023
Divyansh Srivastava
Tuomas P. Oikarinen
Tsui-Wei Weng
FAtt
AAML
119
3
0
25 Oct 2023
Automated Natural Language Explanation of Deep Visual Neurons with Large Models
AAAI Conference on Artificial Intelligence (AAAI), 2023
Chenxu Zhao
Wei Qian
Yucheng Shi
Mengdi Huai
Ninghao Liu
134
5
0
16 Oct 2023
NeuroInspect: Interpretable Neuron-based Debugging Framework through Class-conditional Visualizations
Yeong-Joon Ju
Ji-Hoon Park
Seong-Whan Lee
AAML
208
0
0
11 Oct 2023
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
Samuel Marks
Max Tegmark
HILM
471
351
0
10 Oct 2023
Interpreting CLIP's Image Representation via Text-Based Decomposition
International Conference on Learning Representations (ICLR), 2023
Yossi Gandelsman
Alexei A. Efros
Jacob Steinhardt
VLM
477
150
0
09 Oct 2023
Unlearning with Fisher Masking
Yufang Liu
Changzhi Sun
Man Lan
Aimin Zhou
MU
211
9
0
09 Oct 2023
Semantic Adversarial Attacks via Diffusion Models
British Machine Vision Conference (BMVC), 2023
Chenan Wang
Jinhao Duan
Chaowei Xiao
Edward Kim
Matthew C. Stamm
Kaidi Xu
DiffM
196
16
0
14 Sep 2023
FIND: A Function Description Benchmark for Evaluating Interpretability Methods
Neural Information Processing Systems (NeurIPS), 2023
Sarah Schwettmann
Tamar Rott Shaham
Joanna Materzyńska
Neil Chowdhury
Shuang Li
Jacob Andreas
David Bau
Antonio Torralba
257
31
0
07 Sep 2023
Emergent Linear Representations in World Models of Self-Supervised Sequence Models
BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (BlackboxNLP), 2023
Neel Nanda
Andrew Lee
Martin Wattenberg
FAtt
MILM
311
247
0
02 Sep 2023
Learning to Identify Critical States for Reinforcement Learning from Videos
IEEE International Conference on Computer Vision (ICCV), 2023
Haozhe Liu
Mingchen Zhuge
Bing Li
Yu‐Han Wang
Francesco Faccio
Guohao Li
Jürgen Schmidhuber
OffRL
276
14
0
15 Aug 2023
A Preliminary Study of the Intrinsic Relationship between Complexity and Alignment
International Conference on Language Resources and Evaluation (LREC), 2023
Ying Zhao
Yu Bowen
Binyuan Hui
Haiyang Yu
Fei Huang
Yongbin Li
Ningyu Zhang
247
34
0
10 Aug 2023