Understanding the Role of Individual Units in a Deep Neural Network

Proceedings of the National Academy of Sciences of the United States of America (PNAS), 2020

10 September 2020

Jun-Yan Zhu

Antonio Torralba

Papers citing "Understanding the Role of Individual Units in a Deep Neural Network"

50 / 233 papers shown

The Quest for the Right Mediator: Surveying Mechanistic Interpretability Through the Lens of Causal Mediation Analysis

Computational Linguistics (CL), 2024

...

497

02 Aug 2024

States Hidden in Hidden States: LLMs Emerge Discrete State Representations Implicitly

Junhao Chen

Shengding Hu

Zhiyuan Liu

Maosong Sun

LRM

188

16 Jul 2024

Unveiling the Unseen: Exploring Whitebox Membership Inference through the Lens of Explainability

240

01 Jul 2024

Human-like object concept representations emerge naturally in multimodal large language models

...

466

01 Jul 2024

AND: Audio Network Dissection for Interpreting Deep Acoustic Models

Tung-Yu Wu

Yu-Xiang Lin

Tsui-Wei Weng

361

24 Jun 2024

Beyond Individual Facts: Investigating Categorical Knowledge Locality of Taxonomy and Meronomy Concepts in GPT Models

Christopher Burger

Yifan Hu

Thai Le

KELM

184

22 Jun 2024

LLM-assisted Concept Discovery: Automatically Identifying and Explaining Neuron Functions

N. Hoang-Xuan

Minh Nhat Vu

My T. Thai

218

12 Jun 2024

Interpreting the Second-Order Effects of Neurons in CLIP

442

06 Jun 2024

Iteration Head: A Mechanistic Study of Chain-of-Thought

307

04 Jun 2024

Decomposing and Interpreting Image Representations via Text in ViTs Beyond CLIP

255

03 Jun 2024

From Feature Visualization to Visual Circuits: Effect of Adversarial Model Manipulation

300

03 Jun 2024

Crafting Interpretable Embeddings by Asking LLMs Questions

239

26 May 2024

Pruning for Robust Concept Erasing in Diffusion Models

Tianyun Yang

Juan Cao

Chang Xu

336

26 May 2024

Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories

425

26 May 2024

Error-margin Analysis for Hidden Neuron Activation Labels

International Workshop on Neural-Symbolic Learning and Reasoning (NeSy), 2024

202

14 May 2024

Linear Explanations for Individual Neurons

Tuomas P. Oikarinen

Tsui-Wei Weng

FAtt MILM

259

10 May 2024

Automatic Discovery of Visual Circuits

177

22 Apr 2024

A Multimodal Automated Interpretability Agent

505

22 Apr 2024

On the Value of Labeled Data and Symbolic Methods for Hidden Neuron Activation Analysis

Md Kamruzzaman Sarker

Pascal Hitzler

224

21 Apr 2024

Decomposing and Editing Predictions by Modeling Model Computation

Harshay Shah

Andrew Ilyas

Aleksander Madry

KELM

290

17 Apr 2024

Faster Diffusion via Temporal Attention Decomposition

Juan-Manuel Perez-Rua

Jürgen Schmidhuber

DiffM

503

03 Apr 2024

HOLMES: HOLonym-MEronym based Semantic inspection for Convolutional Image Classifiers

220

13 Mar 2024

Language Models Represent Beliefs of Self and Others

338

28 Feb 2024

Understanding the Role of Pathways in a Deep Neural Network

Lei Lyu

Chen Pang

Jihua Wang

197

28 Feb 2024

Language-Specific Neurons: The Key to Multilingual Capabilities in Large Language Models

347

26 Feb 2024

Explorations of Self-Repair in Language Models

Cody Rushing

Neel Nanda

KELM MILM LRM

197

23 Feb 2024

Advancing Explainable AI Toward Human-Like Intelligence: Forging the Path to Artificial Brain

Yongchen Zhou

Richard Jiang

294

07 Feb 2024

Universal Neurons in GPT2 Language Models

Wes Gurnee

Theo Horsley

Zifan Carl Guo

Tara Rezaei Kheirkhah

338

22 Jan 2024

Edit One for All: Interactive Batch Image Editing

Thao Nguyen

213

18 Jan 2024

Manipulating Feature Visualizations with Gradient Slingshots

386

11 Jan 2024

Fast gradient-free activation maximization for neurons in spiking neural networks

191

28 Dec 2023

Learning from Emergence: A Study on Proactively Inhibiting the Monosemantic Neurons of Artificial Neural Networks

122

17 Dec 2023

Deeper Understanding of Black-box Predictions via Generalized Influence Functions

294

09 Dec 2023

Interpretability Illusions in the Generalization of Simplified Models

358

06 Dec 2023

Data-Centric Digital Agriculture: A Perspective

235

06 Dec 2023

Conceptualizing the Relationship between AI Explanations and User Agency

Iyadunni Adenuga

Jonathan Dodge

171

05 Dec 2023

Finding and Editing Multi-Modal Neurons in Pre-Trained Transformers

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

Xiaozhi Wang

300

13 Nov 2023

Towards a fuller understanding of neurons with Clustered Compositional Explanations

Neural Information Processing Systems (NeurIPS), 2023

Biagio La Rosa

Leilani H. Gilpin

Roberto Capobianco

214

27 Oct 2023

Codebook Features: Sparse and Discrete Interpretability for Neural Networks

International Conference on Machine Learning (ICML), 2023

Alex Tamkin

Mohammad Taufeeque

Noah D. Goodman

207

26 Oct 2023

Corrupting Neuron Explanations of Deep Visual Features

IEEE International Conference on Computer Vision (ICCV), 2023

119

25 Oct 2023

Automated Natural Language Explanation of Deep Visual Neurons with Large Models

AAAI Conference on Artificial Intelligence (AAAI), 2023

Ninghao Liu

134

16 Oct 2023

NeuroInspect: Interpretable Neuron-based Debugging Framework through Class-conditional Visualizations

208

11 Oct 2023

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

Samuel Marks

Max Tegmark

HILM

471

351

10 Oct 2023

Interpreting CLIP's Image Representation via Text-Based Decomposition

International Conference on Learning Representations (ICLR), 2023

477

150

09 Oct 2023

Unlearning with Fisher Masking

211

09 Oct 2023

Semantic Adversarial Attacks via Diffusion Models

British Machine Vision Conference (BMVC), 2023

196

14 Sep 2023

FIND: A Function Description Benchmark for Evaluating Interpretability Methods

Neural Information Processing Systems (NeurIPS), 2023

Shuang Li

257

07 Sep 2023

Emergent Linear Representations in World Models of Self-Supervised Sequence Models

BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (BlackboxNLP), 2023

311

247

02 Sep 2023

Learning to Identify Critical States for Reinforcement Learning from Videos

IEEE International Conference on Computer Vision (ICCV), 2023

Bing Li

276

15 Aug 2023

A Preliminary Study of the Intrinsic Relationship between Complexity and Alignment

International Conference on Language Resources and Evaluation (LREC), 2023

Fei Huang

247

10 Aug 2023