Compositional Explanations of Neurons

Neural Information Processing Systems (NeurIPS), 2020

24 June 2020

Papers citing "Compositional Explanations of Neurons"

Understanding polysemanticity in neural networks through coding theory

Simon C. Marshall

Jan H. Kirchner

FAtt MILM AAML

186

31 Jan 2024

Rethinking Interpretability in the Era of Large Language Models

300

115

30 Jan 2024

Towards Generating Informative Textual Description for Neurons in Language Models

180

30 Jan 2024

Knowledge-Aware Neuron Interpretation for Scene Classification

AAAI Conference on Artificial Intelligence (AAAI), 2024

193

29 Jan 2024

Black-Box Access is Insufficient for Rigorous AI Audits

Conference on Fairness, Accountability and Transparency (FAccT), 2024

...

Dylan Hadfield-Menell

AAML

562

135

25 Jan 2024

Universal Neurons in GPT2 Language Models

Wes Gurnee

Theo Horsley

Zifan Carl Guo

Tara Rezaei Kheirkhah

348

22 Jan 2024

Explaining the Implicit Neural Canvas: Connecting Pixels to Neurons by Tracing their Contributions

323

18 Jan 2024

Manipulating Feature Visualizations with Gradient Slingshots

406

11 Jan 2024

MAMI: Multi-Attentional Mutual-Information for Long Sequence Neuron Captioning

Alfirsa Damasyifa Fauzulhaq

Wahyu Parwitayasa

Joseph A. Sugihdharma

M. F. Ridhani

N. Yudistira

195

05 Jan 2024

Large Language Models Relearn Removed Concepts

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

223

03 Jan 2024

Concept-based Explainable Artificial Intelligence: A Survey

271

20 Dec 2023

A Glitch in the Matrix? Locating and Detecting Language Model Grounding with Fakepedia

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

505

04 Dec 2023

Adversarial Doodles: Interpretable and Human-drawable Attacks Provide Describable Insights

Ryoya Nara

Yusuke Matsui

AAML

282

27 Nov 2023

Labeling Neural Representations with Inverse Recognition

Neural Information Processing Systems (NeurIPS), 2023

453

22 Nov 2023

Investigating the Encoding of Words in BERT's Neurons using Feature Textualization

BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (BlackboxNLP), 2023

Simon Ostermann

255

14 Nov 2023

Interpreting Pretrained Language Models via Concept Bottlenecks

Huan Liu

236

08 Nov 2023

Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models

379

07 Nov 2023

Towards a fuller understanding of neurons with Clustered Compositional Explanations

Neural Information Processing Systems (NeurIPS), 2023

Biagio La Rosa

Leilani H. Gilpin

Roberto Capobianco

225

27 Oct 2023

Codebook Features: Sparse and Discrete Interpretability for Neural Networks

International Conference on Machine Learning (ICML), 2023

Alex Tamkin

Mohammad Taufeeque

Noah D. Goodman

220

26 Oct 2023

How do Language Models Bind Entities in Context?

International Conference on Learning Representations (ICLR), 2023

Jiahai Feng

Jacob Steinhardt

325

26 Oct 2023

Corrupting Neuron Explanations of Deep Visual Features

IEEE International Conference on Computer Vision (ICCV), 2023

128

25 Oct 2023

From Neural Activations to Concepts: A Survey on Explaining Concepts in Neural Networks

Jae Hee Lee

Sergio Lanza

Stefan Wermter

238

18 Oct 2023

Copy Suppression: Comprehensively Understanding an Attention Head

268

06 Oct 2023

From Language Modeling to Instruction Following: Understanding the Behavior Shift in LLMs after Instruction Tuning

North American Chapter of the Association for Computational Linguistics (NAACL), 2023

Wenlin Yao

Ninghao Liu

Dong Yu

LRM

275

30 Sep 2023

Towards Best Practices of Activation Patching in Language Models: Metrics and Methods

International Conference on Learning Representations (ICLR), 2023

Fred Zhang

Neel Nanda

LLMSV

531

175

27 Sep 2023

Rigorously Assessing Natural Language Explanations of Neurons

BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (BlackboxNLP), 2023

241

19 Sep 2023

FIND: A Function Description Benchmark for Evaluating Interpretability Methods

Neural Information Processing Systems (NeurIPS), 2023

Shuang Li

265

07 Sep 2023

Explainability for Large Language Models: A Survey

ACM Transactions on Intelligent Systems and Technology (ACM TIST), 2023

Haiyan Zhao

Hanjie Chen

Fan Yang

Ninghao Liu

500

710

02 Sep 2023

Emergent Linear Representations in World Models of Self-Supervised Sequence Models

BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (BlackboxNLP), 2023

316

260

02 Sep 2023

Identifying Interpretable Subspaces in Image Representations

International Conference on Machine Learning (ICML), 2023

303

20 Jul 2023

Hierarchical Semantic Tree Concept Whitening for Interpretable Image Classification

Lu Zhang

...

Yanjun Lyu

Changying Li

Ninghao Liu

Tianming Liu

Dajiang Zhu

259

10 Jul 2023

Dear XAI Community, We Need to Talk! Fundamental Misconceptions in Current XAI Research

Timo Freiesleben

Gunnar Konig

157

07 Jun 2023

A Survey on Explainability of Graph Neural Networks

IEEE Data Engineering Bulletin (IEEE Data Eng. Bull.), 2023

252

02 Jun 2023

Neuron to Graph: Interpreting Language Model Neurons at Scale

203

31 May 2023

NeuroX Library for Neuron Analysis of Deep NLP Models

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

Fahim Dalvi

Hassan Sajjad

Nadir Durrani

239

26 May 2023

FICNN: A Framework for the Interpretation of Deep Convolutional Neural Networks

Hamed Behzadi-Khormouji

José Oramas

165

17 May 2023

Finding Neurons in a Haystack: Case Studies with Sparse Probing

540

291

02 May 2023

Towards Automated Circuit Discovery for Mechanistic Interpretability

Neural Information Processing Systems (NeurIPS), 2023

Arthur Conmy

Augustine N. Mavor-Parker

Aengus Lynch

Stefan Heimersheim

Adrià Garriga-Alonso

542

460

28 Apr 2023

Concept-Monitor: Understanding DNN training through individual neurons

Mohammad Ali Khan

Tuomas P. Oikarinen

Tsui-Wei Weng

244

26 Apr 2023

N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models

157

22 Apr 2023

LINe: Out-of-Distribution Detection by Leveraging Important Neurons

Computer Vision and Pattern Recognition (CVPR), 2023

312

24 Mar 2023

Unsupervised Interpretable Basis Extraction for Concept-Based Visual Explanations

IEEE Transactions on Artificial Intelligence (IEEE TAI), 2023

Alexandros Doumanoglou

S. Asteriadis

D. Zarpalas

FAtt SSL

219

19 Mar 2023

Red Teaming Deep Neural Networks with Feature Synthesis Tools

Neural Information Processing Systems (NeurIPS), 2023

Dylan Hadfield-Menell

AAML

406

08 Feb 2023

A Survey of Explainable AI in Deep Visual Modeling: Methods and Metrics

Naveed Akhtar

XAI VLM

202

31 Jan 2023

Evaluating Neuron Interpretation Methods of NLP Models

Neural Information Processing Systems (NeurIPS), 2023

273

30 Jan 2023

Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models

Neural Information Processing Systems (NeurIPS), 2023

348

234

10 Jan 2023

Can Large Language Models Change User Preference Adversarially?

Varshini Subhash

AAML

191

05 Jan 2023

Teaching Matters: Investigating the Role of Supervision in Vision Transformers

Computer Vision and Pattern Recognition (CVPR), 2022

378

07 Dec 2022

What learning algorithm is in-context learning? Investigations with linear models

International Conference on Learning Representations (ICLR), 2022

548

620

28 Nov 2022

Language in a Bottle: Language Model Guided Concept Bottlenecks for Interpretable Image Classification

Computer Vision and Pattern Recognition (CVPR), 2022

406

311

21 Nov 2022