Compositional Explanations of Neurons
Neural Information Processing Systems (NeurIPS), 2020
24 June 2020
Jesse Mu
Jacob Andreas
FAtt
CoGe
MILM
Papers citing "Compositional Explanations of Neurons" (50 / 146 papers shown)
Guaranteed Optimal Compositional Explanations for Neurons
Biagio La Rosa
Leilani H. Gilpin
80
0
0
25 Nov 2025
Open Vocabulary Compositional Explanations for Neuron Alignment
Biagio La Rosa
Leilani H. Gilpin
OCL
339
0
0
25 Nov 2025
Where Culture Fades: Revealing the Cultural Gap in Text-to-Image Generation
Chuancheng Shi
Shangze Li
Shiming Guo
Simiao Xie
Wenhua Wu
...
Canran Xiao
Cong Wang
Zifeng Cheng
Fei Shen
Tat-Seng Chua
VLM
228
0
0
21 Nov 2025
Automated Detection of Visual Attribute Reliance with a Self-Reflective Agent
Christy Li
Josep Lopez Camunas
Jake Thomas Touchet
Jacob Andreas
Àgata Lapedriza
Antonio Torralba
Tamar Rott Shaham
197
0
0
24 Oct 2025
Programmatic Representation Learning with Language Models
Gabriel Poesia
Georgia Gabriela Sampaio
87
0
0
16 Oct 2025
Interpreting Language Models Through Concept Descriptions: A Survey
Nils Feldhus
Laura Kopf
MILM
154
0
0
01 Oct 2025
Negative Pre-activations Differentiate Syntax
Linghao Kong
Angelina Ning
Micah Adler
Nir Shavit
127
0
0
29 Sep 2025
NeuroStrike: Neuron-Level Attacks on Aligned LLMs
Lichao Wu
Sasha Behrouzi
Mohamadreza Rostami
Maximilian Thang
S. Picek
A. Sadeghi
AAML
270
1
0
15 Sep 2025
On the Performance of Concept Probing: The Influence of the Data (Extended Version)
Manuel de Sousa Ribeiro
Afonso Leote
João Leite
197
1
0
24 Jul 2025
Concept Probing: Where to Find Human-Defined Concepts (Extended Version)
Manuel de Sousa Ribeiro
Afonso Leote
João Leite
189
1
0
24 Jul 2025
Capturing Polysemanticity with PRISM: A Multi-Concept Feature Description Framework
Laura Kopf
Nils Feldhus
Kirill Bykov
P. Bommer
Anna Hedström
Marina M.-C. Höhne
Oliver Eberle
409
4
0
18 Jun 2025
Evaluating Neuron Explanations: A Unified Framework with Sanity Checks
Tuomas P. Oikarinen
Ge Yan
Tsui-Wei Weng
FAtt
XAI
175
7
0
06 Jun 2025
Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors
Jing Huang
Junyi Tao
Thomas Icard
Diyi Yang
Christopher Potts
OODD
454
4
0
17 May 2025
Disentangling Polysemantic Channels in Convolutional Neural Networks
Robin Hesse
Jonas Fischer
Simone Schaub-Meyer
Stefan Roth
FAtt
MILM
270
3
0
17 Apr 2025
Following the Whispers of Values: Unraveling Neural Mechanisms Behind Value-Oriented Behaviors in LLMs
Ling Hu
Yuemei Xu
Xiaoyang Gu
Letao Han
389
1
0
07 Apr 2025
HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks
International Conference on Learning Representations (ICLR), 2025
Jiuding Sun
Jing Huang
Sidharth Baskaran
Karel D'Oosterlinck
Christopher Potts
Michael Sklar
Atticus Geiger
AI4CE
430
5
0
13 Mar 2025
Steered Generation via Gradient Descent on Sparse Features
Sumanta Bhattacharyya
Pedram Rooshenas
LLMSV
304
0
0
25 Feb 2025
On Relation-Specific Neurons in Large Language Models
Yihong Liu
Runsheng Chen
Lea Hirlimann
Ahmad Dawar Hakimi
Mingyang Wang
Amir Hossein Kargaran
S. Rothe
François Yvon
Hinrich Schütze
KELM
311
0
0
24 Feb 2025
NeurFlow: Interpreting Neural Networks through Neuron Groups and Functional Interactions
International Conference on Learning Representations (ICLR), 2025
Tue Cao
Nhat X. Hoang
Hieu H. Pham
P. Nguyen
My T. Thai
551
2
0
22 Feb 2025
LaVCa: LLM-assisted Visual Cortex Captioning
Takuya Matsuyama
Shinji Nishimoto
Yu Takagi
318
3
0
20 Feb 2025
Discovering Chunks in Neural Embeddings for Interpretability
Shuchen Wu
Stephan Alaniz
Eric Schulz
Zeynep Akata
295
0
0
03 Feb 2025
Compositional Concept-Based Neuron-Level Interpretability for Deep Reinforcement Learning
Zeyu Jiang
Hai Huang
Xingquan Zuo
OffRL
212
0
0
02 Feb 2025
Towards Utilising a Range of Neural Activations for Comprehending Representational Associations
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2024
Laura O'Mahony
Nikola S. Nikolov
David JP O'Sullivan
448
2
0
15 Nov 2024
Understanding Internal Representations of Recommendation Models with Sparse Autoencoders
Jiayin Wang
Xiaoyu Zhang
Weizhi Ma
Zhiqiang Guo
Min Zhang
278
4
0
09 Nov 2024
Beyond Interpretability: The Gains of Feature Monosemanticity on Model Robustness
International Conference on Learning Representations (ICLR), 2024
Qi Zhang
Yifei Wang
Jingyi Cui
Xiang Pan
Qi Lei
Stefanie Jegelka
Yisen Wang
AAML
302
4
0
27 Oct 2024
Hypothesis Testing the Circuit Hypothesis in LLMs
Neural Information Processing Systems (NeurIPS), 2024
Claudia Shi
Nicolas Beltran-Velez
Achille Nazaret
Carolina Zheng
Adrià Garriga-Alonso
Andrew Jesson
Maggie Makar
David M. Blei
266
19
0
16 Oct 2024
Neuron-based Personality Trait Induction in Large Language Models
Jia Deng
Tianyi Tang
Yanbin Yin
Wenhao Yang
Wayne Xin Zhao
Ji-Rong Wen
252
4
0
16 Oct 2024
Efficiently Democratizing Medical LLMs for 50 Languages via a Mixture of Language Family Experts
International Conference on Learning Representations (ICLR), 2024
Guorui Zheng
Xidong Wang
Juhao Liang
Nuo Chen
Yuping Zheng
Benyou Wang
MoE
315
11
0
14 Oct 2024
Investigating Representation Universality: Case Study on Genealogical Representations
David D. Baek
Yuxiao Li
Max Tegmark
273
3
0
10 Oct 2024
Mechanistic?
BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (BlackBoxNLP), 2024
Naomi Saphra
Sarah Wiegreffe
AI4CE
263
34
0
07 Oct 2024
Linking in Style: Understanding learned features in deep learning models
European Conference on Computer Vision (ECCV), 2024
Maren H. Wehrheim
Pamela Osuna-Vargas
Matthias Kaschube
GAN
213
0
0
25 Sep 2024
Unveiling Language Competence Neurons: A Psycholinguistic Approach to Model Interpretability
International Conference on Computational Linguistics (COLING), 2024
Xufeng Duan
Xinyu Zhou
Bei Xiao
Zhenguang G. Cai
MILM
215
9
0
24 Sep 2024
Optimal ablation for interpretability
Neural Information Processing Systems (NeurIPS), 2024
Maximilian Li
Lucas Janson
FAtt
343
12
0
16 Sep 2024
Interpreting and Improving Large Language Models in Arithmetic Calculation
International Conference on Machine Learning (ICML), 2024
Wei Zhang
Chaoqun Wan
Yonggang Zhang
Yiu-ming Cheung
Xinmei Tian
Xu Shen
Jieping Ye
LRM
342
38
0
03 Sep 2024
Towards Symbolic XAI -- Explanation Through Human Understandable Logical Relationships Between Features
Information Fusion (Inf. Fusion), 2024
Thomas Schnake
Farnoush Rezaei Jafari
Jonas Lederer
Ping Xiong
Shinichi Nakajima
Stefan Gugler
G. Montavon
Klaus-Robert Müller
321
8
0
30 Aug 2024
Unsupervised Composable Representations for Audio
International Society for Music Information Retrieval Conference (ISMIR), 2024
Giovanni Bindi
P. Esling
DiffM
OCL
CoGe
290
3
0
19 Aug 2024
Interpreting Attention Layer Outputs with Sparse Autoencoders
Connor Kissane
Robert Krzyzanowski
Joseph Isaac Bloom
Arthur Conmy
Neel Nanda
MILM
267
37
0
25 Jun 2024
MMNeuron: Discovering Neuron-Level Domain-Specific Interpretation in Multimodal Large Language Model
Jiahao Huo
Yibo Yan
Boren Hu
Yutao Yue
Xuming Hu
LRM
MLLM
266
16
0
17 Jun 2024
Talking Heads: Understanding Inter-layer Communication in Transformer Language Models
Jack Merullo
Carsten Eickhoff
Ellie Pavlick
560
34
0
13 Jun 2024
LLM-assisted Concept Discovery: Automatically Identifying and Explaining Neuron Functions
N. Hoang-Xuan
Minh Nhat Vu
My T. Thai
228
5
0
12 Jun 2024
Graphical Perception of Saliency-based Model Explanations
Yayan Zhao
Mingwei Li
Matthew Berger
XAI
FAtt
342
2
0
11 Jun 2024
Position: An Inner Interpretability Framework for AI Inspired by Lessons from Cognitive Neuroscience
Martina G. Vilas
Federico Adolfi
David Poeppel
Gemma Roig
313
10
0
03 Jun 2024
CoSy: Evaluating Textual Explanations of Neurons
Laura Kopf
P. Bommer
Anna Hedström
Sebastian Lapuschkin
Marina M.-C. Höhne
Kirill Bykov
210
19
0
30 May 2024
Linear Explanations for Individual Neurons
Tuomas P. Oikarinen
Tsui-Wei Weng
FAtt
MILM
265
15
0
10 May 2024
Mechanistic Interpretability for AI Safety -- A Review
Leonard Bereska
E. Gavves
AI4CE
386
307
0
22 Apr 2024
A Multimodal Automated Interpretability Agent
Tamar Rott Shaham
Sarah Schwettmann
Franklin Wang
Achyuta Rajaram
Evan Hernandez
Jacob Andreas
Antonio Torralba
533
45
0
22 Apr 2024
Decomposing and Editing Predictions by Modeling Model Computation
Harshay Shah
Andrew Ilyas
Aleksander Madry
KELM
297
24
0
17 Apr 2024
The SaTML '24 CNN Interpretability Competition: New Innovations for Concept-Level Interpretability
Stephen Casper
Jieun Yun
Joonhyuk Baek
Yeseong Jung
Minhwan Kim
...
A. Nicolson
Arush Tagade
Jessica Rumbelow
Hieu Minh Nguyen
Dylan Hadfield-Menell
284
2
0
03 Apr 2024
WWW: A Unified Framework for Explaining What, Where and Why of Neural Networks by Interpretation of Neuron Concepts
Yong Hyun Ahn
Hyeon Bae Kim
Seong Tae Kim
267
14
0
29 Feb 2024
Language-Specific Neurons: The Key to Multilingual Capabilities in Large Language Models
Tianyi Tang
Wenyang Luo
Haoyang Huang
Dongdong Zhang
Xiaolei Wang
Xin Zhao
Furu Wei
Ji-Rong Wen
363
95
0
26 Feb 2024
Page 1 of 3