Compositional Explanations of Neurons
Neural Information Processing Systems (NeurIPS), 2020
24 June 2020
Jesse Mu
Jacob Andreas
FAtt
CoGe
MILM
Papers citing "Compositional Explanations of Neurons" (50 / 146 papers shown)
Guaranteed Optimal Compositional Explanations for Neurons
Biagio La Rosa
Leilani H. Gilpin
80
0
0
25 Nov 2025
Open Vocabulary Compositional Explanations for Neuron Alignment
Biagio La Rosa
Leilani H. Gilpin
OCL
339
0
0
25 Nov 2025
Where Culture Fades: Revealing the Cultural Gap in Text-to-Image Generation
Chuancheng Shi
Shangze Li
Shiming Guo
Simiao Xie
Wenhua Wu
...
Canran Xiao
Cong Wang
Zifeng Cheng
Fei Shen
Tat-Seng Chua
VLM
228
0
0
21 Nov 2025
Automated Detection of Visual Attribute Reliance with a Self-Reflective Agent
Christy Li
Josep Lopez Camunas
Jake Thomas Touchet
Jacob Andreas
Àgata Lapedriza
Antonio Torralba
Tamar Rott Shaham
197
0
0
24 Oct 2025
Programmatic Representation Learning with Language Models
Gabriel Poesia
Georgia Gabriela Sampaio
87
0
0
16 Oct 2025
Interpreting Language Models Through Concept Descriptions: A Survey
Nils Feldhus
Laura Kopf
MILM
154
0
0
01 Oct 2025
Negative Pre-activations Differentiate Syntax
Linghao Kong
Angelina Ning
Micah Adler
Nir Shavit
127
0
0
29 Sep 2025
NeuroStrike: Neuron-Level Attacks on Aligned LLMs
Lichao Wu
Sasha Behrouzi
Mohamadreza Rostami
Maximilian Thang
S. Picek
A. Sadeghi
AAML
270
1
0
15 Sep 2025
On the Performance of Concept Probing: The Influence of the Data (Extended Version)
Manuel de Sousa Ribeiro
Afonso Leote
João Leite
197
1
0
24 Jul 2025
Concept Probing: Where to Find Human-Defined Concepts (Extended Version)
Manuel de Sousa Ribeiro
Afonso Leote
João Leite
189
1
0
24 Jul 2025
Capturing Polysemanticity with PRISM: A Multi-Concept Feature Description Framework
Laura Kopf
Nils Feldhus
Kirill Bykov
P. Bommer
Anna Hedström
Marina M.-C. Höhne
Oliver Eberle
409
4
0
18 Jun 2025
Evaluating Neuron Explanations: A Unified Framework with Sanity Checks
Tuomas P. Oikarinen
Ge Yan
Tsui-Wei Weng
FAtt
XAI
175
7
0
06 Jun 2025
Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors
Jing Huang
Junyi Tao
Thomas Icard
Diyi Yang
Christopher Potts
OODD
454
4
0
17 May 2025
Disentangling Polysemantic Channels in Convolutional Neural Networks
Robin Hesse
Jonas Fischer
Simone Schaub-Meyer
Stefan Roth
FAtt
MILM
270
3
0
17 Apr 2025
Following the Whispers of Values: Unraveling Neural Mechanisms Behind Value-Oriented Behaviors in LLMs
Ling Hu
Yuemei Xu
Xiaoyang Gu
Letao Han
389
1
0
07 Apr 2025
HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks
International Conference on Learning Representations (ICLR), 2025
Jiuding Sun
Jing Huang
Sidharth Baskaran
Karel D'Oosterlinck
Christopher Potts
Michael Sklar
Atticus Geiger
AI4CE
430
5
0
13 Mar 2025
Steered Generation via Gradient Descent on Sparse Features
Sumanta Bhattacharyya
Pedram Rooshenas
LLMSV
304
0
0
25 Feb 2025
On Relation-Specific Neurons in Large Language Models
Yihong Liu
Runsheng Chen
Lea Hirlimann
Ahmad Dawar Hakimi
Mingyang Wang
Amir Hossein Kargaran
S. Rothe
François Yvon
Hinrich Schütze
KELM
311
0
0
24 Feb 2025
NeurFlow: Interpreting Neural Networks through Neuron Groups and Functional Interactions
International Conference on Learning Representations (ICLR), 2025
Tue Cao
Nhat X. Hoang
Hieu H. Pham
P. Nguyen
My T. Thai
551
2
0
22 Feb 2025
LaVCa: LLM-assisted Visual Cortex Captioning
Takuya Matsuyama
Shinji Nishimoto
Yu Takagi
318
3
0
20 Feb 2025
Discovering Chunks in Neural Embeddings for Interpretability
Shuchen Wu
Stephan Alaniz
Eric Schulz
Zeynep Akata
295
0
0
03 Feb 2025
Compositional Concept-Based Neuron-Level Interpretability for Deep Reinforcement Learning
Zeyu Jiang
Hai Huang
Xingquan Zuo
OffRL
212
0
0
02 Feb 2025
Towards Utilising a Range of Neural Activations for Comprehending Representational Associations
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2024
Laura O'Mahony
Nikola S. Nikolov
David JP O'Sullivan
448
2
0
15 Nov 2024
Understanding Internal Representations of Recommendation Models with Sparse Autoencoders
Jiayin Wang
Xiaoyu Zhang
Weizhi Ma
Zhiqiang Guo
Min Zhang
278
4
0
09 Nov 2024
Beyond Interpretability: The Gains of Feature Monosemanticity on Model Robustness
International Conference on Learning Representations (ICLR), 2024
Qi Zhang
Yifei Wang
Jingyi Cui
Xiang Pan
Qi Lei
Stefanie Jegelka
Yisen Wang
AAML
302
4
0
27 Oct 2024
Hypothesis Testing the Circuit Hypothesis in LLMs
Neural Information Processing Systems (NeurIPS), 2024
Claudia Shi
Nicolas Beltran-Velez
Achille Nazaret
Carolina Zheng
Adrià Garriga-Alonso
Andrew Jesson
Maggie Makar
David M. Blei
266
19
0
16 Oct 2024
Neuron-based Personality Trait Induction in Large Language Models
Jia Deng
Tianyi Tang
Yanbin Yin
Wenhao Yang
Wayne Xin Zhao
Ji-Rong Wen
252
4
0
16 Oct 2024
Efficiently Democratizing Medical LLMs for 50 Languages via a Mixture of Language Family Experts
International Conference on Learning Representations (ICLR), 2024
Guorui Zheng
Xidong Wang
Juhao Liang
Nuo Chen
Yuping Zheng
Benyou Wang
MoE
315
11
0
14 Oct 2024
Investigating Representation Universality: Case Study on Genealogical Representations
David D. Baek
Yuxiao Li
Max Tegmark
273
3
0
10 Oct 2024
Mechanistic?
BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (BlackBoxNLP), 2024
Naomi Saphra
Sarah Wiegreffe
AI4CE
263
34
0
07 Oct 2024
Linking in Style: Understanding learned features in deep learning models
European Conference on Computer Vision (ECCV), 2024
Maren H. Wehrheim
Pamela Osuna-Vargas
Matthias Kaschube
GAN
213
0
0
25 Sep 2024
Unveiling Language Competence Neurons: A Psycholinguistic Approach to Model Interpretability
International Conference on Computational Linguistics (COLING), 2024
Xufeng Duan
Xinyu Zhou
Bei Xiao
Zhenguang G. Cai
MILM
215
9
0
24 Sep 2024
Optimal ablation for interpretability
Neural Information Processing Systems (NeurIPS), 2024
Maximilian Li
Lucas Janson
FAtt
343
12
0
16 Sep 2024
Interpreting and Improving Large Language Models in Arithmetic Calculation
International Conference on Machine Learning (ICML), 2024
Wei Zhang
Chaoqun Wan
Yonggang Zhang
Yiu-ming Cheung
Xinmei Tian
Xu Shen
Jieping Ye
LRM
342
38
0
03 Sep 2024
Towards Symbolic XAI -- Explanation Through Human Understandable Logical Relationships Between Features
Information Fusion (Inf. Fusion), 2024
Thomas Schnake
Farnoush Rezaei Jafari
Jonas Lederer
Ping Xiong
Shinichi Nakajima
Stefan Gugler
G. Montavon
Klaus-Robert Müller
321
8
0
30 Aug 2024
Unsupervised Composable Representations for Audio
International Society for Music Information Retrieval Conference (ISMIR), 2024
Giovanni Bindi
P. Esling
DiffM
OCL
CoGe
290
3
0
19 Aug 2024
Interpreting Attention Layer Outputs with Sparse Autoencoders
Connor Kissane
Robert Krzyzanowski
Joseph Isaac Bloom
Arthur Conmy
Neel Nanda
MILM
267
37
0
25 Jun 2024
MMNeuron: Discovering Neuron-Level Domain-Specific Interpretation in Multimodal Large Language Model
Jiahao Huo
Yibo Yan
Boren Hu
Yutao Yue
Xuming Hu
LRM
MLLM
266
16
0
17 Jun 2024
Talking Heads: Understanding Inter-layer Communication in Transformer Language Models
Jack Merullo
Carsten Eickhoff
Ellie Pavlick
560
34
0
13 Jun 2024
LLM-assisted Concept Discovery: Automatically Identifying and Explaining Neuron Functions
N. Hoang-Xuan
Minh Nhat Vu
My T. Thai
228
5
0
12 Jun 2024
Graphical Perception of Saliency-based Model Explanations
Yayan Zhao
Mingwei Li
Matthew Berger
XAI
FAtt
342
2
0
11 Jun 2024
Position: An Inner Interpretability Framework for AI Inspired by Lessons from Cognitive Neuroscience
Martina G. Vilas
Federico Adolfi
David Poeppel
Gemma Roig
313
10
0
03 Jun 2024
CoSy: Evaluating Textual Explanations of Neurons
Laura Kopf
P. Bommer
Anna Hedström
Sebastian Lapuschkin
Marina M.-C. Höhne
Kirill Bykov
210
19
0
30 May 2024
Linear Explanations for Individual Neurons
Tuomas P. Oikarinen
Tsui-Wei Weng
FAtt
MILM
265
15
0
10 May 2024
Mechanistic Interpretability for AI Safety -- A Review
Leonard Bereska
E. Gavves
AI4CE
386
307
0
22 Apr 2024
A Multimodal Automated Interpretability Agent
Tamar Rott Shaham
Sarah Schwettmann
Franklin Wang
Achyuta Rajaram
Evan Hernandez
Jacob Andreas
Antonio Torralba
533
45
0
22 Apr 2024
Decomposing and Editing Predictions by Modeling Model Computation
Harshay Shah
Andrew Ilyas
Aleksander Madry
KELM
297
24
0
17 Apr 2024
The SaTML '24 CNN Interpretability Competition: New Innovations for Concept-Level Interpretability
Stephen Casper
Jieun Yun
Joonhyuk Baek
Yeseong Jung
Minhwan Kim
...
A. Nicolson
Arush Tagade
Jessica Rumbelow
Hieu Minh Nguyen
Dylan Hadfield-Menell
284
2
0
03 Apr 2024
WWW: A Unified Framework for Explaining What, Where and Why of Neural Networks by Interpretation of Neuron Concepts
Yong Hyun Ahn
Hyeon Bae Kim
Seong Tae Kim
267
14
0
29 Feb 2024
Language-Specific Neurons: The Key to Multilingual Capabilities in Large Language Models
Tianyi Tang
Wenyang Luo
Haoyang Huang
Dongdong Zhang
Xiaolei Wang
Xin Zhao
Furu Wei
Ji-Rong Wen
363
95
0
26 Feb 2024
Page 1 of 3