Compositional Explanations of Neurons

Neural Information Processing Systems (NeurIPS), 2020
24 June 2020
Jesse Mu
Jacob Andreas
FAtt, CoGe, MILM

Papers citing "Compositional Explanations of Neurons"

50 / 146 papers shown
Guaranteed Optimal Compositional Explanations for Neurons
Biagio La Rosa
Leilani H. Gilpin
80
0
0
25 Nov 2025
Open Vocabulary Compositional Explanations for Neuron Alignment
Biagio La Rosa
Leilani H. Gilpin
OCL
339
0
0
25 Nov 2025
Where Culture Fades: Revealing the Cultural Gap in Text-to-Image Generation
Chuancheng Shi
Shangze Li
Shiming Guo
Simiao Xie
Wenhua Wu
...
Canran Xiao
Cong Wang
Zifeng Cheng
Fei Shen
Tat-Seng Chua
VLM
228
0
0
21 Nov 2025
Automated Detection of Visual Attribute Reliance with a Self-Reflective Agent
Christy Li
Josep Lopez Camunas
Jake Thomas Touchet
Jacob Andreas
Àgata Lapedriza
Antonio Torralba
Tamar Rott Shaham
197
0
0
24 Oct 2025
Programmatic Representation Learning with Language Models
Gabriel Poesia
Georgia Gabriela Sampaio
87
0
0
16 Oct 2025
Interpreting Language Models Through Concept Descriptions: A Survey
Nils Feldhus
Laura Kopf
MILM
154
0
0
01 Oct 2025
Negative Pre-activations Differentiate Syntax
Linghao Kong
Angelina Ning
Micah Adler
Nir Shavit
127
0
0
29 Sep 2025
NeuroStrike: Neuron-Level Attacks on Aligned LLMs
Lichao Wu
Sasha Behrouzi
Mohamadreza Rostami
Maximilian Thang
S. Picek
A. Sadeghi
AAML
270
1
0
15 Sep 2025
On the Performance of Concept Probing: The Influence of the Data (Extended Version)
Manuel de Sousa Ribeiro
Afonso Leote
João Leite
197
1
0
24 Jul 2025
Concept Probing: Where to Find Human-Defined Concepts (Extended Version)
Manuel de Sousa Ribeiro
Afonso Leote
João Leite
189
1
0
24 Jul 2025
Capturing Polysemanticity with PRISM: A Multi-Concept Feature Description Framework
Laura Kopf
Nils Feldhus
Kirill Bykov
P. Bommer
Anna Hedström
Marina M.-C. Höhne
Oliver Eberle
409
4
0
18 Jun 2025
Evaluating Neuron Explanations: A Unified Framework with Sanity Checks
Tuomas P. Oikarinen
Ge Yan
Tsui-Wei Weng
FAtt, XAI
175
7
0
06 Jun 2025
Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors
Jing Huang
Junyi Tao
Thomas Icard
Diyi Yang
Christopher Potts
OODD
454
4
0
17 May 2025
Disentangling Polysemantic Channels in Convolutional Neural Networks
Robin Hesse
Jonas Fischer
Simone Schaub-Meyer
Stefan Roth
FAtt, MILM
270
3
0
17 Apr 2025
Following the Whispers of Values: Unraveling Neural Mechanisms Behind Value-Oriented Behaviors in LLMs
Ling Hu
Yuemei Xu
Xiaoyang Gu
Letao Han
389
1
0
07 Apr 2025
HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks
International Conference on Learning Representations (ICLR), 2025
Jiuding Sun
Jing Huang
Sidharth Baskaran
Karel D'Oosterlinck
Christopher Potts
Michael Sklar
Atticus Geiger
AI4CE
430
5
0
13 Mar 2025
Steered Generation via Gradient Descent on Sparse Features
Sumanta Bhattacharyya
Pedram Rooshenas
LLMSV
304
0
0
25 Feb 2025
On Relation-Specific Neurons in Large Language Models
Yihong Liu
Runsheng Chen
Lea Hirlimann
Ahmad Dawar Hakimi
Mingyang Wang
Amir Hossein Kargaran
S. Rothe
François Yvon
Hinrich Schütze
KELM
311
0
0
24 Feb 2025
NeurFlow: Interpreting Neural Networks through Neuron Groups and Functional Interactions
International Conference on Learning Representations (ICLR), 2025
Tue Cao
Nhat X. Hoang
Hieu H. Pham
P. Nguyen
My T. Thai
551
2
0
22 Feb 2025
LaVCa: LLM-assisted Visual Cortex Captioning
Takuya Matsuyama
Shinji Nishimoto
Yu Takagi
318
3
0
20 Feb 2025
Discovering Chunks in Neural Embeddings for Interpretability
Shuchen Wu
Stephan Alaniz
Eric Schulz
Zeynep Akata
295
0
0
03 Feb 2025
Compositional Concept-Based Neuron-Level Interpretability for Deep Reinforcement Learning
Zeyu Jiang
Hai Huang
Xingquan Zuo
OffRL
212
0
0
02 Feb 2025
Towards Utilising a Range of Neural Activations for Comprehending Representational Associations
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2024
Laura O'Mahony
Nikola S. Nikolov
David JP O'Sullivan
448
2
0
15 Nov 2024
Understanding Internal Representations of Recommendation Models with Sparse Autoencoders
Jiayin Wang
Xiaoyu Zhang
Weizhi Ma
Zhiqiang Guo
Min Zhang
278
4
0
09 Nov 2024
Beyond Interpretability: The Gains of Feature Monosemanticity on Model Robustness
International Conference on Learning Representations (ICLR), 2024
Qi Zhang
Yifei Wang
Jingyi Cui
Xiang Pan
Qi Lei
Stefanie Jegelka
Yisen Wang
AAML
302
4
0
27 Oct 2024
Hypothesis Testing the Circuit Hypothesis in LLMs
Neural Information Processing Systems (NeurIPS), 2024
Claudia Shi
Nicolas Beltran-Velez
Achille Nazaret
Carolina Zheng
Adrià Garriga-Alonso
Andrew Jesson
Maggie Makar
David M. Blei
266
19
0
16 Oct 2024
Neuron-based Personality Trait Induction in Large Language Models
Jia Deng
Tianyi Tang
Yanbin Yin
Wenhao Yang
Wayne Xin Zhao
Ji-Rong Wen
252
4
0
16 Oct 2024
Efficiently Democratizing Medical LLMs for 50 Languages via a Mixture of Language Family Experts
International Conference on Learning Representations (ICLR), 2024
Guorui Zheng
Xidong Wang
Juhao Liang
Nuo Chen
Yuping Zheng
Benyou Wang
MoE
315
11
0
14 Oct 2024
Investigating Representation Universality: Case Study on Genealogical Representations
David D. Baek
Yuxiao Li
Max Tegmark
273
3
0
10 Oct 2024
Mechanistic?
BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (BlackBoxNLP), 2024
Naomi Saphra
Sarah Wiegreffe
AI4CE
263
34
0
07 Oct 2024
Linking in Style: Understanding learned features in deep learning models
European Conference on Computer Vision (ECCV), 2024
Maren H. Wehrheim
Pamela Osuna-Vargas
Matthias Kaschube
GAN
213
0
0
25 Sep 2024
Unveiling Language Competence Neurons: A Psycholinguistic Approach to Model Interpretability
International Conference on Computational Linguistics (COLING), 2024
Xufeng Duan
Xinyu Zhou
Bei Xiao
Zhenguang G. Cai
MILM
215
9
0
24 Sep 2024
Optimal ablation for interpretability
Neural Information Processing Systems (NeurIPS), 2024
Maximilian Li
Lucas Janson
FAtt
343
12
0
16 Sep 2024
Interpreting and Improving Large Language Models in Arithmetic Calculation
International Conference on Machine Learning (ICML), 2024
Wei Zhang
Chaoqun Wan
Yonggang Zhang
Yiu-ming Cheung
Xinmei Tian
Xu Shen
Jieping Ye
LRM
342
38
0
03 Sep 2024
Towards Symbolic XAI -- Explanation Through Human Understandable Logical Relationships Between Features
Information Fusion (Inf. Fusion), 2024
Thomas Schnake
Farnoush Rezaei Jafari
Jonas Lederer
Ping Xiong
Shinichi Nakajima
Stefan Gugler
G. Montavon
Klaus-Robert Müller
321
8
0
30 Aug 2024
Unsupervised Composable Representations for Audio
International Society for Music Information Retrieval Conference (ISMIR), 2024
Giovanni Bindi
P. Esling
DiffM, OCL, CoGe
290
3
0
19 Aug 2024
Interpreting Attention Layer Outputs with Sparse Autoencoders
Connor Kissane
Robert Krzyzanowski
Joseph Isaac Bloom
Arthur Conmy
Neel Nanda
MILM
267
37
0
25 Jun 2024
MMNeuron: Discovering Neuron-Level Domain-Specific Interpretation in Multimodal Large Language Model
Jiahao Huo
Yibo Yan
Boren Hu
Yutao Yue
Xuming Hu
LRM, MLLM
266
16
0
17 Jun 2024
Talking Heads: Understanding Inter-layer Communication in Transformer Language Models
Jack Merullo
Carsten Eickhoff
Ellie Pavlick
560
34
0
13 Jun 2024
LLM-assisted Concept Discovery: Automatically Identifying and Explaining Neuron Functions
N. Hoang-Xuan
Minh Nhat Vu
My T. Thai
228
5
0
12 Jun 2024
Graphical Perception of Saliency-based Model Explanations
Yayan Zhao
Mingwei Li
Matthew Berger
XAI, FAtt
342
2
0
11 Jun 2024
Position: An Inner Interpretability Framework for AI Inspired by Lessons from Cognitive Neuroscience
Martina G. Vilas
Federico Adolfi
David Poeppel
Gemma Roig
313
10
0
03 Jun 2024
CoSy: Evaluating Textual Explanations of Neurons
Laura Kopf
P. Bommer
Anna Hedström
Sebastian Lapuschkin
Marina M.-C. Höhne
Kirill Bykov
210
19
0
30 May 2024
Linear Explanations for Individual Neurons
Tuomas P. Oikarinen
Tsui-Wei Weng
FAtt, MILM
265
15
0
10 May 2024
Mechanistic Interpretability for AI Safety -- A Review
Leonard Bereska
E. Gavves
AI4CE
386
307
0
22 Apr 2024
A Multimodal Automated Interpretability Agent
Tamar Rott Shaham
Sarah Schwettmann
Franklin Wang
Achyuta Rajaram
Evan Hernandez
Jacob Andreas
Antonio Torralba
533
45
0
22 Apr 2024
Decomposing and Editing Predictions by Modeling Model Computation
Harshay Shah
Andrew Ilyas
Aleksander Madry
KELM
297
24
0
17 Apr 2024
The SaTML '24 CNN Interpretability Competition: New Innovations for Concept-Level Interpretability
Stephen Casper
Jieun Yun
Joonhyuk Baek
Yeseong Jung
Minhwan Kim
...
A. Nicolson
Arush Tagade
Jessica Rumbelow
Hieu Minh Nguyen
Dylan Hadfield-Menell
284
2
0
03 Apr 2024
WWW: A Unified Framework for Explaining What, Where and Why of Neural Networks by Interpretation of Neuron Concepts
Yong Hyun Ahn
Hyeon Bae Kim
Seong Tae Kim
267
14
0
29 Feb 2024
Language-Specific Neurons: The Key to Multilingual Capabilities in Large Language Models
Tianyi Tang
Wenyang Luo
Haoyang Huang
Dongdong Zhang
Xiaolei Wang
Xin Zhao
Furu Wei
Ji-Rong Wen
363
95
0
26 Feb 2024