Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching

28 November 2023
Aleksandar Makelov, Georg Lange, Neel Nanda
arXiv: 2311.17030

Papers citing "Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching"

19 papers shown

Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability -- The Strange Science Part I.ii
Kola Ayonrinde, Louis Jaburi (02 May 2025) [XAI]

Interpreting the linear structure of vision-language model embedding spaces
Isabel Papadimitriou, Huangyuan Su, Thomas Fel, Naomi Saphra, Sham Kakade (16 Apr 2025) [VLM]

Everything, Everywhere, All at Once: Is Mechanistic Interpretability Identifiable?
Maxime Méloux, Silviu Maniu, François Portet, Maxime Peyrard (28 Feb 2025)

Archetypal SAE: Adaptive and Stable Dictionary Learning for Concept Extraction in Large Vision Models
Thomas Fel, Ekdeep Singh Lubana, Jacob S. Prince, M. Kowal, Victor Boutin, Isabel Papadimitriou, Binxu Wang, Martin Wattenberg, Demba Ba, Talia Konkle (18 Feb 2025)

Understanding Multimodal LLMs: the Mechanistic Interpretability of Llava in Visual Question Answering
Zeping Yu, Sophia Ananiadou (17 Nov 2024)

Towards Utilising a Range of Neural Activations for Comprehending Representational Associations
Laura O'Mahony, Nikola S. Nikolov, David JP O'Sullivan (15 Nov 2024)

Sparse Attention Decomposition Applied to Circuit Tracing
Gabriel Franco, Mark Crovella (01 Oct 2024)

Optimal ablation for interpretability
Maximilian Li, Lucas Janson (16 Sep 2024) [FAtt]

Relational Composition in Neural Networks: A Survey and Call to Action
Martin Wattenberg, Fernanda Viégas (19 Jul 2024) [CoGe]

Transformer Circuit Faithfulness Metrics are not Robust
Joseph Miller, Bilal Chughtai, William Saunders (11 Jul 2024)

Anthropocentric bias in language model evaluation
Raphaël Millière, Charles Rathkopf (04 Jul 2024)

Finding Transformer Circuits with Edge Pruning
Adithya Bhaskar, Alexander Wettig, Dan Friedman, Danqi Chen (24 Jun 2024)

Beyond the Doors of Perception: Vision Transformers Represent Relations Between Objects
Michael A. Lepori, Alexa R. Tartaglini, Wai Keen Vong, Thomas Serre, Brenden M. Lake, Ellie Pavlick (22 Jun 2024)

Talking Heads: Understanding Inter-layer Communication in Transformer Language Models
Jack Merullo, Carsten Eickhoff, Ellie Pavlick (13 Jun 2024)

ReFT: Representation Finetuning for Language Models
Zhengxuan Wu, Aryaman Arora, Zheng Wang, Atticus Geiger, Daniel Jurafsky, Christopher D. Manning, Christopher Potts (04 Apr 2024) [OffRL]

Interpreting Key Mechanisms of Factual Recall in Transformer-Based Language Models
Ang Lv, Yuhan Chen, Kaiyi Zhang, Yulong Wang, Lifeng Liu, Ji-Rong Wen, Jian Xie, Rui Yan (28 Mar 2024) [KELM]

Interpreting CLIP with Sparse Linear Concept Embeddings (SpLiCE)
Usha Bhalla, Alexander X. Oesterling, Suraj Srinivas, Flavio du Pin Calmon, Himabindu Lakkaraju (16 Feb 2024)

A Reply to Makelov et al. (2023)'s "Interpretability Illusion" Arguments
Zhengxuan Wu, Atticus Geiger, Jing-ling Huang, Aryaman Arora, Thomas Icard, Christopher Potts, Noah D. Goodman (23 Jan 2024)

An Adversarial Example for Direct Logit Attribution: Memory Management in gelu-4l
James Dao, Yeu-Tong Lau, Can Rager, Jett Janiak (11 Oct 2023)