Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching

28 November 2023
Aleksandar Makelov, Georg Lange, Neel Nanda
arXiv: 2311.17030

Papers citing "Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching"

19 papers shown

Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability -- The Strange Science Part I.ii
Kola Ayonrinde, Louis Jaburi (02 May 2025) [XAI]

Interpreting the linear structure of vision-language model embedding spaces
Isabel Papadimitriou, Huangyuan Su, Thomas Fel, Naomi Saphra, Sham Kakade (16 Apr 2025) [VLM]

Everything, Everywhere, All at Once: Is Mechanistic Interpretability Identifiable?
Maxime Méloux, Silviu Maniu, François Portet, Maxime Peyrard (28 Feb 2025)

Archetypal SAE: Adaptive and Stable Dictionary Learning for Concept Extraction in Large Vision Models
Thomas Fel, Ekdeep Singh Lubana, Jacob S. Prince, M. Kowal, Victor Boutin, Isabel Papadimitriou, Binxu Wang, Martin Wattenberg, Demba Ba, Talia Konkle (18 Feb 2025)

Understanding Multimodal LLMs: the Mechanistic Interpretability of Llava in Visual Question Answering
Zeping Yu, Sophia Ananiadou (17 Nov 2024)

Towards Utilising a Range of Neural Activations for Comprehending Representational Associations
Laura O'Mahony, Nikola S. Nikolov, David JP O'Sullivan (15 Nov 2024)

Sparse Attention Decomposition Applied to Circuit Tracing
Gabriel Franco, Mark Crovella (01 Oct 2024)

Optimal ablation for interpretability
Maximilian Li, Lucas Janson (16 Sep 2024) [FAtt]

Relational Composition in Neural Networks: A Survey and Call to Action
Martin Wattenberg, Fernanda Viégas (19 Jul 2024) [CoGe]

Transformer Circuit Faithfulness Metrics are not Robust
Joseph Miller, Bilal Chughtai, William Saunders (11 Jul 2024)

Anthropocentric bias in language model evaluation
Raphaël Millière, Charles Rathkopf (04 Jul 2024)

Finding Transformer Circuits with Edge Pruning
Adithya Bhaskar, Alexander Wettig, Dan Friedman, Danqi Chen (24 Jun 2024)

Beyond the Doors of Perception: Vision Transformers Represent Relations Between Objects
Michael A. Lepori, Alexa R. Tartaglini, Wai Keen Vong, Thomas Serre, Brenden M. Lake, Ellie Pavlick (22 Jun 2024)

Talking Heads: Understanding Inter-layer Communication in Transformer Language Models
Jack Merullo, Carsten Eickhoff, Ellie Pavlick (13 Jun 2024)

ReFT: Representation Finetuning for Language Models
Zhengxuan Wu, Aryaman Arora, Zheng Wang, Atticus Geiger, Daniel Jurafsky, Christopher D. Manning, Christopher Potts (04 Apr 2024) [OffRL]

Interpreting Key Mechanisms of Factual Recall in Transformer-Based Language Models
Ang Lv, Yuhan Chen, Kaiyi Zhang, Yulong Wang, Lifeng Liu, Ji-Rong Wen, Jian Xie, Rui Yan (28 Mar 2024) [KELM]

Interpreting CLIP with Sparse Linear Concept Embeddings (SpLiCE)
Usha Bhalla, Alexander X. Oesterling, Suraj Srinivas, Flavio du Pin Calmon, Himabindu Lakkaraju (16 Feb 2024)

A Reply to Makelov et al. (2023)'s "Interpretability Illusion" Arguments
Zhengxuan Wu, Atticus Geiger, Jing-ling Huang, Aryaman Arora, Thomas Icard, Christopher Potts, Noah D. Goodman (23 Jan 2024)

An Adversarial Example for Direct Logit Attribution: Memory Management in gelu-4l
James Dao, Yeu-Tong Lau, Can Rager, Jett Janiak (11 Oct 2023)