Archetypal SAE: Adaptive and Stable Dictionary Learning for Concept Extraction in Large Vision Models


18 February 2025

Ekdeep Singh Lubana, Jacob S. Prince, Isabel Papadimitriou, Martin Wattenberg


Papers citing "Archetypal SAE: Adaptive and Stable Dictionary Learning for Concept Extraction in Large Vision Models"


From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit. Valérie Costa, Thomas Fel, Ekdeep Singh Lubana, Bahareh Tolooshams, Demba Ba. 03 Jun 2025.
BehaviorBox: Automated Discovery of Fine-Grained Performance Differences Between Language Models. Lindia Tjuatja, Graham Neubig. 02 Jun 2025.
Ensembling Sparse Autoencoders. Soham Gadgil, Chris Lin, Su-In Lee. 21 May 2025.
Interpreting the linear structure of vision-language model embedding spaces. Isabel Papadimitriou, Huangyuan Su, Thomas Fel, Naomi Saphra, Sham Kakade. 16 Apr 2025.
Projecting Assumptions: The Duality Between Sparse Autoencoders and Concept Geometry. Sai Sumedh R. Hindupur, Ekdeep Singh Lubana, Thomas Fel, Demba Ba. 03 Mar 2025.
Archetypal Analysis for Binary Data. Anna Emilie J. Wedenborg, Morten Mørup. 06 Feb 2025.
Universal Sparse Autoencoders: Interpretable Cross-Model Concept Alignment. Harrish Thasarathan, Julian Forsyth, Thomas Fel, M. Kowal, Konstantinos G. Derpanis. 06 Feb 2025.
Towards Unifying Interpretability and Control: Evaluation via Intervention. Usha Bhalla, Suraj Srinivas, Asma Ghandeharioun, Himabindu Lakkaraju. 07 Nov 2024.
Analyzing (In)Abilities of SAEs via Formal Languages. Abhinav Menon, Manish Shrivastava, David M. Krueger, Ekdeep Singh Lubana. 15 Oct 2024.