ResearchTrend.AI
© 2025 ResearchTrend.AI, All rights reserved.

Codebook Features: Sparse and Discrete Interpretability for Neural Networks

26 October 2023
Alex Tamkin, Mohammad Taufeeque, Noah D. Goodman

Papers citing "Codebook Features: Sparse and Discrete Interpretability for Neural Networks"

29 / 29 papers shown
Self-Ablating Transformers: More Interpretability, Less Sparsity
Jeremias Ferrao, Luhan Mikaelson, Keenan Pepper, Natalia Perez-Campanero Antolin
MILM · 21 · 0 · 0 · 01 May 2025

Interpreting the Linear Structure of Vision-language Model Embedding Spaces
Isabel Papadimitriou, Huangyuan Su, Thomas Fel, Naomi Saphra, Sham Kakade, Stephanie Gil
VLM · 40 · 0 · 0 · 16 Apr 2025

Human Motion Unlearning
Edoardo De Matteis, Matteo Migliarini, Alessio Sampieri, Indro Spinelli, Fabio Galasso
MU · 55 · 0 · 0 · 24 Mar 2025

Universal Sparse Autoencoders: Interpretable Cross-Model Concept Alignment
Harrish Thasarathan, Julian Forsyth, Thomas Fel, M. Kowal, Konstantinos G. Derpanis
100 · 7 · 0 · 06 Feb 2025

Towards scientific discovery with dictionary learning: Extracting biological concepts from microscopy foundation models
Konstantin Donhauser, Kristina Ulicna, Gemma Elyse Moran, Aditya Ravuri, Kian Kenyon-Dean, Cian Eastwood, Jason Hartford
76 · 0 · 0 · 20 Dec 2024

Local vs distributed representations: What is the right basis for interpretability?
Julien Colin, L. Goetschalckx, Thomas Fel, Victor Boutin, Jay Gopal, Thomas Serre, Nuria Oliver
HAI · 21 · 2 · 0 · 06 Nov 2024

Residual vector quantization for KV cache compression in large language model
Ankur Kumar
MQ · 29 · 0 · 0 · 21 Oct 2024

CodeUnlearn: Amortized Zero-Shot Machine Unlearning in Language Models Using Discrete Concept
YuXuan Wu, Bonaventure F. P. Dossou, Dianbo Liu
MU · 16 · 0 · 0 · 08 Oct 2024

Mathematical Models of Computation in Superposition
Kaarel Hänni, Jake Mendel, Dmitry Vaintrob, Lawrence Chan
SupR · 20 · 7 · 0 · 10 Aug 2024

The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability
Aaron Mueller, Jannik Brinkmann, Millicent Li, Samuel Marks, Koyena Pal, ..., Arnab Sen Sharma, Jiuding Sun, Eric Todd, David Bau, Yonatan Belinkov
CML · 42 · 18 · 0 · 02 Aug 2024

The Interpretability of Codebooks in Model-Based Reinforcement Learning is Limited
Kenneth Eaton, Jonathan C. Balloch, Julia Kim, Mark O. Riedl
FAtt, OffRL · 21 · 0 · 0 · 28 Jul 2024

Knowledge Mechanisms in Large Language Models: A Survey and Perspective
Meng Wang, Yunzhi Yao, Ziwen Xu, Shuofei Qiao, Shumin Deng, ..., Yong-jia Jiang, Pengjun Xie, Fei Huang, Huajun Chen, Ningyu Zhang
47 · 27 · 0 · 22 Jul 2024

Neural Concept Binder
Wolfgang Stammer, Antonia Wüst, David Steinmann, Kristian Kersting
OCL · 29 · 4 · 0 · 14 Jun 2024

Discrete Dictionary-based Decomposition Layer for Structured Representation Learning
Taewon Park, Hyun-Chul Kim, Minho Lee
31 · 0 · 0 · 11 Jun 2024

InversionView: A General-Purpose Method for Reading Information from Neural Activations
Xinting Huang, Madhur Panwar, Navin Goyal, Michael Hahn
26 · 3 · 0 · 27 May 2024

Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models
Charles O'Neill, Thang Bui
25 · 5 · 0 · 21 May 2024

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning
Dan Braun, Jordan K. Taylor, Nicholas Goldowsky-Dill, Lee D. Sharkey
21 · 37 · 0 · 17 May 2024

Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
Aleksandar Makelov, Georg Lange, Neel Nanda
24 · 33 · 0 · 14 May 2024

Improving Dictionary Learning with Gated Sparse Autoencoders
Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda
RALM · 20 · 78 · 0 · 24 Apr 2024

Understanding the role of FFNs in driving multilingual behaviour in LLMs
Sunit Bhattacharya, Ondrej Bojar
21 · 2 · 0 · 22 Apr 2024

RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations
Jing-ling Huang, Zhengxuan Wu, Christopher Potts, Mor Geva, Atticus Geiger
55 · 26 · 0 · 27 Feb 2024

Backward Lens: Projecting Language Model Gradients into the Vocabulary Space
Shahar Katz, Yonatan Belinkov, Mor Geva, Lior Wolf
47 · 10 · 1 · 20 Feb 2024

Symbolic Autoencoding for Self-Supervised Sequence Learning
Mohammad Hossein Amani, Nicolas Mario Baldwin, Amin Mansouri, Martin Josifoski, Maxime Peyrard, Robert West
13 · 1 · 0 · 16 Feb 2024

Improving Semantic Control in Discrete Latent Spaces with Transformer Quantized Variational Autoencoders
Yingji Zhang, Danilo S. Carvalho, Marco Valentino, Ian Pratt-Hartmann, André Freitas
DRL · 38 · 5 · 0 · 01 Feb 2024

Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations
Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas F. Icard, Noah D. Goodman
CML · 73 · 98 · 0 · 05 Mar 2023

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt
210 · 491 · 0 · 01 Nov 2022

Post-hoc Concept Bottleneck Models
Mert Yuksekgonul, Maggie Wang, James Y. Zou
133 · 183 · 0 · 31 May 2022

Editing a classifier by rewriting its prediction rules
Shibani Santurkar, Dimitris Tsipras, Mahalaxmi Elango, David Bau, Antonio Torralba, A. Madry
KELM · 175 · 89 · 0 · 02 Dec 2021

Fast Model Editing at Scale
E. Mitchell, Charles Lin, Antoine Bosselut, Chelsea Finn, Christopher D. Manning
KELM · 219 · 341 · 0 · 21 Oct 2021