Codebook Features: Sparse and Discrete Interpretability for Neural Networks
Alex Tamkin, Mohammad Taufeeque, Noah D. Goodman
arXiv:2310.17230 (26 October 2023)

Papers citing "Codebook Features: Sparse and Discrete Interpretability for Neural Networks" (29 papers)

Self-Ablating Transformers: More Interpretability, Less Sparsity
Jeremias Ferrao, Luhan Mikaelson, Keenan Pepper, Natalia Perez-Campanero Antolin (01 May 2025) [MILM]

Interpreting the Linear Structure of Vision-language Model Embedding Spaces
Isabel Papadimitriou, Huangyuan Su, Thomas Fel, Naomi Saphra, Sham Kakade, Stephanie Gil (16 Apr 2025) [VLM]

Human Motion Unlearning
Edoardo De Matteis, Matteo Migliarini, Alessio Sampieri, Indro Spinelli, Fabio Galasso (24 Mar 2025) [MU]

Universal Sparse Autoencoders: Interpretable Cross-Model Concept Alignment
Harrish Thasarathan, Julian Forsyth, Thomas Fel, M. Kowal, Konstantinos G. Derpanis (06 Feb 2025)

Towards scientific discovery with dictionary learning: Extracting biological concepts from microscopy foundation models
Konstantin Donhauser, Kristina Ulicna, Gemma Elyse Moran, Aditya Ravuri, Kian Kenyon-Dean, Cian Eastwood, Jason Hartford (20 Dec 2024)

Local vs distributed representations: What is the right basis for interpretability?
Julien Colin, L. Goetschalckx, Thomas Fel, Victor Boutin, Jay Gopal, Thomas Serre, Nuria Oliver (06 Nov 2024) [HAI]

Residual vector quantization for KV cache compression in large language model
Ankur Kumar (21 Oct 2024) [MQ]

CodeUnlearn: Amortized Zero-Shot Machine Unlearning in Language Models Using Discrete Concept
YuXuan Wu, Bonaventure F. P. Dossou, Dianbo Liu (08 Oct 2024) [MU]

Mathematical Models of Computation in Superposition
Kaarel Hänni, Jake Mendel, Dmitry Vaintrob, Lawrence Chan (10 Aug 2024) [SupR]

The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability
Aaron Mueller, Jannik Brinkmann, Millicent Li, Samuel Marks, Koyena Pal, ..., Arnab Sen Sharma, Jiuding Sun, Eric Todd, David Bau, Yonatan Belinkov (02 Aug 2024) [CML]

The Interpretability of Codebooks in Model-Based Reinforcement Learning is Limited
Kenneth Eaton, Jonathan C. Balloch, Julia Kim, Mark O. Riedl (28 Jul 2024) [FAtt, OffRL]

Knowledge Mechanisms in Large Language Models: A Survey and Perspective
Meng Wang, Yunzhi Yao, Ziwen Xu, Shuofei Qiao, Shumin Deng, ..., Yong-jia Jiang, Pengjun Xie, Fei Huang, Huajun Chen, Ningyu Zhang (22 Jul 2024)

Neural Concept Binder
Wolfgang Stammer, Antonia Wüst, David Steinmann, Kristian Kersting (14 Jun 2024) [OCL]

Discrete Dictionary-based Decomposition Layer for Structured Representation Learning
Taewon Park, Hyun-Chul Kim, Minho Lee (11 Jun 2024)

InversionView: A General-Purpose Method for Reading Information from Neural Activations
Xinting Huang, Madhur Panwar, Navin Goyal, Michael Hahn (27 May 2024)

Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models
Charles O'Neill, Thang Bui (21 May 2024)

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning
Dan Braun, Jordan K. Taylor, Nicholas Goldowsky-Dill, Lee D. Sharkey (17 May 2024)

Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
Aleksandar Makelov, Georg Lange, Neel Nanda (14 May 2024)

Improving Dictionary Learning with Gated Sparse Autoencoders
Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda (24 Apr 2024) [RALM]

Understanding the role of FFNs in driving multilingual behaviour in LLMs
Sunit Bhattacharya, Ondřej Bojar (22 Apr 2024)

RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations
Jing-ling Huang, Zhengxuan Wu, Christopher Potts, Mor Geva, Atticus Geiger (27 Feb 2024)

Backward Lens: Projecting Language Model Gradients into the Vocabulary Space
Shahar Katz, Yonatan Belinkov, Mor Geva, Lior Wolf (20 Feb 2024)

Symbolic Autoencoding for Self-Supervised Sequence Learning
Mohammad Hossein Amani, Nicolas Mario Baldwin, Amin Mansouri, Martin Josifoski, Maxime Peyrard, Robert West (16 Feb 2024)

Improving Semantic Control in Discrete Latent Spaces with Transformer Quantized Variational Autoencoders
Yingji Zhang, Danilo S. Carvalho, Marco Valentino, Ian Pratt-Hartmann, André Freitas (01 Feb 2024) [DRL]

Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations
Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas F. Icard, Noah D. Goodman (05 Mar 2023) [CML]

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt (01 Nov 2022)

Post-hoc Concept Bottleneck Models
Mert Yuksekgonul, Maggie Wang, James Y. Zou (31 May 2022)

Editing a classifier by rewriting its prediction rules
Shibani Santurkar, Dimitris Tsipras, Mahalaxmi Elango, David Bau, Antonio Torralba, A. Madry (02 Dec 2021) [KELM]

Fast Model Editing at Scale
E. Mitchell, Charles Lin, Antoine Bosselut, Chelsea Finn, Christopher D. Manning (21 Oct 2021) [KELM]