ResearchTrend.AI
© 2025 ResearchTrend.AI, All rights reserved.

Codebook Features: Sparse and Discrete Interpretability for Neural Networks

26 October 2023
Alex Tamkin, Mohammad Taufeeque, Noah D. Goodman

Papers citing "Codebook Features: Sparse and Discrete Interpretability for Neural Networks"

29 / 29 papers shown
Self-Ablating Transformers: More Interpretability, Less Sparsity
Jeremias Ferrao, Luhan Mikaelson, Keenan Pepper, Natalia Perez-Campanero Antolin
MILM · 21 · 0 · 0 · 01 May 2025

Interpreting the Linear Structure of Vision-language Model Embedding Spaces
Isabel Papadimitriou, Huangyuan Su, Thomas Fel, Naomi Saphra, Sham Kakade, Stephanie Gil
VLM · 40 · 0 · 0 · 16 Apr 2025

Human Motion Unlearning
Edoardo De Matteis, Matteo Migliarini, Alessio Sampieri, Indro Spinelli, Fabio Galasso
MU · 55 · 0 · 0 · 24 Mar 2025

Universal Sparse Autoencoders: Interpretable Cross-Model Concept Alignment
Harrish Thasarathan, Julian Forsyth, Thomas Fel, M. Kowal, Konstantinos G. Derpanis
100 · 7 · 0 · 06 Feb 2025

Towards scientific discovery with dictionary learning: Extracting biological concepts from microscopy foundation models
Konstantin Donhauser, Kristina Ulicna, Gemma Elyse Moran, Aditya Ravuri, Kian Kenyon-Dean, Cian Eastwood, Jason Hartford
76 · 0 · 0 · 20 Dec 2024

Local vs distributed representations: What is the right basis for interpretability?
Julien Colin, L. Goetschalckx, Thomas Fel, Victor Boutin, Jay Gopal, Thomas Serre, Nuria Oliver
HAI · 21 · 2 · 0 · 06 Nov 2024

Residual vector quantization for KV cache compression in large language model
Ankur Kumar
MQ · 29 · 0 · 0 · 21 Oct 2024

CodeUnlearn: Amortized Zero-Shot Machine Unlearning in Language Models Using Discrete Concept
YuXuan Wu, Bonaventure F. P. Dossou, Dianbo Liu
MU · 16 · 0 · 0 · 08 Oct 2024

Mathematical Models of Computation in Superposition
Kaarel Hänni, Jake Mendel, Dmitry Vaintrob, Lawrence Chan
SupR · 20 · 7 · 0 · 10 Aug 2024

The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability
Aaron Mueller, Jannik Brinkmann, Millicent Li, Samuel Marks, Koyena Pal, ..., Arnab Sen Sharma, Jiuding Sun, Eric Todd, David Bau, Yonatan Belinkov
CML · 42 · 18 · 0 · 02 Aug 2024

The Interpretability of Codebooks in Model-Based Reinforcement Learning is Limited
Kenneth Eaton, Jonathan C. Balloch, Julia Kim, Mark O. Riedl
FAtt, OffRL · 21 · 0 · 0 · 28 Jul 2024

Knowledge Mechanisms in Large Language Models: A Survey and Perspective
Meng Wang, Yunzhi Yao, Ziwen Xu, Shuofei Qiao, Shumin Deng, ..., Yong-jia Jiang, Pengjun Xie, Fei Huang, Huajun Chen, Ningyu Zhang
47 · 27 · 0 · 22 Jul 2024

Neural Concept Binder
Wolfgang Stammer, Antonia Wüst, David Steinmann, Kristian Kersting
OCL · 29 · 4 · 0 · 14 Jun 2024

Discrete Dictionary-based Decomposition Layer for Structured Representation Learning
Taewon Park, Hyun-Chul Kim, Minho Lee
31 · 0 · 0 · 11 Jun 2024

InversionView: A General-Purpose Method for Reading Information from Neural Activations
Xinting Huang, Madhur Panwar, Navin Goyal, Michael Hahn
26 · 3 · 0 · 27 May 2024

Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models
Charles O'Neill, Thang Bui
25 · 5 · 0 · 21 May 2024

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning
Dan Braun, Jordan K. Taylor, Nicholas Goldowsky-Dill, Lee D. Sharkey
21 · 37 · 0 · 17 May 2024

Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
Aleksandar Makelov, Georg Lange, Neel Nanda
24 · 33 · 0 · 14 May 2024

Improving Dictionary Learning with Gated Sparse Autoencoders
Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda
RALM · 20 · 78 · 0 · 24 Apr 2024

Understanding the role of FFNs in driving multilingual behaviour in LLMs
Sunit Bhattacharya, Ondrej Bojar
21 · 2 · 0 · 22 Apr 2024

RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations
Jing-ling Huang, Zhengxuan Wu, Christopher Potts, Mor Geva, Atticus Geiger
55 · 26 · 0 · 27 Feb 2024

Backward Lens: Projecting Language Model Gradients into the Vocabulary Space
Shahar Katz, Yonatan Belinkov, Mor Geva, Lior Wolf
47 · 10 · 1 · 20 Feb 2024

Symbolic Autoencoding for Self-Supervised Sequence Learning
Mohammad Hossein Amani, Nicolas Mario Baldwin, Amin Mansouri, Martin Josifoski, Maxime Peyrard, Robert West
13 · 1 · 0 · 16 Feb 2024

Improving Semantic Control in Discrete Latent Spaces with Transformer Quantized Variational Autoencoders
Yingji Zhang, Danilo S. Carvalho, Marco Valentino, Ian Pratt-Hartmann, André Freitas
DRL · 38 · 5 · 0 · 01 Feb 2024

Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations
Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas F. Icard, Noah D. Goodman
CML · 73 · 98 · 0 · 05 Mar 2023

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt
210 · 491 · 0 · 01 Nov 2022

Post-hoc Concept Bottleneck Models
Mert Yuksekgonul, Maggie Wang, James Y. Zou
133 · 183 · 0 · 31 May 2022

Editing a classifier by rewriting its prediction rules
Shibani Santurkar, Dimitris Tsipras, Mahalaxmi Elango, David Bau, Antonio Torralba, A. Madry
KELM · 175 · 89 · 0 · 02 Dec 2021

Fast Model Editing at Scale
E. Mitchell, Charles Lin, Antoine Bosselut, Chelsea Finn, Christopher D. Manning
KELM · 219 · 341 · 0 · 21 Oct 2021