2407.14435
Cited By
Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders
19 July 2024
Senthooran Rajamanoharan
Tom Lieberum
Nicolas Sonnerat
Arthur Conmy
Vikrant Varma
János Kramár
Neel Nanda
Papers citing
"Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders"
29 / 29 papers shown
Title
Are Sparse Autoencoders Useful for Java Function Bug Detection?
Rui Melo
Claudia Mamede
Andre Catarino
Rui Abreu
Henrique Lopes Cardoso
31
0
0
15 May 2025
Unveiling Language-Specific Features in Large Language Models via Sparse Autoencoders
Boyi Deng
Yidan Zhang
Baosong Yang
Fuli Feng
41
0
0
08 May 2025
A Mathematical Philosophy of Explanations in Mechanistic Interpretability -- The Strange Science Part I.i
Kola Ayonrinde
Louis Jaburi
MILM
88
1
0
01 May 2025
Prisma: An Open Source Toolkit for Mechanistic Interpretability in Vision and Video
Sonia Joseph
Praneet Suresh
Lorenz Hufe
Edward Stevinson
Robert Graham
Yash Vadi
Danilo Bzdok
Sebastian Lapuschkin
Lee Sharkey
Blake A. Richards
72
0
0
28 Apr 2025
Investigating task-specific prompts and sparse autoencoders for activation monitoring
Henk Tillman
Dan Mossing
LLMSV
50
0
0
28 Apr 2025
Interpreting the Linear Structure of Vision-language Model Embedding Spaces
Isabel Papadimitriou
Huangyuan Su
Thomas Fel
Naomi Saphra
Sham Kakade
Stephanie Gil
VLM
54
0
0
16 Apr 2025
Revisiting End-To-End Sparse Autoencoder Training: A Short Finetune Is All You Need
Adam Karvonen
34
0
0
21 Mar 2025
I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data?
Yuhang Liu
Dong Gong
Erdun Gao
Zhen Zhang
Biwei Huang
Anton van den Hengel
Javen Qinfeng Shi
199
0
0
12 Mar 2025
Towards Understanding Distilled Reasoning Models: A Representational Approach
David D. Baek
Max Tegmark
LRM
80
3
0
05 Mar 2025
FADE: Why Bad Descriptions Happen to Good Features
Bruno Puri
Aakriti Jain
Elena Golimblevskaia
Patrick Kahardipraja
Thomas Wiegand
Wojciech Samek
Sebastian Lapuschkin
135
0
0
24 Feb 2025
SAIF: A Sparse Autoencoder Framework for Interpreting and Steering Instruction Following of Language Models
Z. He
Haiyan Zhao
Yiran Qiao
Fan Yang
Ali Payani
Jing Ma
Mengnan Du
LLMSV
74
2
0
17 Feb 2025
Universal Sparse Autoencoders: Interpretable Cross-Model Concept Alignment
Harrish Thasarathan
Julian Forsyth
Thomas Fel
M. Kowal
Konstantinos G. Derpanis
111
7
0
06 Feb 2025
Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words
Gouki Minegishi
Hiroki Furuta
Yusuke Iwasawa
Y. Matsuo
49
1
0
09 Jan 2025
Tracking the Feature Dynamics in LLM Training: A Mechanistic Study
Yang Xu
Yansen Wang
Hao Wang
159
1
0
23 Dec 2024
Towards scientific discovery with dictionary learning: Extracting biological concepts from microscopy foundation models
Konstantin Donhauser
Kristina Ulicna
Gemma Elyse Moran
Aditya Ravuri
Kian Kenyon-Dean
Cian Eastwood
Jason Hartford
81
0
0
20 Dec 2024
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
Javier Ferrando
Oscar Obeso
Senthooran Rajamanoharan
Neel Nanda
85
12
0
21 Nov 2024
Compute Optimal Inference and Provable Amortisation Gap in Sparse Autoencoders
Charles O'Neill
David Klindt
98
1
0
20 Nov 2024
Towards Unifying Interpretability and Control: Evaluation via Intervention
Usha Bhalla
Suraj Srinivas
Asma Ghandeharioun
Himabindu Lakkaraju
42
5
0
07 Nov 2024
Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering
Yu Zhao
Alessio Devoto
Giwon Hong
Xiaotang Du
Aryo Pradipta Gema
Hongru Wang
Xuanli He
Kam-Fai Wong
Pasquale Minervini
KELM
LLMSV
42
16
0
21 Oct 2024
Decomposing The Dark Matter of Sparse Autoencoders
Joshua Engels
Logan Riggs
Max Tegmark
LLMSV
65
10
0
18 Oct 2024
The Geometry of Concepts: Sparse Autoencoder Feature Structure
Yuxiao Li
Eric J. Michaud
David D. Baek
Joshua Engels
Xiaoqing Sun
Max Tegmark
58
7
0
10 Oct 2024
Quantifying Feature Space Universality Across Large Language Models via Sparse Autoencoders
Michael Lan
Philip Torr
Austin Meek
Ashkan Khakzar
David M. Krueger
Fazl Barez
43
11
0
09 Oct 2024
Residual Stream Analysis with Multi-Layer SAEs
Tim Lawson
Lucy Farnik
Conor Houghton
Laurence Aitchison
31
3
0
06 Sep 2024
On the Complexity of Neural Computation in Superposition
Micah Adler
Nir Shavit
123
3
0
05 Sep 2024
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
Daking Rai
Yilun Zhou
Shi Feng
Abulhair Saparov
Ziyu Yao
82
19
0
02 Jul 2024
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Samuel Marks
Can Rager
Eric J. Michaud
Yonatan Belinkov
David Bau
Aaron Mueller
46
115
0
28 Mar 2024
Finding Neurons in a Haystack: Case Studies with Sparse Probing
Wes Gurnee
Neel Nanda
Matthew Pauly
Katherine Harvey
Dmitrii Troitskii
Dimitris Bertsimas
MILM
162
190
0
02 May 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang
Alexandre Variengien
Arthur Conmy
Buck Shlegeris
Jacob Steinhardt
212
497
0
01 Nov 2022
Extremely Simple Activation Shaping for Out-of-Distribution Detection
Andrija Djurisic
Nebojsa Bozanic
Arjun Ashok
Rosanne Liu
OODD
172
151
0
20 Sep 2022