Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders

19 July 2024

Senthooran Rajamanoharan

Papers citing "Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders"

50 / 128 papers shown

Are Sparse Autoencoders Useful for Java Function Bug Detection?

Henrique Lopes Cardoso

506

10 Apr 2026

AlignSAE: Concept-Aligned Sparse Autoencoders

Steven Bethard

Mihai Surdeanu

Liangming Pan

LLMSV

452

01 Dec 2025

SAGE: An Agentic Explainer Framework for Interpreting SAE Features in Language Models

158

25 Nov 2025

Sparse Autoencoders are Topic Models

Leander Girrbach

Zeynep Akata

165

20 Nov 2025

Weight-sparse transformers have interpretable circuits

327

17 Nov 2025

Visual Exploration of Feature Relationships in Sparse Autoencoders with Curated Concepts

214

08 Nov 2025

Beyond Redundancy: Diverse and Specialized Multi-Expert Sparse Autoencoder

336

07 Nov 2025

Making Interpretable Discoveries from Unstructured Data: A High-Dimensional Multiple Hypothesis Testing Approach

Jacob Carlson

154

03 Nov 2025

Finding Manifolds With Bilinear Autoencoders

Thomas Dooms

Ward Gauderis

167

19 Oct 2025

Time-Aware Feature Selection: Adaptive Temporal Masking for Stable Sparse Autoencoder Training

T. Ed Li

Junyu Ren

09 Oct 2025

Memory Retrieval and Consolidation in Large Language Models through Function Tokens

122

09 Oct 2025

Semantic Regexes: Auto-Interpreting LLM Features with a Structured Language

192

07 Oct 2025

Does higher interpretability imply better utility? A Pairwise Analysis on Sparse Autoencoders

262

04 Oct 2025

Interpreting Language Models Through Concept Descriptions: A Survey

Nils Feldhus

Laura Kopf

MILM

196

01 Oct 2025

AbsTopK: Rethinking Sparse Autoencoders For Bidirectional Features

Xudong Zhu

Mohammad Mahdi Khalili

Zhihui Zhu

303

01 Oct 2025

Sparse Autoencoders Make Audio Foundation Models more Explainable

160

29 Sep 2025

Binary Sparse Coding for Interpretability

Lucia Quirke

Stepan Shabalin

Nora Belrose

134

29 Sep 2025

LLM Interpretability with Identifiable Temporal-Instantaneous Representation

188

27 Sep 2025

Analysis of Variational Sparse Autoencoders

Zachary Baker

Yuxiao Li

DRL

373

26 Sep 2025

OrtSAE: Orthogonal Sparse Autoencoders Uncover Atomic Features

188

26 Sep 2025

Binary Autoencoder for Mechanistic Interpretability of Large Language Models

292

25 Sep 2025

Towards Atoms of Large Language Models

149

25 Sep 2025

GPT and Prejudice: A Sparse Approach to Understanding Learned Representations in Large Language Models

Mariam Mahran

Katharina Simbeck

386

24 Sep 2025

Mechanistic Interpretability with SAEs: Probing Religion, Violence, and Geography in Large Language Models

Katharina Simbeck

Mariam Mahran

MILM LLMSV

304

22 Sep 2025

Evolution of Concepts in Language Model Pre-Training

159

21 Sep 2025

ConceptViz: A Visual Analytics Approach for Exploring Concepts in Large Language Models

208

20 Sep 2025

Sparse-Autoencoder-Guided Internal Representation Unlearning for Large Language Models

157

19 Sep 2025

The Anatomy of Alignment: Decomposing Preference Optimization by Steering Sparse Features

Jeremias Lino Ferrao

Matthijs van der Lende

Ilija Lichkovski

Clement Neo

LLMSV

349

16 Sep 2025

Rethinking Sparse Autoencoders: Select-and-Project for Fairness and Control from Encoder Features Alone

Antonio Bărbălău

Cristian Daniel Păduraru

Teodor Poncu

Alexandru Tifrea

Elena Burceanu

250

13 Sep 2025

Safe-SAIL: Towards a Fine-grained Safety Landscape of Large Language Models via Sparse Autoencoder Interpretation Framework

180

11 Sep 2025

Crosscoding Through Time: Tracking Emergence & Consolidation Of Linguistic Representations Throughout LLM Pretraining

Deniz Bayazit

Aaron Mueller

Antoine Bosselut

160

05 Sep 2025

Mechanistic Interpretability with Sparse Autoencoder Neural Operators

Bahareh Tolooshams

Ailsa Shen

A. Anandkumar

175

03 Sep 2025

Understanding sparse autoencoder scaling in the presence of feature manifolds

Eric J. Michaud

Liv Gorton

Tom McGrath

297

02 Sep 2025

CE-Bench: Towards a Reliable Contrastive Evaluation Benchmark of Interpretability of Sparse Autoencoders

Alex Gulko

Yusen Peng

Sachin Kumar

199

31 Aug 2025

AdaptiveK Sparse Autoencoders: Dynamic Sparsity Allocation for Interpretable LLM Representations

Yifei Yao

Mengnan Du

220

24 Aug 2025

Dimensional Collapse in Transformer Attention Outputs: A Challenge for Sparse Dictionary Learning

241

23 Aug 2025

Sparse but Wrong: Incorrect L0 Leads to Incorrect Features in Sparse Autoencoders

David Chanin

Adrià Garriga-Alonso

253

22 Aug 2025

Evaluating Sparse Autoencoders for Monosemantic Representation

Moghis Fereidouni

Muhammad Umair Haider

Peizhong Ju

A.B. Siddique

200

20 Aug 2025

CorrSteer: Generation-Time LLM Steering via Correlated Sparse Autoencoder Features

Seonglae Cho

Zekun Wu

Adriano Soares Koshiyama

LLMSV

383

18 Aug 2025

Resurrecting the Salmon: Rethinking Mechanistic Interpretability with Domain-Specific Sparse Autoencoders

Charles O'Neill

Mudith Jayasekara

Max Kirkby

154

12 Aug 2025

Interpretable Reward Model via Sparse Autoencoder

744

12 Aug 2025

Watch the Weights: Unsupervised monitoring and control of fine-tuned LLMs

Ziqian Zhong

Aditi Raghunathan

259

31 Jul 2025

Interpreting CFD Surrogates through Sparse Autoencoders

Yeping Hu

Shusen Liu

AI4CE

192

21 Jul 2025

Semantic Convergence: Investigating Shared Representations Across Scaled LLMs

175

21 Jul 2025

SASFT: Sparse Autoencoder-guided Supervised Finetuning to Mitigate Unexpected Code-Switching in LLMs

242

20 Jul 2025

From Black Box to Biomarker: Sparse Autoencoders for Interpreting Speech Models of Parkinson's Disease

Peter William VanHarn Plantinga

199

16 Jul 2025

SAFER: Probing Safety in Reward Models with Sparse Autoencoder

227

01 Jul 2025

Persona Features Control Emergent Misalignment

...

342

24 Jun 2025

Sparse Feature Coactivation Reveals Causal Semantic Modules in Large Language Models

252

22 Jun 2025

Dense SAE Latents Are Features, Not Bugs

Senthooran Rajamanoharan

Mrinmaya Sachan

Max Tegmark

438

18 Jun 2025