SplInterp: Improving our Understanding and Training of Sparse Autoencoders

17 May 2025
Jeremy Budd
Javier Ideami
Benjamin Macdowall Rynne
Keith Duggar
Randall Balestriero
Main: 10 pages
38 figures
Bibliography: 2 pages
Appendix: 32 pages
Abstract

Sparse autoencoders (SAEs) have received considerable recent attention as tools for mechanistic interpretability, showing success at extracting interpretable features even from very large LLMs. However, this research has been largely empirical, and there have been recent doubts about the true utility of SAEs. In this work, we seek to enhance the theoretical understanding of SAEs, using the spline theory of deep learning. By situating SAEs in this framework: we discover that SAEs generalise "$k$-means autoencoders" to be piecewise affine, but sacrifice accuracy for interpretability vs. the optimal "$k$-means-esque plus local principal component analysis (PCA)" piecewise affine autoencoder. We characterise the underlying geometry of (TopK) SAEs using power diagrams. And we develop a novel proximal alternating method SGD (PAM-SGD) algorithm for training SAEs, with both solid theoretical foundations and promising empirical results in MNIST and LLM experiments, particularly in sample efficiency and (in the LLM setting) improved sparsity of codes. All code is available at: this https URL
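
To make the "piecewise affine" claim concrete, the following is a minimal NumPy sketch, not the authors' code. It assumes a plain TopK SAE with a linear encoder and decoder and no extra ReLU; the function name `topk_sae` and all shapes and parameters are hypothetical. Once the set of active latents is fixed, the reconstruction is an affine function of the input, so the autoencoder is affine on each region of input space that shares the same active set.

```python
import numpy as np

def topk_sae(x, W_enc, b_enc, W_dec, b_dec, k):
    """Forward pass of a TopK sparse autoencoder (illustrative sketch only)."""
    pre = W_enc @ x + b_enc            # encoder pre-activations, shape (m,)
    active = np.argsort(pre)[-k:]      # indices of the k largest pre-activations
    code = np.zeros_like(pre)
    code[active] = pre[active]         # keep the top-k entries, zero the rest
    x_hat = W_dec @ code + b_dec       # linear decoder
    return x_hat, code, sorted(active.tolist())

# Hypothetical sizes and random parameters, chosen only for the demonstration.
rng = np.random.default_rng(0)
d, m, k = 4, 16, 3
W_enc, b_enc = rng.normal(size=(m, d)), rng.normal(size=m)
W_dec, b_dec = rng.normal(size=(d, m)), rng.normal(size=d)

x = rng.normal(size=d)
x_hat, code, S = topk_sae(x, W_enc, b_enc, W_dec, b_dec, k)

# On the region where the active set is S, the SAE equals the affine map A x + c.
A = W_dec[:, S] @ W_enc[S, :]
c = W_dec[:, S] @ b_enc[S] + b_dec
print(np.allclose(x_hat, A @ x + c))
```

The region on which a given active set `S` is selected is cut out by linear inequalities between pre-activations; this partition of input space is the geometry that the abstract says is characterised using power diagrams.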

@article{budd2025_2505.11836,
  title={SplInterp: Improving our Understanding and Training of Sparse Autoencoders},
  author={Jeremy Budd and Javier Ideami and Benjamin Macdowall Rynne and Keith Duggar and Randall Balestriero},
  journal={arXiv preprint arXiv:2505.11836},
  year={2025}
}