Decomposing The Dark Matter of Sparse Autoencoders

18 October 2024

Papers citing "Decomposing The Dark Matter of Sparse Autoencoders"


Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition. Zhengfu He, J. Wang, Rui Lin, Xuyang Ge, Wentao Shu, Qiong Tang, J. Zhang, Xipeng Qiu. 29 Apr 2025
Representation Learning on a Random Lattice. Aryeh Brill. 28 Apr 2025
Robustly identifying concepts introduced during chat fine-tuning using crosscoders. Julian Minder, Clement Dumas, Caden Juang, Bilal Chughtai, Neel Nanda. 03 Apr 2025
Evaluating and Designing Sparse Autoencoders by Approximating Quasi-Orthogonality. Sewoong Lee, Adam Davies, Marc E. Canby, J. Hockenmaier. 31 Mar 2025
Identifying Sparsely Active Circuits Through Local Loss Landscape Decomposition. Brianna Chrisman, Lucius Bushnaq, Lee D. Sharkey. 31 Mar 2025
Projecting Assumptions: The Duality Between Sparse Autoencoders and Concept Geometry. Sai Sumedh R. Hindupur, Ekdeep Singh Lubana, Thomas Fel, Demba Ba. 03 Mar 2025
Are Sparse Autoencoders Useful? A Case Study in Sparse Probing. Subhash Kantamneni, Joshua Engels, Senthooran Rajamanoharan, Max Tegmark, Neel Nanda. 23 Feb 2025
Steering Language Model Refusal with Sparse Autoencoders. Kyle O'Brien, David Majercak, Xavier Fernandes, Richard Edgar, Jingya Chen, Harsha Nori, Dean Carignan, Eric Horvitz, Forough Poursabzi-Sangdeh. 18 Nov 2024
Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs. Daniel J. Lee, Stefan Heimersheim. 16 Oct 2024
Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models. Michael Lan, Philip H. S. Torr, Austin Meek, Ashkan Khakzar, David M. Krueger, Fazl Barez. 09 Oct 2024