Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

9 August 2024

Tom Lieberum

Senthooran Rajamanoharan

Rohin Shah

Papers citing "Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2"

50 / 66 papers shown

Title
Are Sparse Autoencoders Useful for Java Function Bug Detection? Rui Melo Claudia Mamede Andre Catarino Rui Abreu Henrique Lopes Cardoso 14 0 0 15 May 2025
An Introduction to Discrete Variational Autoencoders Alan Jeffares Liyuan Liu DRL BDL CML 27 0 0 15 May 2025
Beyond Input Activations: Identifying Influential Latents by Gradient Sparse Autoencoders Dong Shu Xuansheng Wu Haiyan Zhao Mengnan Du Ninghao Liu LLMSV 37 0 0 12 May 2025
Patterns and Mechanisms of Contrastive Activation Engineering Yixiong Hao Ayush Panda Stepan Shabalin Sheikh Abdur Raheem Ali LLMSV 60 0 0 06 May 2025
Empirical Evaluation of Progressive Coding for Sparse Autoencoders Hans Peter Anders Søgaard 36 0 0 30 Apr 2025
Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition Zhengfu He J. Wang Rui Lin Xuyang Ge Wentao Shu Qiong Tang J. Zhang Xipeng Qiu 70 0 0 29 Apr 2025
Chain-of-Defensive-Thought: Structured Reasoning Elicits Robustness in Large Language Models against Reference Corruption Wenxiao Wang Parsa Hosseini S. Feizi LRM AI4CE 53 0 0 29 Apr 2025
Representation Learning on a Random Lattice Aryeh Brill OOD FAtt AI4CE 73 0 0 28 Apr 2025
Scaling sparse feature circuit finding for in-context learning Dmitrii Kharlapenko S. Kamath S Fazl Barez Arthur Conmy Neel Nanda 26 0 0 18 Apr 2025
MIB: A Mechanistic Interpretability Benchmark Aaron Mueller Atticus Geiger Sarah Wiegreffe Dana Arad Iván Arcuschin ... Alessandro Stolfo Martin Tutek Amir Zur David Bau Yonatan Belinkov 41 1 0 17 Apr 2025
$SAEs $\textit{Can}$ Improve Unlearning: Dynamic Sparse Autoencoder Guardrails for Precision Unlearning in LLMs$ SAEs $\textit{Can}$ Improve Unlearning: Dynamic Sparse Autoencoder Guardrails for Precision Unlearning in LLMs Aashiq Muhamed Jacopo Bonato Mona Diab Virginia Smith MU 58 0 0 11 Apr 2025
Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric Yixin Cao Jiahao Ying Y. Wang Xipeng Qiu Xuanjing Huang Yugang Jiang ELM 30 2 0 10 Apr 2025
Deceptive Automated Interpretability: Language Models Coordinating to Fool Oversight Systems Simon Lermen Mateusz Dziemian Natalia Pérez-Campanero Antolín 31 0 0 10 Apr 2025
Evaluating and Designing Sparse Autoencoders by Approximating Quasi-Orthogonality Sewoong Lee Adam Davies Marc E. Canby J. Hockenmaier LLMSV 65 0 0 31 Mar 2025
Learning Multi-Level Features with Matryoshka Sparse Autoencoders Bart Bussmann Noa Nabeshima Adam Karvonen Neel Nanda 54 0 0 21 Mar 2025
Revisiting End-To-End Sparse Autoencoder Training: A Short Finetune Is All You Need Adam Karvonen 34 0 0 21 Mar 2025
Towards LLM Guardrails via Sparse Representation Steering Zeqing He Zhibo Wang Huiyu Xu Kui Ren LLMSV 52 1 0 21 Mar 2025
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability Adam Karvonen Can Rager Johnny Lin Curt Tigges Joseph Isaac Bloom ... Kola Ayonrinde Matthew Wearden Arthur Conmy Samuel Marks Neel Nanda MU 55 8 0 12 Mar 2025
Mixture of Experts Made Intrinsically Interpretable Xingyi Yang Constantin Venhoff Ashkan Khakzar Christian Schroeder de Witt P. Dokania Adel Bibi Philip H. S. Torr MoE 49 0 0 05 Mar 2025
Towards Understanding Distilled Reasoning Models: A Representational Approach David D. Baek Max Tegmark LRM 75 2 0 05 Mar 2025
Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders Kristian Kuznetsov Laida Kushnareva Polina Druzhinina Anton Razzhigaev Anastasia Voznyuk Irina Piontkovskaya Evgeny Burnaev Serguei Barannikov 34 0 0 05 Mar 2025
SAFE: A Sparse Autoencoder-Based Framework for Robust Query Enrichment and Hallucination Mitigation in LLMs Samir Abdaljalil Filippo Pallucchini Andrea Seveso Hasan Kurban Fabio Mercorio Erchin Serpedin HILM 72 0 0 04 Mar 2025
Projecting Assumptions: The Duality Between Sparse Autoencoders and Concept Geometry Sai Sumedh R. Hindupur Ekdeep Singh Lubana Thomas Fel Demba Ba 37 4 0 03 Mar 2025
Interpreting CLIP with Hierarchical Sparse Autoencoders Vladimir Zaigrajew Hubert Baniecki P. Biecek 49 0 0 27 Feb 2025
Do Sparse Autoencoders Generalize? A Case Study of Answerability Lovis Heindrich Philip H. S. Torr Fazl Barez Veronika Thost 69 1 0 27 Feb 2025
Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders Xuansheng Wu Jiayi Yuan Wenlin Yao Xiaoming Zhai Ninghao Liu LLMSV 78 4 0 24 Feb 2025
FADE: Why Bad Descriptions Happen to Good Features Bruno Puri Aakriti Jain Elena Golimblevskaia Patrick Kahardipraja Thomas Wiegand Wojciech Samek Sebastian Lapuschkin 100 0 0 24 Feb 2025
The Knowledge Microscope: Features as Better Analytical Lenses than Neurons Yuheng Chen Pengfei Cao Kang Liu Jun Zhao 43 0 0 18 Feb 2025
Policy-to-Language: Train LLMs to Explain Decisions with Flow-Matching Generated Rewards Xinyi Yang Liang Zeng Heng Dong C. Yu X. Wu H. Yang Yu Wang Milind Tambe Tonghan Wang 68 2 0 18 Feb 2025
The Multi-Faceted Monosemanticity in Multimodal Representations Hanqi Yan Xiangxiang Cui Lu Yin Paul Pu Liang Yulan He Yifei Wang 44 0 0 16 Feb 2025
SEER: Self-Explainability Enhancement of Large Language Models' Representations Guanxu Chen Dongrui Liu Tao Luo Jing Shao LRM MILM 65 1 0 07 Feb 2025
Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words Gouki Minegishi Hiroki Furuta Yusuke Iwasawa Y. Matsuo 49 1 0 09 Jan 2025
Can Input Attributions Interpret the Inductive Reasoning Process Elicited in In-Context Learning? Mengyu Ye Tatsuki Kuribayashi Goro Kobayashi Jun Suzuki LRM 87 0 0 20 Dec 2024
VISTA: A Panoramic View of Neural Representations Tom White 62 0 0 03 Dec 2024
Think-to-Talk or Talk-to-Think? When LLMs Come Up with an Answer in Multi-Step Arithmetic Reasoning Keito Kudo Yoichi Aoki Tatsuki Kuribayashi Shusaku Sone Masaya Taniguchi Ana Brassard Keisuke Sakaguchi Kentaro Inui ReLM LRM 69 0 0 02 Dec 2024
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models Javier Ferrando Oscar Obeso Senthooran Rajamanoharan Neel Nanda 75 10 0 21 Nov 2024
Compute Optimal Inference and Provable Amortisation Gap in Sparse Autoencoders Charles OÑeill David Klindt David Klindt 84 1 0 20 Nov 2024
SCAR: Sparse Conditioned Autoencoders for Concept Detection and Steering in LLMs Ruben Härle Felix Friedrich Manuel Brack Bjorn Deiseroth P. Schramowski Kristian Kersting 23 0 0 11 Nov 2024
Beyond Toxic Neurons: A Mechanistic Analysis of DPO for Toxicity Reduction Yushi Yang Filip Sondej Harry Mayne Adam Mahdi 24 0 0 10 Nov 2024
Towards Unifying Interpretability and Control: Evaluation via Intervention Usha Bhalla Suraj Srinivas Asma Ghandeharioun Himabindu Lakkaraju 38 5 0 07 Nov 2024
Improving Steering Vectors by Targeting Sparse Autoencoder Features Sviatoslav Chalnev Matthew Siu Arthur Conmy LLMSV 47 16 0 04 Nov 2024
Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in Foundation Models Aashiq Muhamed Mona Diab Virginia Smith 38 2 0 01 Nov 2024
Efficient Training of Sparse Autoencoders for Large Language Models via Layer Groups Davide Ghilardi Federico Belotti Marco Molinari 21 2 0 28 Oct 2024
Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders Viacheslav Surkov Chris Wendler Mikhail Terekhov Justin Deschenaux Robert West Çağlar Gülçehre VLM 40 13 0 28 Oct 2024
Beyond Interpretability: The Gains of Feature Monosemanticity on Model Robustness Qi Zhang Yifei Wang Jingyi Cui Xiang Pan Qi Lei Stefanie Jegelka Yisen Wang AAML 34 1 0 27 Oct 2024
Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders Zhengfu He Wentao Shu Xuyang Ge Lingjie Chen Junxuan Wang ... Qipeng Guo Xuanjing Huang Zuxuan Wu Yu Jiang Xipeng Qiu 21 13 0 27 Oct 2024
Applying sparse autoencoders to unlearn knowledge in language models Eoin Farrell Yeu-Tong Lau Arthur Conmy MU 29 14 0 25 Oct 2024
Decomposing The Dark Matter of Sparse Autoencoders Joshua Engels Logan Riggs Max Tegmark LLMSV 55 9 0 18 Oct 2024
The Persian Rug: solving toy models of superposition using large-scale symmetries Aditya Cowsik Kfir Dolev Alex Infanger 19 0 0 15 Oct 2024
Analyzing (In)Abilities of SAEs via Formal Languages Abhinav Menon Manish Shrivastava David M. Krueger Ekdeep Singh Lubana 42 7 0 15 Oct 2024