Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
arXiv: 2408.05147
9 August 2024
Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, Neel Nanda

Papers citing "Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2" (16 of 66 papers shown)

Mechanistic Permutability: Match Features Across Layers
Nikita Balagansky, Ian Maksimov, Daniil Gavrilov
10 Oct 2024

The Geometry of Concepts: Sparse Autoencoder Feature Structure
Yuxiao Li, Eric J. Michaud, David D. Baek, Joshua Engels, Xiaoqing Sun, Max Tegmark
10 Oct 2024

SAGE: Scalable Ground Truth Evaluations for Large Sparse Autoencoders
Constantin Venhoff, Anisoara Calinescu, Philip H. S. Torr, Christian Schroeder de Witt
09 Oct 2024

Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models
Michael Lan, Philip H. S. Torr, Austin Meek, Ashkan Khakzar, David M. Krueger, Fazl Barez
09 Oct 2024

Sparse Autoencoders Reveal Temporal Difference Learning in Large Language Models
Can Demircan, Tankred Saanum, A. Jagadish, Marcel Binz, Eric Schulz
02 Oct 2024

Towards Inference-time Category-wise Safety Steering for Large Language Models
Amrita Bhattacharjee, Shaona Ghosh, Traian Rebedea, Christopher Parisien
02 Oct 2024

A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders
David Chanin, James Wilken-Smith, Tomáš Dulka, Hardik Bhatnagar, Joseph Bloom
22 Sep 2024

Residual Stream Analysis with Multi-Layer SAEs
Tim Lawson, Lucy Farnik, Conor Houghton, Laurence Aitchison
06 Sep 2024

Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small
Maheep Chaudhary, Atticus Geiger
05 Sep 2024

Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models
Adam Karvonen, Benjamin Wright, Can Rager, Rico Angell, Jannik Brinkmann, Logan Smith, C. M. Verdun, David Bau, Samuel Marks
31 Jul 2024

A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov, Ziyu Yao
02 Jul 2024

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt
01 Nov 2022

Toy Models of Superposition
Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, T. Henighan, ..., Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, C. Olah
21 Sep 2022

Probing Classifiers: Promises, Shortcomings, and Advances
Yonatan Belinkov
24 Feb 2021

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
M. Shoeybi, M. Patwary, Raul Puri, P. LeGresley, Jared Casper, Bryan Catanzaro
17 Sep 2019

Efficient Estimation of Word Representations in Vector Space
Tomáš Mikolov, Kai Chen, G. Corrado, J. Dean
16 Jan 2013