Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models

31 July 2024

Papers citing "Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models"

7 / 7 papers shown

Title
Revisiting End-To-End Sparse Autoencoder Training: A Short Finetune Is All You Need Adam Karvonen 31 0 0 21 Mar 2025
I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data? Yuhang Liu Dong Gong Erdun Gao Zhen Zhang Biwei Huang Mingming Gong Anton van den Hengel Javen Qinfeng Shi J. Shi 67 0 0 12 Mar 2025
The Complexity of Learning Sparse Superposed Features with Feedback Akash Kumar 60 0 0 08 Feb 2025
Large Language Models Share Representations of Latent Grammatical Concepts Across Typologically Diverse Languages Jannik Brinkmann Chris Wendler Christian Bartelt Aaron Mueller 41 9 0 10 Jan 2025
Transformers Use Causal World Models in Maze-Solving Tasks Alex F Spies William Edwards Michael I. Ivanitskiy Adrians Skapars Tilman Rauker Katsumi Inoue A. Russo Murray Shanahan 93 1 0 16 Dec 2024
Decomposing The Dark Matter of Sparse Autoencoders Joshua Engels Logan Riggs Max Tegmark LLMSV 50 9 0 18 Oct 2024
Toy Models of Superposition Nelson Elhage Tristan Hume Catherine Olsson Nicholas Schiefer T. Henighan ... Sam McCandlish Jared Kaplan Dario Amodei Martin Wattenberg C. Olah AAML MILM 120 314 0 21 Sep 2022