2408.00113
Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models
31 July 2024
Adam Karvonen
Benjamin Wright
Can Rager
Rico Angell
Jannik Brinkmann
Logan Smith
C. M. Verdun
David Bau
Samuel Marks
Papers citing "Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models"
7 / 7 papers shown
Title
Revisiting End-To-End Sparse Autoencoder Training: A Short Finetune Is All You Need
Adam Karvonen
31
0
0
21 Mar 2025
I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data?
Yuhang Liu
Dong Gong
Erdun Gao
Zhen Zhang
Biwei Huang
Mingming Gong
Anton van den Hengel
Javen Qinfeng Shi
J. Shi
67
0
0
12 Mar 2025
The Complexity of Learning Sparse Superposed Features with Feedback
Akash Kumar
60
0
0
08 Feb 2025
Large Language Models Share Representations of Latent Grammatical Concepts Across Typologically Diverse Languages
Jannik Brinkmann
Chris Wendler
Christian Bartelt
Aaron Mueller
41
9
0
10 Jan 2025
Transformers Use Causal World Models in Maze-Solving Tasks
Alex F Spies
William Edwards
Michael I. Ivanitskiy
Adrians Skapars
Tilman Rauker
Katsumi Inoue
A. Russo
Murray Shanahan
93
1
0
16 Dec 2024
Decomposing The Dark Matter of Sparse Autoencoders
Joshua Engels
Logan Riggs
Max Tegmark
LLMSV
50
9
0
18 Oct 2024
Toy Models of Superposition
Nelson Elhage
Tristan Hume
Catherine Olsson
Nicholas Schiefer
T. Henighan
...
Sam McCandlish
Jared Kaplan
Dario Amodei
Martin Wattenberg
C. Olah
AAML
MILM
120
314
0
21 Sep 2022