arXiv:2409.04185 (v3, latest)
Residual Stream Analysis with Multi-Layer SAEs
International Conference on Learning Representations (ICLR), 2024
6 September 2024
Tim Lawson
Lucy Farnik
Conor Houghton
Laurence Aitchison
Papers citing "Residual Stream Analysis with Multi-Layer SAEs" (42 papers)
Visual Exploration of Feature Relationships in Sparse Autoencoders with Curated Concepts
Xinyuan Yan
Shusen Liu
Kowshik Thopalli
Bei Wang
136
0
0
08 Nov 2025
UniFusion: Vision-Language Model as Unified Encoder in Image Generation
Kevin Li
Manuel Brack
Sudeep Katakol
Hareesh Ravi
Ajinkya Kale
120
2
0
14 Oct 2025
Activation Transport Operators
Andrzej Szablewski
Marek Masiak
80
0
0
24 Aug 2025
Beyond Linear Steering: Unified Multi-Attribute Control for Language Models
Narmeen Oozeer
Luke Marks
Fazl Barez
Amirali Abdullah
LLMSV
151
2
0
30 May 2025
Interpretability Illusions with Sparse Autoencoders: Evaluating Robustness of Concept Representations
Aaron Jiaxun Li
Suraj Srinivas
Usha Bhalla
Himabindu Lakkaraju
AAML
299
3
0
21 May 2025
Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations
Lucy Farnik
Tim Lawson
Conor Houghton
Laurence Aitchison
273
5
0
25 Feb 2025
Transformer Dynamics: A neuroscientific approach to interpretability of large language models
Jesseba Fernando
Grigori Guitchounts
AI4CE
213
3
0
17 Feb 2025
Steering Language Model Refusal with Sparse Autoencoders
Kyle O'Brien
David Majercak
Xavier Fernandes
Richard Edgar
Blake Bullwinkel
Jingya Chen
Harsha Nori
Dean Carignan
Eric Horvitz
Forough Poursabzi-Sangdeh
LLMSV
339
38
0
18 Nov 2024
Group-SAE: Efficient Training of Sparse Autoencoders for Large Language Models via Layer Groups
Davide Ghilardi
Federico Belotti
Marco Molinari
Tao Ma
Matteo Palmonari
201
9
0
28 Oct 2024
Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small
Maheep Chaudhary
Atticus Geiger
200
28
0
05 Sep 2024
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (BlackboxNLP), 2024
Tom Lieberum
Senthooran Rajamanoharan
Arthur Conmy
Lewis Smith
Nicolas Sonnerat
Vikrant Varma
János Kramár
Anca Dragan
Rohin Shah
Neel Nanda
263
217
0
09 Aug 2024
Gemma 2: Improving Open Language Models at a Practical Size
Gemma Team
Morgane Riviere
Shreya Pathak
Pier Giuseppe Sessa
Cassidy Hardin
...
Noah Fiedel
Armand Joulin
Kathleen Kenealy
Robert Dadashi
Alek Andreev
VLM
MoE
OSLM
589
1,490
0
31 Jul 2024
Relational Composition in Neural Networks: A Survey and Call to Action
Martin Wattenberg
Fernanda Viégas
CoGe
150
15
0
19 Jul 2024
Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders
Senthooran Rajamanoharan
Tom Lieberum
Nicolas Sonnerat
Arthur Conmy
Vikrant Varma
János Kramár
Neel Nanda
298
169
0
19 Jul 2024
The Remarkable Robustness of LLMs: Stages of Inference?
Vedang Lad
Wes Gurnee
Max Tegmark
414
81
0
27 Jun 2024
Interpreting Attention Layer Outputs with Sparse Autoencoders
Connor Kissane
Robert Krzyzanowski
Joseph Isaac Bloom
Arthur Conmy
Neel Nanda
MILM
207
35
0
25 Jun 2024
Transcoders Find Interpretable LLM Feature Circuits
Jacob Dunefsky
Philippe Chlenski
Neel Nanda
182
78
0
17 Jun 2024
Scaling and evaluating sparse autoencoders
Leo Gao
Tom Dupré la Tour
Henk Tillman
Gabriel Goh
Rajan Troll
Alec Radford
Ilya Sutskever
Jan Leike
Jeffrey Wu
232
283
0
06 Jun 2024
Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models
Charles O'Neill
Thang Bui
162
12
0
21 May 2024
Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning
Neural Information Processing Systems (NeurIPS), 2024
Dan Braun
Jordan K. Taylor
Nicholas Goldowsky-Dill
Lee D. Sharkey
272
52
0
17 May 2024
How does GPT-2 Predict Acronyms? Extracting and Understanding a Circuit via Mechanistic Interpretability
Jorge García-Carrasco
Alejandro Maté
Juan Trujillo
134
12
0
07 May 2024
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Samuel Marks
Can Rager
Eric J. Michaud
Yonatan Belinkov
David Bau
Aaron Mueller
467
238
0
28 Mar 2024
RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations
Jing Huang
Zhengxuan Wu
Christopher Potts
Mor Geva
Atticus Geiger
264
54
0
27 Feb 2024
Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic Interpretability: A Case Study on Othello-GPT
Zhengfu He
Xuyang Ge
Qiong Tang
Tianxiang Sun
Qinyuan Cheng
Xipeng Qiu
183
25
0
19 Feb 2024
The Linear Representation Hypothesis and the Geometry of Large Language Models
International Conference on Machine Learning (ICML), 2023
Kiho Park
Yo Joong Choe
Victor Veitch
LLMSV
MILM
425
310
0
07 Nov 2023
Sparse Autoencoders Find Highly Interpretable Features in Language Models
International Conference on Learning Representations (ICLR), 2023
Hoagy Cunningham
Aidan Ewart
Logan Riggs
R. Huben
Lee Sharkey
MILM
554
751
0
15 Sep 2023
Linearity of Relation Decoding in Transformer Language Models
International Conference on Learning Representations (ICLR), 2023
Evan Hernandez
Arnab Sen Sharma
Tal Haklay
Kevin Meng
Martin Wattenberg
Jacob Andreas
Yonatan Belinkov
David Bau
KELM
303
133
0
17 Aug 2023
Towards Automated Circuit Discovery for Mechanistic Interpretability
Neural Information Processing Systems (NeurIPS), 2023
Arthur Conmy
Augustine N. Mavor-Parker
Aengus Lynch
Stefan Heimersheim
Adrià Garriga-Alonso
477
432
0
28 Apr 2023
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
International Conference on Machine Learning (ICML), 2023
Stella Biderman
Hailey Schoelkopf
Quentin G. Anthony
Herbie Bradley
Kyle O'Brien
...
USVSN Sai Prashanth
Edward Raff
Aviya Skowron
Lintang Sutawika
Oskar van der Wal
352
1,597
0
03 Apr 2023
Eliciting Latent Predictions from Transformers with the Tuned Lens
Nora Belrose
Zach Furman
Logan Smith
Danny Halawi
Igor V. Ostrovsky
Lev McKinney
Stella Biderman
Jacob Steinhardt
412
305
0
14 Mar 2023
The geometry of hidden representations of large transformer models
Neural Information Processing Systems (NeurIPS), 2023
L. Valeriani
Diego Doimo
F. Cuturello
Alessandro Laio
A. Ansuini
Alberto Cazzaniga
MILM
271
78
0
01 Feb 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
International Conference on Learning Representations (ICLR), 2022
Kevin Wang
Alexandre Variengien
Arthur Conmy
Buck Shlegeris
Jacob Steinhardt
524
752
0
01 Nov 2022
Disentanglement with Biological Constraints: A Theory of Functional Cell Types
International Conference on Learning Representations (ICLR), 2022
James C. R. Whittington
W. Dorrell
Surya Ganguli
Timothy Edward John Behrens
234
63
0
30 Sep 2022
Toy Models of Superposition
Nelson Elhage
Tristan Hume
Catherine Olsson
Nicholas Schiefer
T. Henighan
...
Sam McCandlish
Jared Kaplan
Dario Amodei
Martin Wattenberg
C. Olah
AAML
MILM
1.2K
540
0
21 Sep 2022
Extracting Latent Steering Vectors from Pretrained Language Models
Findings of the Association for Computational Linguistics (Findings), 2022
Nishant Subramani
Nivedita Suresh
Matthew E. Peters
LLMSV
158
137
0
10 May 2022
Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors
Workshop on Knowledge Extraction and Integration for Deep Learning Architectures; Deep Learning Inside Out (DEELIO), 2021
Zeyu Yun
Yubei Chen
Bruno A. Olshausen
Yann LeCun
228
104
0
29 Mar 2021
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
790
2,508
0
31 Dec 2020
Measuring Massive Multitask Language Understanding
International Conference on Learning Representations (ICLR), 2020
Dan Hendrycks
Collin Burns
Steven Basart
Andy Zou
Mantas Mazeika
Basel Alomair
Jacob Steinhardt
ELM
RALM
1.3K
6,318
0
07 Sep 2020
Residual Connections Encourage Iterative Inference
Stanislaw Jastrzebski
Devansh Arpit
Nicolas Ballas
Vikas Verma
Tong Che
Yoshua Bengio
203
173
0
13 Oct 2017
Attention Is All You Need
Neural Information Processing Systems (NeurIPS), 2017
Ashish Vaswani
Noam M. Shazeer
Niki Parmar
Jakob Uszkoreit
Llion Jones
Aidan Gomez
Lukasz Kaiser
Illia Polosukhin
3DV
2.5K
158,328
0
12 Jun 2017
Zero-bias autoencoders and the benefits of co-adapting features
International Conference on Learning Representations (ICLR), 2014
K. Konda
Roland Memisevic
David M. Krueger
AI4CE
357
95
0
13 Feb 2014
k-Sparse Autoencoders
International Conference on Learning Representations (ICLR), 2013
Alireza Makhzani
Brendan J. Frey
452
501
0
19 Dec 2013