arXiv:2409.04185 (v3, latest)
Residual Stream Analysis with Multi-Layer SAEs
International Conference on Learning Representations (ICLR), 2024
6 September 2024
Tim Lawson
Lucy Farnik
Conor Houghton
Laurence Aitchison
Papers citing "Residual Stream Analysis with Multi-Layer SAEs" (42 papers)
Visual Exploration of Feature Relationships in Sparse Autoencoders with Curated Concepts
Xinyuan Yan
Shusen Liu
Kowshik Thopalli
Bei Wang
136
0
0
08 Nov 2025
UniFusion: Vision-Language Model as Unified Encoder in Image Generation
Kevin Li
Manuel Brack
Sudeep Katakol
Hareesh Ravi
Ajinkya Kale
120
2
0
14 Oct 2025
Activation Transport Operators
Andrzej Szablewski
Marek Masiak
80
0
0
24 Aug 2025
Beyond Linear Steering: Unified Multi-Attribute Control for Language Models
Narmeen Oozeer
Luke Marks
Fazl Barez
Amirali Abdullah
LLMSV
151
2
0
30 May 2025
Interpretability Illusions with Sparse Autoencoders: Evaluating Robustness of Concept Representations
Aaron Jiaxun Li
Suraj Srinivas
Usha Bhalla
Himabindu Lakkaraju
AAML
299
3
0
21 May 2025
Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations
Lucy Farnik
Tim Lawson
Conor Houghton
Laurence Aitchison
273
5
0
25 Feb 2025
Transformer Dynamics: A neuroscientific approach to interpretability of large language models
Jesseba Fernando
Grigori Guitchounts
AI4CE
213
3
0
17 Feb 2025
Steering Language Model Refusal with Sparse Autoencoders
Kyle O'Brien
David Majercak
Xavier Fernandes
Richard Edgar
Blake Bullwinkel
Jingya Chen
Harsha Nori
Dean Carignan
Eric Horvitz
Forough Poursabzi-Sangdeh
LLMSV
339
38
0
18 Nov 2024
Group-SAE: Efficient Training of Sparse Autoencoders for Large Language Models via Layer Groups
Davide Ghilardi
Federico Belotti
Marco Molinari
Tao Ma
Matteo Palmonari
201
9
0
28 Oct 2024
Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small
Maheep Chaudhary
Atticus Geiger
200
28
0
05 Sep 2024
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (BlackboxNLP), 2024
Tom Lieberum
Senthooran Rajamanoharan
Arthur Conmy
Lewis Smith
Nicolas Sonnerat
Vikrant Varma
János Kramár
Anca Dragan
Rohin Shah
Neel Nanda
263
217
0
09 Aug 2024
Gemma 2: Improving Open Language Models at a Practical Size
Gemma Team
Morgane Riviere
Shreya Pathak
Pier Giuseppe Sessa
Cassidy Hardin
...
Noah Fiedel
Armand Joulin
Kathleen Kenealy
Robert Dadashi
Alek Andreev
VLM
MoE
OSLM
589
1,490
0
31 Jul 2024
Relational Composition in Neural Networks: A Survey and Call to Action
Martin Wattenberg
Fernanda Viégas
CoGe
150
15
0
19 Jul 2024
Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders
Senthooran Rajamanoharan
Tom Lieberum
Nicolas Sonnerat
Arthur Conmy
Vikrant Varma
János Kramár
Neel Nanda
298
169
0
19 Jul 2024
The Remarkable Robustness of LLMs: Stages of Inference?
Vedang Lad
Wes Gurnee
Max Tegmark
414
81
0
27 Jun 2024
Interpreting Attention Layer Outputs with Sparse Autoencoders
Connor Kissane
Robert Krzyzanowski
Joseph Isaac Bloom
Arthur Conmy
Neel Nanda
MILM
207
35
0
25 Jun 2024
Transcoders Find Interpretable LLM Feature Circuits
Jacob Dunefsky
Philippe Chlenski
Neel Nanda
182
78
0
17 Jun 2024
Scaling and evaluating sparse autoencoders
Leo Gao
Tom Dupré la Tour
Henk Tillman
Gabriel Goh
Rajan Troll
Alec Radford
Ilya Sutskever
Jan Leike
Jeffrey Wu
232
283
0
06 Jun 2024
Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models
Charles O'Neill
Thang Bui
162
12
0
21 May 2024
Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning
Neural Information Processing Systems (NeurIPS), 2024
Dan Braun
Jordan K. Taylor
Nicholas Goldowsky-Dill
Lee D. Sharkey
272
52
0
17 May 2024
How does GPT-2 Predict Acronyms? Extracting and Understanding a Circuit via Mechanistic Interpretability
Jorge García-Carrasco
Alejandro Maté
Juan Trujillo
134
12
0
07 May 2024
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Samuel Marks
Can Rager
Eric J. Michaud
Yonatan Belinkov
David Bau
Aaron Mueller
467
238
0
28 Mar 2024
RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations
Jing Huang
Zhengxuan Wu
Christopher Potts
Mor Geva
Atticus Geiger
264
54
0
27 Feb 2024
Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic Interpretability: A Case Study on Othello-GPT
Zhengfu He
Xuyang Ge
Qiong Tang
Tianxiang Sun
Qinyuan Cheng
Xipeng Qiu
183
25
0
19 Feb 2024
The Linear Representation Hypothesis and the Geometry of Large Language Models
International Conference on Machine Learning (ICML), 2023
Kiho Park
Yo Joong Choe
Victor Veitch
LLMSV
MILM
425
310
0
07 Nov 2023
Sparse Autoencoders Find Highly Interpretable Features in Language Models
International Conference on Learning Representations (ICLR), 2023
Hoagy Cunningham
Aidan Ewart
Logan Riggs
R. Huben
Lee Sharkey
MILM
554
751
0
15 Sep 2023
Linearity of Relation Decoding in Transformer Language Models
International Conference on Learning Representations (ICLR), 2023
Evan Hernandez
Arnab Sen Sharma
Tal Haklay
Kevin Meng
Martin Wattenberg
Jacob Andreas
Yonatan Belinkov
David Bau
KELM
303
133
0
17 Aug 2023
Towards Automated Circuit Discovery for Mechanistic Interpretability
Neural Information Processing Systems (NeurIPS), 2023
Arthur Conmy
Augustine N. Mavor-Parker
Aengus Lynch
Stefan Heimersheim
Adrià Garriga-Alonso
477
432
0
28 Apr 2023
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
International Conference on Machine Learning (ICML), 2023
Stella Biderman
Hailey Schoelkopf
Quentin G. Anthony
Herbie Bradley
Kyle O'Brien
...
USVSN Sai Prashanth
Edward Raff
Aviya Skowron
Lintang Sutawika
Oskar van der Wal
352
1,597
0
03 Apr 2023
Eliciting Latent Predictions from Transformers with the Tuned Lens
Nora Belrose
Zach Furman
Logan Smith
Danny Halawi
Igor V. Ostrovsky
Lev McKinney
Stella Biderman
Jacob Steinhardt
412
305
0
14 Mar 2023
The geometry of hidden representations of large transformer models
Neural Information Processing Systems (NeurIPS), 2023
L. Valeriani
Diego Doimo
F. Cuturello
Alessandro Laio
A. Ansuini
Alberto Cazzaniga
MILM
271
78
0
01 Feb 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
International Conference on Learning Representations (ICLR), 2022
Kevin Wang
Alexandre Variengien
Arthur Conmy
Buck Shlegeris
Jacob Steinhardt
524
752
0
01 Nov 2022
Disentanglement with Biological Constraints: A Theory of Functional Cell Types
International Conference on Learning Representations (ICLR), 2022
James C. R. Whittington
W. Dorrell
Surya Ganguli
Timothy Edward John Behrens
234
63
0
30 Sep 2022
Toy Models of Superposition
Nelson Elhage
Tristan Hume
Catherine Olsson
Nicholas Schiefer
T. Henighan
...
Sam McCandlish
Jared Kaplan
Dario Amodei
Martin Wattenberg
C. Olah
AAML
MILM
1.2K
540
0
21 Sep 2022
Extracting Latent Steering Vectors from Pretrained Language Models
Findings of the Association for Computational Linguistics (Findings), 2022
Nishant Subramani
Nivedita Suresh
Matthew E. Peters
LLMSV
158
137
0
10 May 2022
Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors
Workshop on Knowledge Extraction and Integration for Deep Learning Architectures; Deep Learning Inside Out (DEELIO), 2021
Zeyu Yun
Yubei Chen
Bruno A. Olshausen
Yann LeCun
228
104
0
29 Mar 2021
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
790
2,508
0
31 Dec 2020
Measuring Massive Multitask Language Understanding
International Conference on Learning Representations (ICLR), 2020
Dan Hendrycks
Collin Burns
Steven Basart
Andy Zou
Mantas Mazeika
Basel Alomair
Jacob Steinhardt
ELM
RALM
1.3K
6,318
0
07 Sep 2020
Residual Connections Encourage Iterative Inference
Stanislaw Jastrzebski
Devansh Arpit
Nicolas Ballas
Vikas Verma
Tong Che
Yoshua Bengio
203
173
0
13 Oct 2017
Attention Is All You Need
Neural Information Processing Systems (NeurIPS), 2017
Ashish Vaswani
Noam M. Shazeer
Niki Parmar
Jakob Uszkoreit
Llion Jones
Aidan Gomez
Lukasz Kaiser
Illia Polosukhin
3DV
2.5K
158,328
0
12 Jun 2017
Zero-bias autoencoders and the benefits of co-adapting features
International Conference on Learning Representations (ICLR), 2014
K. Konda
Roland Memisevic
David M. Krueger
AI4CE
357
95
0
13 Feb 2014
k-Sparse Autoencoders
International Conference on Learning Representations (ICLR), 2013
Alireza Makhzani
Brendan J. Frey
452
501
0
19 Dec 2013