Sparse Autoencoders Find Highly Interpretable Features in Language Models

15 September 2023

Papers citing "Sparse Autoencoders Find Highly Interpretable Features in Language Models"

49 / 49 papers shown

Title
Demystifying optimized prompts in language models Rimon Melamed Lucas H. McCabe H. H. Huang 31 0 0 04 May 2025
Interpreting Multilingual and Document-Length Sensitive Relevance Computations in Neural Retrieval Models through Axiomatic Causal Interventions Oliver Savolainen Dur e Najaf Amjad Roxana Petcu AAML 14 0 0 04 May 2025
Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability -- The Strange Science Part I.ii Kola Ayonrinde Louis Jaburi XAI 52 1 0 02 May 2025
Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition Zhengfu He J. Wang Rui Lin Xuyang Ge Wentao Shu Qiong Tang J. Zhang Xipeng Qiu 68 0 0 29 Apr 2025
Improving Reasoning Performance in Large Language Models via Representation Engineering Bertram Højer Oliver Jarvis Stefan Heinrich LRM 67 1 0 28 Apr 2025
Decoding Vision Transformers: the Diffusion Steering Lens Ryota Takatsuki Sonia Joseph Ippei Fujisawa Ryota Kanai DiffM 23 0 0 18 Apr 2025
Towards Combinatorial Interpretability of Neural Computation Micah Adler Dan Alistarh Nir Shavit FAtt 78 1 0 10 Apr 2025
Effective Skill Unlearning through Intervention and Abstention Yongce Li Chung-En Sun Tsui-Wei Weng MU 43 0 0 27 Mar 2025
Revisiting End-To-End Sparse Autoencoder Training: A Short Finetune Is All You Need Adam Karvonen 26 0 0 21 Mar 2025
Representational Similarity via Interpretable Visual Concepts Neehar Kondapaneni Oisin Mac Aodha Pietro Perona DRL 57 0 0 19 Mar 2025
HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks Jiuding Sun Jing Huang Sidharth Baskaran Karel DÓosterlinck Christopher Potts Michael Sklar Atticus Geiger AI4CE 55 0 0 13 Mar 2025
Sparse Autoencoder as a Zero-Shot Classifier for Concept Erasing in Text-to-Image Diffusion Models Zhihua Tian Sirun Nan Ming Xu Shengfang Zhai Wenjie Qu Jian Liu Kui Ren Ruoxi Jia Jiaheng Zhang DiffM 65 1 0 12 Mar 2025
Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models Thomas Winninger Boussad Addad Katarzyna Kapusta AAML 59 0 0 08 Mar 2025
Finite State Automata Inside Transformers with Chain-of-Thought: A Mechanistic Study on State Tracking Yifan Zhang Wenyu Du Dongming Jin Jie Fu Zhi Jin LRM 46 0 0 27 Feb 2025
LLM-Microscope: Uncovering the Hidden Role of Punctuation in Context Memory of Transformers Anton Razzhigaev Matvey Mikhalchuk Temurbek Rahmatullaev Elizaveta Goncharova Polina Druzhinina Ivan V. Oseledets Andrey Kuznetsov 48 1 0 20 Feb 2025
The Knowledge Microscope: Features as Better Analytical Lenses than Neurons Yuheng Chen Pengfei Cao Kang Liu Jun Zhao 40 0 0 18 Feb 2025
Building Bridges, Not Walls -- Advancing Interpretability by Unifying Feature, Data, and Model Component Attribution Shichang Zhang Tessa Han Usha Bhalla Hima Lakkaraju FAtt 143 0 0 17 Feb 2025
The Complexity of Learning Sparse Superposed Features with Feedback Akash Kumar 50 0 0 08 Feb 2025
Universal Sparse Autoencoders: Interpretable Cross-Model Concept Alignment Harrish Thasarathan Julian Forsyth Thomas Fel M. Kowal Konstantinos G. Derpanis 81 7 0 06 Feb 2025
Modular Training of Neural Networks aids Interpretability Satvik Golechha Maheep Chaudhary Joan Velja Alessandro Abate Nandi Schoots 71 0 0 04 Feb 2025
SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders Bartosz Cywiñski Kamil Deja DiffM 52 6 0 29 Jan 2025
Large Language Models Share Representations of Latent Grammatical Concepts Across Typologically Diverse Languages Jannik Brinkmann Chris Wendler Christian Bartelt Aaron Mueller 38 9 0 10 Jan 2025
Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words Gouki Minegishi Hiroki Furuta Yusuke Iwasawa Y. Matsuo 40 1 0 09 Jan 2025
Graph-Aware Isomorphic Attention for Adaptive Dynamics in Transformers Markus J. Buehler AI4CE 33 1 0 04 Jan 2025
Tracking the Feature Dynamics in LLM Training: A Mechanistic Study Yang Xu Y. Wang Hao Wang 46 1 0 23 Dec 2024
Transformers Use Causal World Models in Maze-Solving Tasks Alex F Spies William Edwards Michael I. Ivanitskiy Adrians Skapars Tilman Rauker Katsumi Inoue A. Russo Murray Shanahan 84 1 0 16 Dec 2024
Compute Optimal Inference and Provable Amortisation Gap in Sparse Autoencoders Charles OÑeill David Klindt David Klindt 64 1 0 20 Nov 2024
JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit Zeqing He Zhibo Wang Zhixuan Chu Huiyu Xu Rui Zheng Kui Ren Chun Chen 36 3 0 17 Nov 2024
A Mechanistic Explanatory Strategy for XAI Marcin Rabiza 30 1 0 02 Nov 2024
Beyond Label Attention: Transparency in Language Models for Automated Medical Coding via Dictionary Learning John Wu David Wu Jimeng Sun 20 1 0 31 Oct 2024
Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering Yu Zhao Alessio Devoto Giwon Hong Xiaotang Du Aryo Pradipta Gema Hongru Wang Xuanli He Kam-Fai Wong Pasquale Minervini KELM LLMSV 27 14 0 21 Oct 2024
Decomposing The Dark Matter of Sparse Autoencoders Joshua Engels Logan Riggs Max Tegmark LLMSV 37 9 0 18 Oct 2024
Improving Instruction-Following in Language Models through Activation Steering Alessandro Stolfo Vidhisha Balachandran Safoora Yousefi Eric Horvitz Besmira Nushi LLMSV 40 13 0 15 Oct 2024
Analyzing (In)Abilities of SAEs via Formal Languages Abhinav Menon Manish Shrivastava David M. Krueger Ekdeep Singh Lubana 32 7 0 15 Oct 2024
The Geometry of Concepts: Sparse Autoencoder Feature Structure Yuxiao Li Eric J. Michaud David D. Baek Joshua Engels Xiaoqing Sun Max Tegmark 28 7 0 10 Oct 2024
Towards Interpreting Visual Information Processing in Vision-Language Models Clement Neo Luke Ong Philip H. S. Torr Mor Geva David M. Krueger Fazl Barez 78 6 0 09 Oct 2024
From Yes-Men to Truth-Tellers: Addressing Sycophancy in Large Language Models with Pinpoint Tuning Wei Chen Zhen Huang Liang Xie Binbin Lin Houqiang Li ... Deng Cai Yonggang Zhang Wenxiao Wang Xu Shen Jieping Ye 25 6 0 03 Sep 2024
Knowledge in Superposition: Unveiling the Failures of Lifelong Knowledge Editing for Large Language Models Chenhui Hu Pengfei Cao Yubo Chen Kang Liu Jun Zhao KELM 41 2 0 14 Aug 2024
Representing Rule-based Chatbots with Transformers Dan Friedman Abhishek Panigrahi Danqi Chen 44 1 0 15 Jul 2024
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models Daking Rai Yilun Zhou Shi Feng Abulhair Saparov Ziyu Yao 49 18 0 02 Jul 2024
Finding Transformer Circuits with Edge Pruning Adithya Bhaskar Alexander Wettig Dan Friedman Danqi Chen 44 14 0 24 Jun 2024
Do Not Design, Learn: A Trainable Scoring Function for Uncertainty Estimation in Generative LLMs D. Yaldiz Yavuz Faruk Bakman Baturalp Buyukates Chenyang Tao Anil Ramakrishna Dimitrios Dimitriadis Jieyu Zhao Salman Avestimehr 30 1 0 17 Jun 2024
Talking Heads: Understanding Inter-layer Communication in Transformer Language Models Jack Merullo Carsten Eickhoff Ellie Pavlick 38 2 0 13 Jun 2024
Knowledge Circuits in Pretrained Transformers Yunzhi Yao Ningyu Zhang Zekun Xi Meng Wang Ziwen Xu Shumin Deng Huajun Chen KELM 45 19 0 28 May 2024
KAN: Kolmogorov-Arnold Networks Ziming Liu Yixuan Wang Sachin Vaidya Fabian Ruehle James Halverson Marin Soljacic Thomas Y. Hou Max Tegmark 32 444 0 30 Apr 2024
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small Kevin Wang Alexandre Variengien Arthur Conmy Buck Shlegeris Jacob Steinhardt 205 486 0 01 Nov 2022
Toy Models of Superposition Nelson Elhage Tristan Hume Catherine Olsson Nicholas Schiefer T. Henighan ... Sam McCandlish Jared Kaplan Dario Amodei Martin Wattenberg C. Olah AAML MILM 117 314 0 21 Sep 2022
Linear Adversarial Concept Erasure Shauli Ravfogel Michael Twiton Yoav Goldberg Ryan Cotterell KELM 62 56 0 28 Jan 2022
The Pile: An 800GB Dataset of Diverse Text for Language Modeling Leo Gao Stella Biderman Sid Black Laurence Golding Travis Hoppe ... Horace He Anish Thite Noa Nabeshima Shawn Presser Connor Leahy AIMat 236 1,508 0 31 Dec 2020