A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models

2 July 2024

Papers citing "A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models"


In-Context Learning can distort the relationship between sequence likelihoods and biological fitness Pranav Kantroo Günter P. Wagner Benjamin B. Machta 25 0 0 23 Apr 2025
Understanding the Skill Gap in Recurrent Language Models: The Role of the Gather-and-Aggregate Mechanism Aviv Bick Eric P. Xing Albert Gu RALM 81 0 0 22 Apr 2025
Layers at Similar Depths Generate Similar Activations Across LLM Architectures Christopher Wolfram Aaron Schein 18 0 0 03 Apr 2025
Reasoning about Affordances: Causal and Compositional Reasoning in LLMs Magnus F. Gjerde Vanessa Cheung David Lagnado ReLM LRM 39 0 0 23 Feb 2025
SAE-V: Interpreting Multimodal Models for Enhanced Alignment Hantao Lou Changye Li Jiaming Ji Yaodong Yang 34 0 0 22 Feb 2025
An explainable transformer circuit for compositional generalization Cheng Tang Brenden Lake Mehrdad Jazayeri LRM 33 0 0 19 Feb 2025
Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis X. Wang Yan Hu Wenyu Du Reynold Cheng Benyou Wang Difan Zou 48 0 0 17 Feb 2025
Deciphering Functions of Neurons in Vision-Language Models Jiaqi Xu Cuiling Lan Xuejin Chen Yan Lu VLM 61 0 0 10 Feb 2025
Interpretable Language Modeling via Induction-head Ngram Models Eunji Kim Sriya Mantena Weiwei Yang Chandan Singh Sungroh Yoon Jianfeng Gao 29 0 0 31 Oct 2024
Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders Viacheslav Surkov Chris Wendler Mikhail Terekhov Justin Deschenaux Robert West Çağlar Gülçehre VLM 21 12 0 28 Oct 2024
Enforcing Interpretability in Time Series Transformers: A Concept Bottleneck Framework Angela van Sprang Erman Acar Willem Zuidema AI4TS 22 1 0 08 Oct 2024
System 2 Reasoning Capabilities Are Nigh Scott C. Lowe VLM LRM 19 0 0 04 Oct 2024
Listening to the Wise Few: Select-and-Copy Attention Heads for Multiple-Choice QA Eduard Tulchinskii Laida Kushnareva Kristian Kuznetsov Anastasia Voznyuk Andrei Andriiainen Irina Piontkovskaya Evgeny Burnaev Serguei Barannikov 51 1 0 03 Oct 2024
Locating and Editing Factual Associations in Mamba Arnab Sen Sharma David Atkinson David Bau KELM 62 16 0 04 Apr 2024
AtP*: An efficient and scalable method for localizing LLM behaviour to components János Kramár Tom Lieberum Rohin Shah Neel Nanda KELM 28 40 0 01 Mar 2024
RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations Jing Huang Zhengxuan Wu Christopher Potts Mor Geva Atticus Geiger 38 24 0 27 Feb 2024
Unified View of Grokking, Double Descent and Emergent Abilities: A Perspective from Circuits Competition Yufei Huang Shengding Hu Xu Han Zhiyuan Liu Maosong Sun 50 6 0 23 Feb 2024
Increasing Trust in Language Models through the Reuse of Verified Circuits Philip Quirke Clement Neo Fazl Barez KELM LRM 23 3 0 04 Feb 2024
Universal Neurons in GPT2 Language Models Wes Gurnee Theo Horsley Zifan Carl Guo Tara Rezaei Kheirkhah Qinyi Sun Will Hathaway Neel Nanda Dimitris Bertsimas MILM 75 37 0 22 Jan 2024
Attribution Patching Outperforms Automated Circuit Discovery Aaquib Syed Can Rager Arthur Conmy 39 18 0 16 Oct 2023
Finding Neurons in a Haystack: Case Studies with Sparse Probing Wes Gurnee Neel Nanda Matthew Pauly Katherine Harvey Dmitrii Troitskii Dimitris Bertsimas MILM 150 170 0 02 May 2023
Dissecting Recall of Factual Associations in Auto-Regressive Language Models Mor Geva Jasmijn Bastings Katja Filippova Amir Globerson KELM 180 152 0 28 Apr 2023
Sparks of Artificial General Intelligence: Early experiments with GPT-4 Sébastien Bubeck Varun Chandrasekaran Ronen Eldan J. Gehrke Eric Horvitz ... Scott M. Lundberg Harsha Nori Hamid Palangi Marco Tulio Ribeiro Yi Zhang ELM AI4MH AI4CE ALM 197 2,232 0 22 Mar 2023
Crawling the Internal Knowledge-Base of Language Models Roi Cohen Mor Geva Jonathan Berant Amir Globerson 162 74 0 30 Jan 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small Kevin Wang Alexandre Variengien Arthur Conmy Buck Shlegeris Jacob Steinhardt 205 486 0 01 Nov 2022
In-context Learning and Induction Heads Catherine Olsson Nelson Elhage Neel Nanda Nicholas Joseph Nova DasSarma ... Tom B. Brown Jack Clark Jared Kaplan Sam McCandlish C. Olah 232 453 0 24 Sep 2022
Text and Patterns: For Effective Chain of Thought, It Takes Two to Tango Aman Madaan Amir Yazdanbakhsh LRM 130 94 0 16 Sep 2022
Language Models as Knowledge Bases? Fabio Petroni Tim Rocktäschel Patrick Lewis A. Bakhtin Yuxiang Wu Alexander H. Miller Sebastian Riedel KELM AI4MH 386 2,216 0 03 Sep 2019
What you can cram into a single vector: Probing sentence embeddings for linguistic properties Alexis Conneau Germán Kruszewski Guillaume Lample Loïc Barrault Marco Baroni 196 876 0 03 May 2018
Towards A Rigorous Science of Interpretable Machine Learning Finale Doshi-Velez Been Kim XAI FaML 219 2,098 0 28 Feb 2017