Interpretability at Scale: Identifying Causal Mechanisms in Alpaca

15 May 2023

Papers citing "Interpretability at Scale: Identifying Causal Mechanisms in Alpaca"

50 / 69 papers shown

Understanding In-context Learning of Addition via Activation Subspaces Xinyan Hu Kayo Yin Michael I. Jordan Jacob Steinhardt Lijie Chen 51 0 0 08 May 2025
MIB: A Mechanistic Interpretability Benchmark Aaron Mueller Atticus Geiger Sarah Wiegreffe Dana Arad Iván Arcuschin ... Alessandro Stolfo Martin Tutek Amir Zur David Bau Yonatan Belinkov 41 1 0 17 Apr 2025
On the Effectiveness and Generalization of Race Representations for Debiasing High-Stakes Decisions Dang Nguyen Chenhao Tan 32 0 0 07 Apr 2025
Combining Causal Models for More Accurate Abstractions of Neural Networks Theodora-Mara Pîslar Sara Magliacane Atticus Geiger AI4CE 50 0 0 14 Mar 2025
HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks Jiuding Sun Jing Huang Sidharth Baskaran Karel D'Oosterlinck Christopher Potts Michael Sklar Atticus Geiger AI4CE 60 0 0 13 Mar 2025
How Transformers Solve Propositional Logic Problems: A Mechanistic Analysis Guan Zhe Hong Nishanth Dikkala Enming Luo Cyrus Rashtchian Xin Wang Rina Panigrahy OffRL LRM NAI 29 0 0 06 Nov 2024
Causal Abstraction in Model Interpretability: A Compact Survey Yihao Zhang 26 0 0 26 Oct 2024
Racing Thoughts: Explaining Contextualization Errors in Large Language Models Michael A. Lepori Michael Mozer Asma Ghandeharioun LRM 80 1 0 02 Oct 2024
GP-GPT: Large Language Model for Gene-Phenotype Mapping Yanjun Lyu Zihao Wu Lu Zhang Jing Zhang Yiwei Li ... Rongjie Liu Chao Huang Wentao Li Tianming Liu Dajiang Zhu LM&MA 25 3 0 15 Sep 2024
Interpreting and Improving Large Language Models in Arithmetic Calculation Wei Zhang Chaoqun Wan Yonggang Zhang Yiu-ming Cheung Xinmei Tian Xu Shen Jieping Ye LRM 24 18 0 03 Sep 2024
Personality Alignment of Large Language Models Minjun Zhu Linyi Yang Yue Zhang Yue Zhang ALM 57 5 0 21 Aug 2024
The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability Aaron Mueller Jannik Brinkmann Millicent Li Samuel Marks Koyena Pal ... Arnab Sen Sharma Jiuding Sun Eric Todd David Bau Yonatan Belinkov CML 42 18 0 02 Aug 2024
XAI meets LLMs: A Survey of the Relation between Explainable AI and Large Language Models Erik Cambria Lorenzo Malandri Fabio Mercorio Navid Nobani Andrea Seveso 48 11 0 21 Jul 2024
InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques Rohan Gupta Iván Arcuschin Thomas Kwa Adrià Garriga-Alonso 45 3 0 19 Jul 2024
Mechanistically Interpreting a Transformer-based 2-SAT Solver: An Axiomatic Approach Nils Palumbo Ravi Mangal Zifan Wang Saranya Vijayakumar Corina S. Pasareanu Somesh Jha 36 1 0 18 Jul 2024
NNsight and NDIF: Democratizing Access to Open-Weight Foundation Model Internals Jaden Fiotto-Kaufman Alexander R. Loftus Eric Todd Jannik Brinkmann Caden Juang ... Carla Brodley Arjun Guha Jonathan Bell Byron C. Wallace David Bau 29 2 0 18 Jul 2024
Missed Causes and Ambiguous Effects: Counterfactuals Pose Challenges for Interpreting Neural Networks Aaron Mueller CML 23 10 0 05 Jul 2024
Functional Faithfulness in the Wild: Circuit Discovery with Differentiable Computation Graph Pruning Lei Yu Jingcheng Niu Zining Zhu Gerald Penn 31 5 0 04 Jul 2024
Towards Compositionality in Concept Learning Adam Stein Aaditya Naik Yinjun Wu Mayur Naik Eric Wong CoGe 37 2 0 26 Jun 2024
Do LLMs dream of elephants (when told not to)? Latent concept association and associative memory in transformers Yibo Jiang Goutham Rajendran Pradeep Ravikumar Bryon Aragam CLL KELM 29 6 0 26 Jun 2024
Finding Transformer Circuits with Edge Pruning Adithya Bhaskar Alexander Wettig Dan Friedman Danqi Chen 58 16 0 24 Jun 2024
Beyond the Doors of Perception: Vision Transformers Represent Relations Between Objects Michael A. Lepori Alexa R. Tartaglini Wai Keen Vong Thomas Serre Brenden Lake Ellie Pavlick 34 2 0 22 Jun 2024
Distributional reasoning in LLMs: Parallel reasoning processes in multi-hop reasoning Yuval Shalev Amir Feder Ariel Goldstein LRM 32 4 0 19 Jun 2024
GPT-ology, Computational Models, Silicon Sampling: How should we think about LLMs in Cognitive Science? Desmond C. Ong 44 3 0 13 Jun 2024
Learning Causal Abstractions of Linear Structural Causal Models Riccardo Massidda Sara Magliacane Davide Bacciu CML 45 2 0 01 Jun 2024
InversionView: A General-Purpose Method for Reading Information from Neural Activations Xinting Huang Madhur Panwar Navin Goyal Michael Hahn 26 3 0 27 May 2024
From Frege to chatGPT: Compositionality in language, cognition, and deep neural networks Jacob Russin Sam Whitman McGrath Danielle J. Williams Lotem Elber-Dorozko AI4CE 61 3 0 24 May 2024
Can Language Models Explain Their Own Classification Behavior? Dane Sherburn Bilal Chughtai Owain Evans 28 1 0 13 May 2024
Learned feature representations are biased by complexity, learning order, position, and more Andrew Kyle Lampinen Stephanie C. Y. Chan Katherine Hermann AI4CE FaML SSL OOD 32 6 0 09 May 2024
A Philosophical Introduction to Language Models - Part II: The Way Forward Raphael Milliere Cameron Buckner LRM 52 13 0 06 May 2024
What does the Knowledge Neuron Thesis Have to do with Knowledge? Jingcheng Niu Andrew Liu Zining Zhu Gerald Penn 36 30 0 03 May 2024
Mechanistic Interpretability for AI Safety -- A Review Leonard Bereska E. Gavves AI4CE 38 111 0 22 Apr 2024
ReFT: Representation Finetuning for Language Models Zhengxuan Wu Aryaman Arora Zheng Wang Atticus Geiger Daniel Jurafsky Christopher D. Manning Christopher Potts OffRL 30 58 0 04 Apr 2024
AI and the Problem of Knowledge Collapse Andrew J. Peterson 38 17 0 04 Apr 2024
From Explainable to Interpretable Deep Learning for Natural Language Processing in Healthcare: How Far from Reality? Guangming Huang Yingya Li Shoaib Jameel Yunfei Long G. Papanastasiou 26 16 0 18 Mar 2024
Large Language Models and Causal Inference in Collaboration: A Survey Xiaoyu Liu Paiheng Xu Junda Wu Jiaxin Yuan Yifan Yang ... Haoliang Wang Tong Yu Julian McAuley Wei Ai Furong Huang ELM LRM 72 35 0 14 Mar 2024
How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning Subhabrata Dutta Joykirat Singh Soumen Chakrabarti Tanmoy Chakraborty LRM 30 23 0 28 Feb 2024
Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking Nikhil Prakash Tamar Rott Shaham Tal Haklay Yonatan Belinkov David Bau 41 52 0 22 Feb 2024
CausalGym: Benchmarking causal interpretability methods on linguistic tasks Aryaman Arora Daniel Jurafsky Christopher Potts 50 21 0 19 Feb 2024
Understanding Reasoning Ability of Language Models From the Perspective of Reasoning Paths Aggregation Xinyi Wang Alfonso Amayuelas Kexun Zhang Liangming Pan Wenhu Chen W. Wang LRM 32 11 0 05 Feb 2024
Rethinking Interpretability in the Era of Large Language Models Chandan Singh J. Inala Michel Galley Rich Caruana Jianfeng Gao LRM AI4CE 75 60 0 30 Jan 2024
A Reply to Makelov et al. (2023)'s "Interpretability Illusion" Arguments Zhengxuan Wu Atticus Geiger Jing-ling Huang Aryaman Arora Thomas F. Icard Christopher Potts Noah D. Goodman 28 6 0 23 Jan 2024
Are Language Models More Like Libraries or Like Librarians? Bibliotechnism, the Novel Reference Problem, and the Attitudes of LLMs Harvey Lederman Kyle Mahowald 16 10 0 10 Jan 2024
Successor Heads: Recurring, Interpretable Attention Heads In The Wild Rhys Gould Euan Ong George Ogden Arthur Conmy LRM 8 44 0 14 Dec 2023
Look Before You Leap: A Universal Emergent Decomposition of Retrieval Tasks in Language Models Alexandre Variengien Eric Winsor LRM ReLM 74 10 0 13 Dec 2023
A Glitch in the Matrix? Locating and Detecting Language Model Grounding with Fakepedia Giovanni Monea Maxime Peyrard Martin Josifoski Vishrav Chaudhary Jason Eisner Emre Kiciman Hamid Palangi Barun Patra Robert West KELM 47 12 0 04 Dec 2023
Flexible Model Interpretability through Natural Language Model Editing Karel D'Oosterlinck Thomas Demeester Chris Develder Christopher Potts MILM KELM 10 0 0 17 Nov 2023
Uncovering Intermediate Variables in Transformers using Circuit Probing Michael A. Lepori Thomas Serre Ellie Pavlick 70 7 0 07 Nov 2023
How do Language Models Bind Entities in Context? Jiahai Feng Jacob Steinhardt 9 34 0 26 Oct 2023
Towards a Mechanistic Interpretation of Multi-Step Reasoning Capabilities of Language Models Yifan Hou Jiaoda Li Yu Fei Alessandro Stolfo Wangchunshu Zhou Guangtao Zeng Antoine Bosselut Mrinmaya Sachan LRM 30 39 0 23 Oct 2023