2305.08809
Cited By
Interpretability at Scale: Identifying Causal Mechanisms in Alpaca
15 May 2023
Zhengxuan Wu
Atticus Geiger
Thomas Icard
Christopher Potts
Noah D. Goodman
MILM
Papers citing
"Interpretability at Scale: Identifying Causal Mechanisms in Alpaca"
50 / 69 papers shown
Title
Understanding In-context Learning of Addition via Activation Subspaces
Xinyan Hu
Kayo Yin
Michael I. Jordan
Jacob Steinhardt
Lijie Chen
51
0
0
08 May 2025
MIB: A Mechanistic Interpretability Benchmark
Aaron Mueller
Atticus Geiger
Sarah Wiegreffe
Dana Arad
Iván Arcuschin
...
Alessandro Stolfo
Martin Tutek
Amir Zur
David Bau
Yonatan Belinkov
41
1
0
17 Apr 2025
On the Effectiveness and Generalization of Race Representations for Debiasing High-Stakes Decisions
Dang Nguyen
Chenhao Tan
32
0
0
07 Apr 2025
Combining Causal Models for More Accurate Abstractions of Neural Networks
Theodora-Mara Pîslar
Sara Magliacane
Atticus Geiger
AI4CE
50
0
0
14 Mar 2025
HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks
Jiuding Sun
Jing Huang
Sidharth Baskaran
Karel D'Oosterlinck
Christopher Potts
Michael Sklar
Atticus Geiger
AI4CE
60
0
0
13 Mar 2025
How Transformers Solve Propositional Logic Problems: A Mechanistic Analysis
Guan Zhe Hong
Nishanth Dikkala
Enming Luo
Cyrus Rashtchian
Xin Wang
Rina Panigrahy
OffRL
LRM
NAI
29
0
0
06 Nov 2024
Causal Abstraction in Model Interpretability: A Compact Survey
Yihao Zhang
26
0
0
26 Oct 2024
Racing Thoughts: Explaining Contextualization Errors in Large Language Models
Michael A. Lepori
Michael Mozer
Asma Ghandeharioun
LRM
80
1
0
02 Oct 2024
GP-GPT: Large Language Model for Gene-Phenotype Mapping
Yanjun Lyu
Zihao Wu
Lu Zhang
Jing Zhang
Yiwei Li
...
Rongjie Liu
Chao Huang
Wentao Li
Tianming Liu
Dajiang Zhu
LM&MA
25
3
0
15 Sep 2024
Interpreting and Improving Large Language Models in Arithmetic Calculation
Wei Zhang
Chaoqun Wan
Yonggang Zhang
Yiu-ming Cheung
Xinmei Tian
Xu Shen
Jieping Ye
LRM
24
18
0
03 Sep 2024
Personality Alignment of Large Language Models
Minjun Zhu
Linyi Yang
Yue Zhang
ALM
57
5
0
21 Aug 2024
The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability
Aaron Mueller
Jannik Brinkmann
Millicent Li
Samuel Marks
Koyena Pal
...
Arnab Sen Sharma
Jiuding Sun
Eric Todd
David Bau
Yonatan Belinkov
CML
42
18
0
02 Aug 2024
XAI meets LLMs: A Survey of the Relation between Explainable AI and Large Language Models
Erik Cambria
Lorenzo Malandri
Fabio Mercorio
Navid Nobani
Andrea Seveso
48
11
0
21 Jul 2024
InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques
Rohan Gupta
Iván Arcuschin
Thomas Kwa
Adrià Garriga-Alonso
45
3
0
19 Jul 2024
Mechanistically Interpreting a Transformer-based 2-SAT Solver: An Axiomatic Approach
Nils Palumbo
Ravi Mangal
Zifan Wang
Saranya Vijayakumar
Corina S. Pasareanu
Somesh Jha
36
1
0
18 Jul 2024
NNsight and NDIF: Democratizing Access to Open-Weight Foundation Model Internals
Jaden Fiotto-Kaufman
Alexander R. Loftus
Eric Todd
Jannik Brinkmann
Caden Juang
...
Carla Brodley
Arjun Guha
Jonathan Bell
Byron C. Wallace
David Bau
29
2
0
18 Jul 2024
Missed Causes and Ambiguous Effects: Counterfactuals Pose Challenges for Interpreting Neural Networks
Aaron Mueller
CML
23
10
0
05 Jul 2024
Functional Faithfulness in the Wild: Circuit Discovery with Differentiable Computation Graph Pruning
Lei Yu
Jingcheng Niu
Zining Zhu
Gerald Penn
31
5
0
04 Jul 2024
Towards Compositionality in Concept Learning
Adam Stein
Aaditya Naik
Yinjun Wu
Mayur Naik
Eric Wong
CoGe
37
2
0
26 Jun 2024
Do LLMs dream of elephants (when told not to)? Latent concept association and associative memory in transformers
Yibo Jiang
Goutham Rajendran
Pradeep Ravikumar
Bryon Aragam
CLL
KELM
29
6
0
26 Jun 2024
Finding Transformer Circuits with Edge Pruning
Adithya Bhaskar
Alexander Wettig
Dan Friedman
Danqi Chen
58
16
0
24 Jun 2024
Beyond the Doors of Perception: Vision Transformers Represent Relations Between Objects
Michael A. Lepori
Alexa R. Tartaglini
Wai Keen Vong
Thomas Serre
Brenden Lake
Ellie Pavlick
34
2
0
22 Jun 2024
Distributional reasoning in LLMs: Parallel reasoning processes in multi-hop reasoning
Yuval Shalev
Amir Feder
Ariel Goldstein
LRM
32
4
0
19 Jun 2024
GPT-ology, Computational Models, Silicon Sampling: How should we think about LLMs in Cognitive Science?
Desmond C. Ong
44
3
0
13 Jun 2024
Learning Causal Abstractions of Linear Structural Causal Models
Riccardo Massidda
Sara Magliacane
Davide Bacciu
CML
45
2
0
01 Jun 2024
InversionView: A General-Purpose Method for Reading Information from Neural Activations
Xinting Huang
Madhur Panwar
Navin Goyal
Michael Hahn
26
3
0
27 May 2024
From Frege to chatGPT: Compositionality in language, cognition, and deep neural networks
Jacob Russin
Sam Whitman McGrath
Danielle J. Williams
Lotem Elber-Dorozko
AI4CE
61
3
0
24 May 2024
Can Language Models Explain Their Own Classification Behavior?
Dane Sherburn
Bilal Chughtai
Owain Evans
28
1
0
13 May 2024
Learned feature representations are biased by complexity, learning order, position, and more
Andrew Kyle Lampinen
Stephanie C. Y. Chan
Katherine Hermann
AI4CE
FaML
SSL
OOD
32
6
0
09 May 2024
A Philosophical Introduction to Language Models - Part II: The Way Forward
Raphael Milliere
Cameron Buckner
LRM
52
13
0
06 May 2024
What does the Knowledge Neuron Thesis Have to do with Knowledge?
Jingcheng Niu
Andrew Liu
Zining Zhu
Gerald Penn
36
30
0
03 May 2024
Mechanistic Interpretability for AI Safety -- A Review
Leonard Bereska
E. Gavves
AI4CE
38
111
0
22 Apr 2024
ReFT: Representation Finetuning for Language Models
Zhengxuan Wu
Aryaman Arora
Zheng Wang
Atticus Geiger
Daniel Jurafsky
Christopher D. Manning
Christopher Potts
OffRL
30
58
0
04 Apr 2024
AI and the Problem of Knowledge Collapse
Andrew J. Peterson
38
17
0
04 Apr 2024
From Explainable to Interpretable Deep Learning for Natural Language Processing in Healthcare: How Far from Reality?
Guangming Huang
Yingya Li
Shoaib Jameel
Yunfei Long
G. Papanastasiou
26
16
0
18 Mar 2024
Large Language Models and Causal Inference in Collaboration: A Survey
Xiaoyu Liu
Paiheng Xu
Junda Wu
Jiaxin Yuan
Yifan Yang
...
Haoliang Wang
Tong Yu
Julian McAuley
Wei Ai
Furong Huang
ELM
LRM
72
35
0
14 Mar 2024
How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning
Subhabrata Dutta
Joykirat Singh
Soumen Chakrabarti
Tanmoy Chakraborty
LRM
30
23
0
28 Feb 2024
Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking
Nikhil Prakash
Tamar Rott Shaham
Tal Haklay
Yonatan Belinkov
David Bau
41
52
0
22 Feb 2024
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
Aryaman Arora
Daniel Jurafsky
Christopher Potts
50
21
0
19 Feb 2024
Understanding Reasoning Ability of Language Models From the Perspective of Reasoning Paths Aggregation
Xinyi Wang
Alfonso Amayuelas
Kexun Zhang
Liangming Pan
Wenhu Chen
W. Wang
LRM
32
11
0
05 Feb 2024
Rethinking Interpretability in the Era of Large Language Models
Chandan Singh
J. Inala
Michel Galley
Rich Caruana
Jianfeng Gao
LRM
AI4CE
75
60
0
30 Jan 2024
A Reply to Makelov et al. (2023)'s "Interpretability Illusion" Arguments
Zhengxuan Wu
Atticus Geiger
Jing Huang
Aryaman Arora
Thomas F. Icard
Christopher Potts
Noah D. Goodman
28
6
0
23 Jan 2024
Are Language Models More Like Libraries or Like Librarians? Bibliotechnism, the Novel Reference Problem, and the Attitudes of LLMs
Harvey Lederman
Kyle Mahowald
16
10
0
10 Jan 2024
Successor Heads: Recurring, Interpretable Attention Heads In The Wild
Rhys Gould
Euan Ong
George Ogden
Arthur Conmy
LRM
8
44
0
14 Dec 2023
Look Before You Leap: A Universal Emergent Decomposition of Retrieval Tasks in Language Models
Alexandre Variengien
Eric Winsor
LRM
ReLM
74
10
0
13 Dec 2023
A Glitch in the Matrix? Locating and Detecting Language Model Grounding with Fakepedia
Giovanni Monea
Maxime Peyrard
Martin Josifoski
Vishrav Chaudhary
Jason Eisner
Emre Kiciman
Hamid Palangi
Barun Patra
Robert West
KELM
47
12
0
04 Dec 2023
Flexible Model Interpretability through Natural Language Model Editing
Karel D'Oosterlinck
Thomas Demeester
Chris Develder
Christopher Potts
MILM
KELM
10
0
0
17 Nov 2023
Uncovering Intermediate Variables in Transformers using Circuit Probing
Michael A. Lepori
Thomas Serre
Ellie Pavlick
70
7
0
07 Nov 2023
How do Language Models Bind Entities in Context?
Jiahai Feng
Jacob Steinhardt
9
34
0
26 Oct 2023
Towards a Mechanistic Interpretation of Multi-Step Reasoning Capabilities of Language Models
Yifan Hou
Jiaoda Li
Yu Fei
Alessandro Stolfo
Wangchunshu Zhou
Guangtao Zeng
Antoine Bosselut
Mrinmaya Sachan
LRM
30
39
0
23 Oct 2023