2305.15054
A Mechanistic Interpretation of Arithmetic Reasoning in Language Models using Causal Mediation Analysis
24 May 2023
Alessandro Stolfo
Yonatan Belinkov
Mrinmaya Sachan
MILM
KELM
LRM
Papers citing
"A Mechanistic Interpretation of Arithmetic Reasoning in Language Models using Causal Mediation Analysis"
44 / 44 papers shown
Title
Understanding In-context Learning of Addition via Activation Subspaces
Xinyan Hu
Kayo Yin
Michael I. Jordan
Jacob Steinhardt
Lijie Chen
49
0
0
08 May 2025
Bigram Subnetworks: Mapping to Next Tokens in Transformer Language Models
Tyler A. Chang
Benjamin Bergen
46
0
0
21 Apr 2025
MIB: A Mechanistic Interpretability Benchmark
Aaron Mueller
Atticus Geiger
Sarah Wiegreffe
Dana Arad
Iván Arcuschin
...
Alessandro Stolfo
Martin Tutek
Amir Zur
David Bau
Yonatan Belinkov
41
1
0
17 Apr 2025
HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks
Jiuding Sun
Jing Huang
Sidharth Baskaran
Karel D'Oosterlinck
Christopher Potts
Michael Sklar
Atticus Geiger
AI4CE
60
0
0
13 Mar 2025
Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis
X. Wang
Yan Hu
Wenyu Du
Reynold Cheng
Benyou Wang
Difan Zou
51
0
0
17 Feb 2025
What is a Number, That a Large Language Model May Know It?
Raja Marjieh
Veniamin Veselovsky
Thomas L. Griffiths
Ilia Sucholutsky
111
2
0
03 Feb 2025
Think or Remember? Detecting and Directing LLMs Towards Memorization or Generalization
Yi-Fu Fu
Yu-Chieh Tu
Tzu-Ling Cheng
Cheng-Yu Lin
Yi-Ting Yang
Heng-Yi Liu
Keng-Te Liao
Da-Cheng Juan
Shou-de Lin
41
0
0
24 Dec 2024
Adaptive Circuit Behavior and Generalization in Mechanistic Interpretability
Jatin Nainani
Sankaran Vaidyanathan
AJ Yeung
Kartik Gupta
David Jensen
AI4CE
71
0
0
25 Nov 2024
Unraveling Arithmetic in Large Language Models: The Role of Algebraic Structures
Fu-Chieh Chang
Pei-Yuan Wu
LRM
101
1
0
25 Nov 2024
Understanding Multimodal LLMs: the Mechanistic Interpretability of Llava in Visual Question Answering
Zeping Yu
Sophia Ananiadou
91
0
0
17 Nov 2024
Information Anxiety in Large Language Models
Prasoon Bajpai
Sarah Masud
Tanmoy Chakraborty
37
0
0
16 Nov 2024
Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics
Yaniv Nikankin
Anja Reusch
Aaron Mueller
Yonatan Belinkov
AIFin
LRM
33
21
0
28 Oct 2024
Looking Beyond The Top-1: Transformers Determine Top Tokens In Order
Daria Lioubashevski
Tomer Schlank
Gabriel Stanovsky
Ariel Goldstein
29
1
0
26 Oct 2024
On the Role of Attention Heads in Large Language Model Safety
Z. Zhou
Haiyang Yu
Xinghua Zhang
Rongwu Xu
Fei Huang
Kun Wang
Yang Liu
Junfeng Fang
Yongbin Li
57
5
0
17 Oct 2024
AERO: Softmax-Only LLMs for Efficient Private Inference
N. Jha
Brandon Reagen
27
1
0
16 Oct 2024
MIRAGE: Evaluating and Explaining Inductive Reasoning Process in Language Models
Jiachun Li
Pengfei Cao
Zhuoran Jin
Yubo Chen
Kang-Jun Liu
Jun Zhao
LRM
ELM
32
4
0
12 Oct 2024
Unlearning-based Neural Interpretations
Ching Lam Choi
Alexandre Duplessis
Serge Belongie
FAtt
42
0
0
10 Oct 2024
Towards Interpreting Visual Information Processing in Vision-Language Models
Clement Neo
Luke Ong
Philip H. S. Torr
Mor Geva
David M. Krueger
Fazl Barez
84
6
0
09 Oct 2024
Listening to the Wise Few: Select-and-Copy Attention Heads for Multiple-Choice QA
Eduard Tulchinskii
Laida Kushnareva
Kristian Kuznetsov
Anastasia Voznyuk
Andrei Andriiainen
Irina Piontkovskaya
Evgeny Burnaev
Serguei Barannikov
65
1
0
03 Oct 2024
Optimal ablation for interpretability
Maximilian Li
Lucas Janson
FAtt
44
2
0
16 Sep 2024
Attention Heads of Large Language Models: A Survey
Zifan Zheng
Yezhaohui Wang
Yuxin Huang
Shichao Song
Mingchuan Yang
Bo Tang
Feiyu Xiong
Zhiyu Li
LRM
52
21
0
05 Sep 2024
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
Tom Lieberum
Senthooran Rajamanoharan
Arthur Conmy
Lewis Smith
Nicolas Sonnerat
Vikrant Varma
János Kramár
Anca Dragan
Rohin Shah
Neel Nanda
16
81
0
09 Aug 2024
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
Daking Rai
Yilun Zhou
Shi Feng
Abulhair Saparov
Ziyu Yao
75
18
0
02 Jul 2024
$\text{Memory}^3$: Language Modeling with Explicit Memory
Hongkang Yang
Zehao Lin
Wenjin Wang
Hao Wu
Zhiyu Li
...
Yu Yu
Kai Chen
Feiyu Xiong
Linpeng Tang
Weinan E
48
11
0
01 Jul 2024
Transformer Normalisation Layers and the Independence of Semantic Subspaces
S. Menary
Samuel Kaski
Andre Freitas
41
2
0
25 Jun 2024
Confidence Regulation Neurons in Language Models
Alessandro Stolfo
Ben Wu
Wes Gurnee
Yonatan Belinkov
Xingyi Song
Mrinmaya Sachan
Neel Nanda
29
12
0
24 Jun 2024
Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces
Yihuai Hong
Lei Yu
Shauli Ravfogel
Haiqin Yang
Mor Geva
KELM
MU
58
17
0
17 Jun 2024
InversionView: A General-Purpose Method for Reading Information from Neural Activations
Xinting Huang
Madhur Panwar
Navin Goyal
Michael Hahn
26
3
0
27 May 2024
How to use and interpret activation patching
Stefan Heimersheim
Neel Nanda
19
36
0
23 Apr 2024
Mechanistic Interpretability for AI Safety -- A Review
Leonard Bereska
E. Gavves
AI4CE
38
111
0
22 Apr 2024
Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms
Michael Hanna
Sandro Pezzelle
Yonatan Belinkov
51
34
0
26 Mar 2024
AtP*: An efficient and scalable method for localizing LLM behaviour to components
János Kramár
Tom Lieberum
Rohin Shah
Neel Nanda
KELM
43
42
0
01 Mar 2024
Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs
Aaditya K. Singh
DJ Strouse
38
46
0
22 Feb 2024
Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models
Asma Ghandeharioun
Avi Caciularu
Adam Pearce
Lucas Dixon
Mor Geva
25
87
0
11 Jan 2024
Arithmetic with Language Models: from Memorization to Computation
Davide Maltoni
Matteo Ferrara
KELM
LRM
22
4
0
02 Aug 2023
Finding Neurons in a Haystack: Case Studies with Sparse Probing
Wes Gurnee
Neel Nanda
Matthew Pauly
Katherine Harvey
Dmitrii Troitskii
Dimitris Bertsimas
MILM
153
186
0
02 May 2023
Dissecting Recall of Factual Associations in Auto-Regressive Language Models
Mor Geva
Jasmijn Bastings
Katja Filippova
Amir Globerson
KELM
189
261
0
28 Apr 2023
Sparks of Artificial General Intelligence: Early experiments with GPT-4
Sébastien Bubeck
Varun Chandrasekaran
Ronen Eldan
J. Gehrke
Eric Horvitz
...
Scott M. Lundberg
Harsha Nori
Hamid Palangi
Marco Tulio Ribeiro
Yi Zhang
ELM
AI4MH
AI4CE
ALM
239
2,232
0
22 Mar 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang
Alexandre Variengien
Arthur Conmy
Buck Shlegeris
Jacob Steinhardt
210
491
0
01 Nov 2022
In-context Learning and Induction Heads
Catherine Olsson
Nelson Elhage
Neel Nanda
Nicholas Joseph
Nova DasSarma
...
Tom B. Brown
Jack Clark
Jared Kaplan
Sam McCandlish
C. Olah
240
456
0
24 Sep 2022
Large Language Models are Zero-Shot Reasoners
Takeshi Kojima
S. Gu
Machel Reid
Yutaka Matsuo
Yusuke Iwasawa
ReLM
LRM
291
4,048
0
24 May 2022
Investigating Numeracy Learning Ability of a Text-to-Text Transfer Model
Kuntal Kumar Pal
Chitta Baral
85
18
0
10 Sep 2021
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
248
1,986
0
31 Dec 2020
Language Models as Knowledge Bases?
Fabio Petroni
Tim Rocktäschel
Patrick Lewis
A. Bakhtin
Yuxiang Wu
Alexander H. Miller
Sebastian Riedel
KELM
AI4MH
406
2,576
0
03 Sep 2019