2309.16042
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods
International Conference on Learning Representations (ICLR), 2023
27 September 2023
Fred Zhang
Neel Nanda
LLMSV
Papers citing
"Towards Best Practices of Activation Patching in Language Models: Metrics and Methods"
50 / 127 papers shown
Title
Combining Causal Models for More Accurate Abstractions of Neural Networks
CLEaR (CLEaR), 2025
Theodora-Mara Pîslar
Sara Magliacane
Atticus Geiger
AI4CE
232
1
0
14 Mar 2025
Implicit Reasoning in Transformers is Reasoning through Shortcuts
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Tianhe Lin
Jian Xie
Siyu Yuan
Deqing Yang
ReLM
LRM
374
7
0
10 Mar 2025
(How) Do Language Models Track State?
Belinda Z. Li
Zifan Carl Guo
Jacob Andreas
LRM
383
9
0
04 Mar 2025
Neuroplasticity and Corruption in Model Mechanisms: A Case Study Of Indirect Object Identification
North American Chapter of the Association for Computational Linguistics (NAACL), 2025
Vishnu Kabir Chhabra
Ding Zhu
Mohammad Mahdi Khalili
289
5
0
27 Feb 2025
Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Michael Y. Hu
Jackson Petty
Chuan Shi
William Merrill
Tal Linzen
AI4CE
333
5
0
26 Feb 2025
Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations
Lucy Farnik
Tim Lawson
Conor Houghton
Laurence Aitchison
293
5
0
25 Feb 2025
Representation Engineering for Large-Language Models: Survey and Research Challenges
Lukasz Bartoszcze
Sarthak Munshi
Bryan Sukidi
Jennifer Yen
Zejia Yang
David Williams-King
Linh Le
Kosi Asuzu
Carsten Maple
348
4
0
24 Feb 2025
Quantifying Logical Consistency in Transformers via Query-Key Alignment
Eduard Tulchinskii
Anastasia Voznyuk
Laida Kushnareva
Andrei Andriiainen
Irina Piontkovskaya
Evgeny Burnaev
Serguei Barannikov
LRM
284
0
0
24 Feb 2025
The Validation Gap: A Mechanistic Analysis of How Language Models Compute Arithmetic but Fail to Validate It
Leonardo Bertolazzi
Philipp Mondorf
Yun Xue
Raffaella Bernardi
AIFin
LRM
433
1
0
17 Feb 2025
Exploring Translation Mechanism of Large Language Models
Hongbin Zhang
Kehai Chen
Xuefeng Bai
Xiucheng Li
Yang Xiang
Min Zhang
342
2
0
17 Feb 2025
Back Attention: Understanding and Enhancing Multi-Hop Reasoning in Large Language Models
Zeping Yu
Yonatan Belinkov
Sophia Ananiadou
LRM
206
10
0
15 Feb 2025
Padding Tone: A Mechanistic Analysis of Padding Tokens in T2I Models
North American Chapter of the Association for Computational Linguistics (NAACL), 2025
Michael Toker
Ido Galil
Hadas Orgad
Rinon Gal
Yoad Tewel
Gal Chechik
Yonatan Belinkov
DiffM
191
5
0
12 Jan 2025
Reversed Attention: On The Gradient Descent Of Attention Layers In GPT
Shahar Katz
Lior Wolf
100
0
0
22 Dec 2024
Think-to-Talk or Talk-to-Think? When LLMs Come Up with an Answer in Multi-Hop Arithmetic Reasoning
Keito Kudo
Yoichi Aoki
Tatsuki Kuribayashi
Shusaku Sone
Masaya Taniguchi
Ana Brassard
Keisuke Sakaguchi
Kentaro Inui
ReLM
LRM
354
0
0
02 Dec 2024
Understanding Multimodal LLMs: the Mechanistic Interpretability of Llava in Visual Question Answering
Zeping Yu
Sophia Ananiadou
998
8
0
17 Nov 2024
JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit
Zeqing He
Peng Kuang
Zhixuan Chu
Huiyu Xu
Rui Zheng
Kui Ren
Chun Chen
352
16
0
17 Nov 2024
Towards Unifying Interpretability and Control: Evaluation via Intervention
Usha Bhalla
Suraj Srinivas
Asma Ghandeharioun
Himabindu Lakkaraju
343
17
0
07 Nov 2024
A Implies B: Circuit Analysis in LLMs for Propositional Logical Reasoning
Guan Zhe Hong
Nishanth Dikkala
Enming Luo
Cyrus Rashtchian
Xin Wang
Rina Panigrahy
OffRL
LRM
NAI
344
0
0
06 Nov 2024
Do Mice Grok? Glimpses of Hidden Progress During Overtraining in Sensory Cortex
Tanishq Kumar
Blake Bordelon
Cengiz Pehlevan
Venkatesh N. Murthy
Samuel Gershman
OOD
CLL
SSL
319
0
0
05 Nov 2024
SPES: Spectrogram Perturbation for Explainable Speech-to-Text Generation
Dennis Fucci
Marco Gaido
Beatrice Savoldi
Matteo Negri
Mauro Cettolo
L. Bentivogli
531
5
0
03 Nov 2024
Abrupt Learning in Transformers: A Case Study on Matrix Completion
Neural Information Processing Systems (NeurIPS), 2024
Pulkit Gopalani
Ekdeep Singh Lubana
Wei Hu
159
7
0
29 Oct 2024
On the Role of Attention Heads in Large Language Model Safety
International Conference on Learning Representations (ICLR), 2024
Zhenhong Zhou
Haiyang Yu
Xinghua Zhang
Rongwu Xu
Fei Huang
Kun Wang
Yang Liu
Cunchun Li
Yongbin Li
408
36
0
17 Oct 2024
Hypothesis Testing the Circuit Hypothesis in LLMs
Neural Information Processing Systems (NeurIPS), 2024
Claudia Shi
Nicolas Beltran-Velez
Achille Nazaret
Carolina Zheng
Adrià Garriga-Alonso
Andrew Jesson
Maggie Makar
David M. Blei
229
18
0
16 Oct 2024
From Tokens to Words: On the Inner Lexicon of LLMs
International Conference on Learning Representations (ICLR), 2024
Guy Kaplan
Matanel Oren
Yuval Reif
Roy Schwartz
380
28
0
08 Oct 2024
Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Philipp Mondorf
Sondre Wold
Yun Xue
439
2
0
02 Oct 2024
PEAR: Position-Embedding-Agnostic Attention Re-weighting Enhances Retrieval-Augmented Generation with Zero Inference Overhead
The Web Conference (WWW), 2024
Tao Tan
Yining Qian
Ang Lv
Hongzhan Lin
Songhao Wu
Yongbo Wang
Feng Wang
Jingtong Wu
Xin Lu
Rui Yan
190
3
0
29 Sep 2024
Pay Attention to What Matters
Pedro Luiz Silva
Antonio De Domenico
Ali Maatouk
Fadhel Ayed
ALM
122
1
0
19 Sep 2024
Optimal ablation for interpretability
Neural Information Processing Systems (NeurIPS), 2024
Maximilian Li
Lucas Janson
FAtt
319
11
0
16 Sep 2024
Extracting Paragraphs from LLM Token Activations
Nicholas Pochinkov
Angelo Benoit
Lovkush Agarwal
Zainab Ali Majid
Lucile Ter-Minassian
166
6
0
10 Sep 2024
Representational Analysis of Binding in Language Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Qin Dai
Benjamin Heinzerling
Kentaro Inui
301
0
0
09 Sep 2024
Attention Heads of Large Language Models: A Survey
Patterns (Patterns), 2024
Zifan Zheng
Yezhaohui Wang
Yuxin Huang
Shichao Song
Mingchuan Yang
Bo Tang
Feiyu Xiong
Zhiyu Li
LRM
229
61
0
05 Sep 2024
Investigating Neuron Ablation in Attention Heads: The Case for Peak Activation Centering
Nicholas Pochinkov
Ben Pasero
Skylar Shibayama
151
6
0
30 Aug 2024
Can Transformers Do Enumerative Geometry?
International Conference on Learning Representations (ICLR), 2024
Baran Hashemi
Roderic G. Corominas
Alessandro Giacchetto
844
7
0
27 Aug 2024
Multilevel Interpretability Of Artificial Neural Networks: Leveraging Framework And Methods From Neuroscience
Zhonghao He
Jascha Achterberg
Katie Collins
Kevin K. Nejad
Danyal Akarca
...
Chole Li
Kai J. Sandbrink
Stephen Casper
Anna Ivanova
Grace W. Lindsay
AI4CE
249
5
0
22 Aug 2024
The Quest for the Right Mediator: Surveying Mechanistic Interpretability Through the Lens of Causal Mediation Analysis
Computational Linguistics (CL), 2024
Aaron Mueller
Jannik Brinkmann
Millicent Li
Samuel Marks
Koyena Pal
...
Arnab Sen Sharma
Jiuding Sun
Eric Todd
David Bau
Yonatan Belinkov
CML
470
2
0
02 Aug 2024
Penzai + Treescope: A Toolkit for Interpreting, Visualizing, and Editing Models As Data
Mingshu Li
223
6
0
01 Aug 2024
Knowledge Mechanisms in Large Language Models: A Survey and Perspective
Meng Wang
Yunzhi Yao
Ziwen Xu
Shuofei Qiao
Shumin Deng
...
Yong Jiang
Pengjun Xie
Fei Huang
Huajun Chen
Ningyu Zhang
305
58
0
22 Jul 2024
Investigating the Indirect Object Identification circuit in Mamba
Danielle Ensign
Adrià Garriga-Alonso
Mamba
130
0
0
19 Jul 2024
InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques
Rohan Gupta
Iván Arcuschin
Thomas Kwa
Adrià Garriga-Alonso
295
5
0
19 Jul 2024
LLM Circuit Analyses Are Consistent Across Training and Scale
Curt Tigges
Michael Hanna
Qinan Yu
Stella Biderman
253
31
0
15 Jul 2024
Transformer Circuit Faithfulness Metrics are not Robust
Joseph Miller
Bilal Chughtai
William Saunders
185
9
0
11 Jul 2024
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
Daking Rai
Yilun Zhou
Shi Feng
Abulhair Saparov
Ziyu Yao
565
79
0
02 Jul 2024
The Remarkable Robustness of LLMs: Stages of Inference?
Vedang Lad
Wes Gurnee
Max Tegmark
438
81
0
27 Jun 2024
Finding Transformer Circuits with Edge Pruning
Adithya Bhaskar
Alexander Wettig
Dan Friedman
Danqi Chen
441
33
0
24 Jun 2024
What Do VLMs NOTICE? A Mechanistic Interpretability Pipeline for Gaussian-Noise-free Text-Image Corruption and Evaluation
Michal Golovanevsky
William Rudman
Vedant Palit
Ritambhara Singh
Carsten Eickhoff
414
10
0
24 Jun 2024
Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons
Jianhui Chen
Xiaozhi Wang
Zijun Yao
Yushi Bai
Lei Hou
Juanzi Li
LLMSV
KELM
260
26
0
20 Jun 2024
SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models
Somnath Banerjee
Soham Tripathy
Sayan Layek
Shanu Kumar
Animesh Mukherjee
Rima Hazra
185
12
0
18 Jun 2024
Transcoders Find Interpretable LLM Feature Circuits
Jacob Dunefsky
Philippe Chlenski
Neel Nanda
182
79
0
17 Jun 2024
Evidence of Learned Look-Ahead in a Chess-Playing Neural Network
Erik Jenner
Shreyas Kapur
Vasil Georgiev
Cameron Allen
Scott Emmons
Stuart J. Russell
281
20
0
02 Jun 2024
Exploring and steering the moral compass of Large Language Models
Alejandro Tlaie
LLMSV
218
6
0
27 May 2024