2309.16042
Cited By
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods
27 September 2023
Fred Zhang
Neel Nanda
LLMSV
Papers citing
"Towards Best Practices of Activation Patching in Language Models: Metrics and Methods"
50 / 86 papers shown
Title
Self-Ablating Transformers: More Interpretability, Less Sparsity
Jeremias Ferrao
Luhan Mikaelson
Keenan Pepper
Natalia Perez-Campanero Antolin
MILM
16
0
0
01 May 2025
Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition
Zhengfu He
J. Wang
Rui Lin
Xuyang Ge
Wentao Shu
Qiong Tang
J. Zhang
Xipeng Qiu
68
0
0
29 Apr 2025
Functional Abstraction of Knowledge Recall in Large Language Models
Zijian Wang
Chang Xu
KELM
22
0
0
20 Apr 2025
MIB: A Mechanistic Interpretability Benchmark
Aaron Mueller
Atticus Geiger
Sarah Wiegreffe
Dana Arad
Iván Arcuschin
...
Alessandro Stolfo
Martin Tutek
Amir Zur
David Bau
Yonatan Belinkov
33
1
0
17 Apr 2025
How do Large Language Models Understand Relevance? A Mechanistic Interpretability Perspective
Qi Liu
Jiaxin Mao
Ji-Rong Wen
LRM
19
0
0
10 Apr 2025
Combining Causal Models for More Accurate Abstractions of Neural Networks
Theodora-Mara Pîslar
Sara Magliacane
Atticus Geiger
AI4CE
43
0
0
14 Mar 2025
Implicit Reasoning in Transformers is Reasoning through Shortcuts
Tianhe Lin
Jian Xie
Siyu Yuan
Deqing Yang
ReLM
LRM
62
2
0
10 Mar 2025
(How) Do Language Models Track State?
Belinda Z. Li
Zifan Carl Guo
Jacob Andreas
LRM
44
0
0
04 Mar 2025
Neuroplasticity and Corruption in Model Mechanisms: A Case Study Of Indirect Object Identification
Vishnu Kabir Chhabra
Ding Zhu
Mohammad Mahdi Khalili
37
2
0
27 Feb 2025
Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases
Michael Y. Hu
Jackson Petty
Chuan Shi
William Merrill
Tal Linzen
AI4CE
54
1
0
26 Feb 2025
Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations
Lucy Farnik
Tim Lawson
Conor Houghton
Laurence Aitchison
52
0
0
25 Feb 2025
Quantifying Logical Consistency in Transformers via Query-Key Alignment
Eduard Tulchinskii
Anastasia Voznyuk
Laida Kushnareva
Andrei Andriiainen
Irina Piontkovskaya
Evgeny Burnaev
Serguei Barannikov
LRM
59
0
0
24 Feb 2025
Representation Engineering for Large-Language Models: Survey and Research Challenges
Lukasz Bartoszcze
Sarthak Munshi
Bryan Sukidi
Jennifer Yen
Zejia Yang
David Williams-King
Linh Le
Kosi Asuzu
Carsten Maple
98
0
0
24 Feb 2025
Exploring Translation Mechanism of Large Language Models
Hongbin Zhang
Kehai Chen
Xuefeng Bai
Xiucheng Li
Yang Xiang
Min Zhang
47
1
0
17 Feb 2025
The Validation Gap: A Mechanistic Analysis of How Language Models Compute Arithmetic but Fail to Validate It
Leonardo Bertolazzi
Philipp Mondorf
Barbara Plank
Raffaella Bernardi
AIFin
LRM
59
0
0
17 Feb 2025
Padding Tone: A Mechanistic Analysis of Padding Tokens in T2I Models
Michael Toker
Ido Galil
Hadas Orgad
Rinon Gal
Yoad Tewel
Gal Chechik
Yonatan Belinkov
DiffM
49
2
0
12 Jan 2025
Think-to-Talk or Talk-to-Think? When LLMs Come Up with an Answer in Multi-Step Arithmetic Reasoning
Keito Kudo
Yoichi Aoki
Tatsuki Kuribayashi
Shusaku Sone
Masaya Taniguchi
Ana Brassard
Keisuke Sakaguchi
Kentaro Inui
ReLM
LRM
69
0
0
02 Dec 2024
Understanding Multimodal LLMs: the Mechanistic Interpretability of Llava in Visual Question Answering
Zeping Yu
Sophia Ananiadou
40
0
0
17 Nov 2024
JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit
Zeqing He
Zhibo Wang
Zhixuan Chu
Huiyu Xu
Rui Zheng
Kui Ren
Chun Chen
36
3
0
17 Nov 2024
Towards Unifying Interpretability and Control: Evaluation via Intervention
Usha Bhalla
Suraj Srinivas
Asma Ghandeharioun
Himabindu Lakkaraju
22
5
0
07 Nov 2024
How Transformers Solve Propositional Logic Problems: A Mechanistic Analysis
Guan Zhe Hong
Nishanth Dikkala
Enming Luo
Cyrus Rashtchian
Xin Wang
Rina Panigrahy
OffRL
LRM
NAI
19
0
0
06 Nov 2024
Do Mice Grok? Glimpses of Hidden Progress During Overtraining in Sensory Cortex
Tanishq Kumar
Blake Bordelon
C. Pehlevan
Venkatesh N. Murthy
Samuel Gershman
OOD
CLL
SSL
33
0
0
05 Nov 2024
SPES: Spectrogram Perturbation for Explainable Speech-to-Text Generation
Dennis Fucci
Marco Gaido
Beatrice Savoldi
Matteo Negri
Mauro Cettolo
L. Bentivogli
38
1
0
03 Nov 2024
Abrupt Learning in Transformers: A Case Study on Matrix Completion
Pulkit Gopalani
Ekdeep Singh Lubana
Wei Hu
27
3
0
29 Oct 2024
On the Role of Attention Heads in Large Language Model Safety
Z. Zhou
Haiyang Yu
Xinghua Zhang
Rongwu Xu
Fei Huang
Kun Wang
Yang Liu
Junfeng Fang
Yongbin Li
38
5
0
17 Oct 2024
Hypothesis Testing the Circuit Hypothesis in LLMs
Claudia Shi
Nicolas Beltran-Velez
Achille Nazaret
Carolina Zheng
Adrià Garriga-Alonso
Andrew Jesson
Maggie Makar
David M. Blei
29
6
0
16 Oct 2024
From Tokens to Words: On the Inner Lexicon of LLMs
Guy Kaplan
Matanel Oren
Yuval Reif
Roy Schwartz
32
12
0
08 Oct 2024
Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models
Philipp Mondorf
Sondre Wold
Barbara Plank
29
0
0
02 Oct 2024
PEAR: Position-Embedding-Agnostic Attention Re-weighting Enhances Retrieval-Augmented Generation with Zero Inference Overhead
Tao Tan
Yining Qian
Ang Lv
Hongzhan Lin
Songhao Wu
Yongbo Wang
Feng Wang
Jingtong Wu
Xin Lu
Rui Yan
17
1
0
29 Sep 2024
Pay Attention to What Matters
Pedro Luiz Silva
Antonio De Domenico
Ali Maatouk
Fadhel Ayed
ALM
17
0
0
19 Sep 2024
Optimal ablation for interpretability
Maximilian Li
Lucas Janson
FAtt
39
2
0
16 Sep 2024
Extracting Paragraphs from LLM Token Activations
Nicholas Pochinkov
Angelo Benoit
Lovkush Agarwal
Zainab Ali Majid
Lucile Ter-Minassian
12
1
0
10 Sep 2024
Representational Analysis of Binding in Language Models
Qin Dai
Benjamin Heinzerling
Kentaro Inui
21
1
0
09 Sep 2024
Attention Heads of Large Language Models: A Survey
Zifan Zheng
Yezhaohui Wang
Yuxin Huang
Shichao Song
Mingchuan Yang
Bo Tang
Feiyu Xiong
Zhiyu Li
LRM
41
1
0
05 Sep 2024
Investigating Neuron Ablation in Attention Heads: The Case for Peak Activation Centering
Nicholas Pochinkov
Ben Pasero
Skylar Shibayama
14
0
0
30 Aug 2024
Can Transformers Do Enumerative Geometry?
Baran Hashemi
Roderic G. Corominas
Alessandro Giacchetto
32
2
0
27 Aug 2024
Multilevel Interpretability Of Artificial Neural Networks: Leveraging Framework And Methods From Neuroscience
Zhonghao He
Jascha Achterberg
Katie Collins
Kevin K. Nejad
Danyal Akarca
...
Chole Li
Kai J. Sandbrink
Stephen Casper
Anna Ivanova
Grace W. Lindsay
AI4CE
18
1
0
22 Aug 2024
The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability
Aaron Mueller
Jannik Brinkmann
Millicent Li
Samuel Marks
Koyena Pal
...
Arnab Sen Sharma
Jiuding Sun
Eric Todd
David Bau
Yonatan Belinkov
CML
33
18
0
02 Aug 2024
Penzai + Treescope: A Toolkit for Interpreting, Visualizing, and Editing Models As Data
Mingshu Li
16
3
0
01 Aug 2024
Knowledge Mechanisms in Large Language Models: A Survey and Perspective
Meng Wang
Yunzhi Yao
Ziwen Xu
Shuofei Qiao
Shumin Deng
...
Yong-jia Jiang
Pengjun Xie
Fei Huang
Huajun Chen
Ningyu Zhang
39
1
0
22 Jul 2024
InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques
Rohan Gupta
Iván Arcuschin
Thomas Kwa
Adrià Garriga-Alonso
32
2
0
19 Jul 2024
Investigating the Indirect Object Identification circuit in Mamba
Danielle Ensign
Adrià Garriga-Alonso
Mamba
16
0
0
19 Jul 2024
LLM Circuit Analyses Are Consistent Across Training and Scale
Curt Tigges
Michael Hanna
Qinan Yu
Stella Biderman
18
10
0
15 Jul 2024
Transformer Circuit Faithfulness Metrics are not Robust
Joseph Miller
Bilal Chughtai
William Saunders
32
7
0
11 Jul 2024
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
Daking Rai
Yilun Zhou
Shi Feng
Abulhair Saparov
Ziyu Yao
49
18
0
02 Jul 2024
The Remarkable Robustness of LLMs: Stages of Inference?
Vedang Lad
Wes Gurnee
Max Tegmark
20
33
0
27 Jun 2024
What Do VLMs NOTICE? A Mechanistic Interpretability Pipeline for Gaussian-Noise-free Text-Image Corruption and Evaluation
Michal Golovanevsky
William Rudman
Vedant Palit
Ritambhara Singh
Carsten Eickhoff
24
1
0
24 Jun 2024
Finding Transformer Circuits with Edge Pruning
Adithya Bhaskar
Alexander Wettig
Dan Friedman
Danqi Chen
44
14
0
24 Jun 2024
Finding Safety Neurons in Large Language Models
Jianhui Chen
Xiaozhi Wang
Zijun Yao
Yushi Bai
Lei Hou
Juanzi Li
KELM
LLMSV
45
11
0
20 Jun 2024
SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models
Somnath Banerjee
Soham Tripathy
Sayan Layek
Shanu Kumar
Animesh Mukherjee
Rima Hazra
17
1
0
18 Jun 2024