
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods

International Conference on Learning Representations (ICLR), 2023

27 September 2023

Fred Zhang

Neel Nanda

LLMSV


Papers citing "Towards Best Practices of Activation Patching in Language Models: Metrics and Methods"

50 / 127 papers shown

Title
Combining Causal Models for More Accurate Abstractions of Neural Networks. CLEaR (CLEaR), 2025 Theodora-Mara Pîslar Sara Magliacane Atticus Geiger AI4CE 232 1 0 14 Mar 2025
Implicit Reasoning in Transformers is Reasoning through Shortcuts. Annual Meeting of the Association for Computational Linguistics (ACL), 2025 Tianhe Lin Jian Xie Siyu Yuan Deqing Yang ReLM LRM 374 7 0 10 Mar 2025
(How) Do Language Models Track State? Belinda Z. Li Zifan Carl Guo Jacob Andreas LRM 383 9 0 04 Mar 2025
Neuroplasticity and Corruption in Model Mechanisms: A Case Study Of Indirect Object Identification. North American Chapter of the Association for Computational Linguistics (NAACL), 2025 Vishnu Kabir Chhabra Ding Zhu Mohammad Mahdi Khalili 289 5 0 27 Feb 2025
Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases. Annual Meeting of the Association for Computational Linguistics (ACL), 2025 Michael Y. Hu Jackson Petty Chuan Shi William Merrill Tal Linzen AI4CE 333 5 0 26 Feb 2025
Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations Lucy Farnik Tim Lawson Conor Houghton Laurence Aitchison 293 5 0 25 Feb 2025
Representation Engineering for Large-Language Models: Survey and Research Challenges Lukasz Bartoszcze Sarthak Munshi Bryan Sukidi Jennifer Yen Zejia Yang David Williams-King Linh Le Kosi Asuzu Carsten Maple 348 4 0 24 Feb 2025
Quantifying Logical Consistency in Transformers via Query-Key Alignment Eduard Tulchinskii Anastasia Voznyuk Laida Kushnareva Andrei Andriiainen Irina Piontkovskaya Evgeny Burnaev Serguei Barannikov LRM 284 0 0 24 Feb 2025
The Validation Gap: A Mechanistic Analysis of How Language Models Compute Arithmetic but Fail to Validate It Leonardo Bertolazzi Philipp Mondorf Yun Xue Raffaella Bernardi AIFin LRM 433 1 0 17 Feb 2025
Exploring Translation Mechanism of Large Language Models Hongbin Zhang Kehai Chen Xuefeng Bai Xiucheng Li Yang Xiang Min Zhang 342 2 0 17 Feb 2025
Back Attention: Understanding and Enhancing Multi-Hop Reasoning in Large Language Models Zeping Yu Yonatan Belinkov Sophia Ananiadou LRM 206 10 0 15 Feb 2025
Padding Tone: A Mechanistic Analysis of Padding Tokens in T2I Models. North American Chapter of the Association for Computational Linguistics (NAACL), 2025 Michael Toker Ido Galil Hadas Orgad Rinon Gal Yoad Tewel Gal Chechik Yonatan Belinkov DiffM 191 5 0 12 Jan 2025
Reversed Attention: On The Gradient Descent Of Attention Layers In GPT Shahar Katz Lior Wolf 100 0 0 22 Dec 2024
Think-to-Talk or Talk-to-Think? When LLMs Come Up with an Answer in Multi-Hop Arithmetic Reasoning Keito Kudo Yoichi Aoki Tatsuki Kuribayashi Shusaku Sone Masaya Taniguchi Ana Brassard Keisuke Sakaguchi Kentaro Inui ReLM LRM 354 0 0 02 Dec 2024
Understanding Multimodal LLMs: the Mechanistic Interpretability of Llava in Visual Question Answering Zeping Yu Sophia Ananiadou 998 8 0 17 Nov 2024
JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit Zeqing He Peng Kuang Zhixuan Chu Huiyu Xu Rui Zheng Kui Ren Chun Chen 352 16 0 17 Nov 2024
Towards Unifying Interpretability and Control: Evaluation via Intervention Usha Bhalla Suraj Srinivas Asma Ghandeharioun Himabindu Lakkaraju 343 17 0 07 Nov 2024
A Implies B: Circuit Analysis in LLMs for Propositional Logical Reasoning Guan Zhe Hong Nishanth Dikkala Enming Luo Cyrus Rashtchian Xin Wang Rina Panigrahy OffRL LRM NAI 344 0 0 06 Nov 2024
Do Mice Grok? Glimpses of Hidden Progress During Overtraining in Sensory Cortex Tanishq Kumar Blake Bordelon Cengiz Pehlevan Venkatesh N. Murthy Samuel Gershman OOD CLL SSL 319 0 0 05 Nov 2024
SPES: Spectrogram Perturbation for Explainable Speech-to-Text Generation Dennis Fucci Marco Gaido Beatrice Savoldi Matteo Negri Mauro Cettolo L. Bentivogli 531 5 0 03 Nov 2024
Abrupt Learning in Transformers: A Case Study on Matrix Completion. Neural Information Processing Systems (NeurIPS), 2024 Pulkit Gopalani Ekdeep Singh Lubana Wei Hu 159 7 0 29 Oct 2024
On the Role of Attention Heads in Large Language Model Safety. International Conference on Learning Representations (ICLR), 2024 Zhenhong Zhou Haiyang Yu Xinghua Zhang Rongwu Xu Fei Huang Kun Wang Yang Liu Cunchun Li Yongbin Li 408 36 0 17 Oct 2024
Hypothesis Testing the Circuit Hypothesis in LLMs. Neural Information Processing Systems (NeurIPS), 2024 Claudia Shi Nicolas Beltran-Velez Achille Nazaret Carolina Zheng Adrià Garriga-Alonso Andrew Jesson Maggie Makar David M. Blei 229 18 0 16 Oct 2024
From Tokens to Words: On the Inner Lexicon of LLMs. International Conference on Learning Representations (ICLR), 2024 Guy Kaplan Matanel Oren Yuval Reif Roy Schwartz 380 28 0 08 Oct 2024
Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models. Annual Meeting of the Association for Computational Linguistics (ACL), 2024 Philipp Mondorf Sondre Wold Yun Xue 439 2 0 02 Oct 2024
PEAR: Position-Embedding-Agnostic Attention Re-weighting Enhances Retrieval-Augmented Generation with Zero Inference Overhead. The Web Conference (WWW), 2024 Tao Tan Yining Qian Ang Lv Hongzhan Lin Songhao Wu Yongbo Wang Feng Wang Jingtong Wu Xin Lu Rui Yan 190 3 0 29 Sep 2024
Pay Attention to What Matters Pedro Luiz Silva Antonio De Domenico Ali Maatouk Fadhel Ayed ALM 122 1 0 19 Sep 2024
Optimal ablation for interpretability. Neural Information Processing Systems (NeurIPS), 2024 Maximilian Li Lucas Janson FAtt 319 11 0 16 Sep 2024
Extracting Paragraphs from LLM Token Activations Nicholas Pochinkov Angelo Benoit Lovkush Agarwal Zainab Ali Majid Lucile Ter-Minassian 166 6 0 10 Sep 2024
Representational Analysis of Binding in Language Models. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024 Qin Dai Benjamin Heinzerling Kentaro Inui 301 0 0 09 Sep 2024
Attention Heads of Large Language Models: A Survey. Patterns (Patterns), 2024 Zifan Zheng Yezhaohui Wang Yuxin Huang Shichao Song Mingchuan Yang Bo Tang Feiyu Xiong Zhiyu Li LRM 229 61 0 05 Sep 2024
Investigating Neuron Ablation in Attention Heads: The Case for Peak Activation Centering Nicholas Pochinkov Ben Pasero Skylar Shibayama 151 6 0 30 Aug 2024
Can Transformers Do Enumerative Geometry? International Conference on Learning Representations (ICLR), 2024 Baran Hashemi Roderic G. Corominas Alessandro Giacchetto 844 7 0 27 Aug 2024
Multilevel Interpretability Of Artificial Neural Networks: Leveraging Framework And Methods From Neuroscience Zhonghao He Jascha Achterberg Katie Collins Kevin K. Nejad Danyal Akarca ... Chole Li Kai J. Sandbrink Stephen Casper Anna Ivanova Grace W. Lindsay AI4CE 249 5 0 22 Aug 2024
The Quest for the Right Mediator: Surveying Mechanistic Interpretability Through the Lens of Causal Mediation Analysis. Computational Linguistics (CL), 2024 Aaron Mueller Jannik Brinkmann Millicent Li Samuel Marks Koyena Pal ... Arnab Sen Sharma Jiuding Sun Eric Todd David Bau Yonatan Belinkov CML 470 2 0 02 Aug 2024
Penzai + Treescope: A Toolkit for Interpreting, Visualizing, and Editing Models As Data Mingshu Li 223 6 0 01 Aug 2024
Knowledge Mechanisms in Large Language Models: A Survey and Perspective Meng Wang Yunzhi Yao Ziwen Xu Shuofei Qiao Shumin Deng ... Yong Jiang Pengjun Xie Fei Huang Huajun Chen Ningyu Zhang 305 58 0 22 Jul 2024
Investigating the Indirect Object Identification circuit in Mamba Danielle Ensign Adrià Garriga-Alonso Mamba 130 0 0 19 Jul 2024
InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques Rohan Gupta Iván Arcuschin Thomas Kwa Adrià Garriga-Alonso 295 5 0 19 Jul 2024
LLM Circuit Analyses Are Consistent Across Training and Scale Curt Tigges Michael Hanna Qinan Yu Stella Biderman 253 31 0 15 Jul 2024
Transformer Circuit Faithfulness Metrics are not Robust Joseph Miller Bilal Chughtai William Saunders 185 9 0 11 Jul 2024
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models Daking Rai Yilun Zhou Shi Feng Abulhair Saparov Ziyu Yao 565 79 0 02 Jul 2024
The Remarkable Robustness of LLMs: Stages of Inference? Vedang Lad Wes Gurnee Max Tegmark 438 81 0 27 Jun 2024
Finding Transformer Circuits with Edge Pruning Adithya Bhaskar Alexander Wettig Dan Friedman Danqi Chen 441 33 0 24 Jun 2024
What Do VLMs NOTICE? A Mechanistic Interpretability Pipeline for Gaussian-Noise-free Text-Image Corruption and Evaluation Michal Golovanevsky William Rudman Vedant Palit Ritambhara Singh Carsten Eickhoff 414 10 0 24 Jun 2024
Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons Jianhui Chen Xiaozhi Wang Zijun Yao Yushi Bai Lei Hou Juanzi Li LLMSV KELM 260 26 0 20 Jun 2024
SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models Somnath Banerjee Soham Tripathy Sayan Layek Shanu Kumar Animesh Mukherjee Rima Hazra 185 12 0 18 Jun 2024
Transcoders Find Interpretable LLM Feature Circuits Jacob Dunefsky Philippe Chlenski Neel Nanda 182 79 0 17 Jun 2024
Evidence of Learned Look-Ahead in a Chess-Playing Neural Network Erik Jenner Shreyas Kapur Vasil Georgiev Cameron Allen Scott Emmons Stuart J. Russell 281 20 0 02 Jun 2024
Exploring and steering the moral compass of Large Language Models Alejandro Tlaie LLMSV 218 6 0 27 May 2024