Towards Best Practices of Activation Patching in Language Models: Metrics and Methods

International Conference on Learning Representations (ICLR), 2024

27 September 2023

Fred Zhang

Neel Nanda

LLMSV

Papers citing "Towards Best Practices of Activation Patching in Language Models: Metrics and Methods"

50 / 128 papers shown

No Free Lunch in Language Model Bias Mitigation? Targeted Bias Reduction Can Exacerbate Unmitigated LLM Biases

Shireen Chand

Faith Baca

Emilio Ferrara

124

23 Nov 2025

Understanding Counting Mechanisms in Large Language and Vision-Language Models

21 Nov 2025

BlockCert: Certified Blockwise Extraction of Transformer Mechanisms

Sandro Andric

20 Nov 2025

Anatomy of an Idiom: Tracing Non-Compositionality in Language Models

Andrew Gomes

185

20 Nov 2025

Training Language Models to Explain Their Own Computations

219

11 Nov 2025

APP: Accelerated Path Patching with Task-Specific Pruning

07 Nov 2025

Addressing divergent representations from causal interventions on neural networks

464

06 Nov 2025

LLMs Process Lists With General Filter Heads

154

30 Oct 2025

How role-play shapes relevance judgment in zero-shot LLM rankers

Yumeng Wang

Jirui Qi

Catherine Chen

Panagiotis Eustratiadis

Suzan Verberne

20 Oct 2025

Explainability of Large Language Models: Opportunities and Challenges toward Generating Trustworthy Explanations

...

165

20 Oct 2025

DARTS-GT: Differentiable Architecture Search for Graph Transformers with Quantifiable Instance-Specific Interpretability Analysis

Shruti Sarika Chakraborty

Peter Minary

188

16 Oct 2025

Analysing Moral Bias in Finetuned LLMs through Mechanistic Interpretability

14 Oct 2025

Medical Interpretability and Knowledge Maps of Large Language Models

Razvan Marinescu

Victoria-Elisabeth Gruber

Diego Fajardo

FAtt AI4MH

238

13 Oct 2025

Discursive Circuits: How Do Language Models Understand Discourse Relations?

Yisong Miao

Min-Yen Kan

143

13 Oct 2025

The Curious Case of Factual (Mis)Alignment between LLMs' Short- and Long-Form Answers

212

13 Oct 2025

Causality $\neq$ Decodability, and Vice Versa: Lessons from Interpreting Counting ViTs

Lianghuan Huang

Yingshan Chang

CML

10 Oct 2025

Inverse-Free Wilson Loops for Transformers: A Practical Diagnostic for Invariance and Order Sensitivity

Edward Y. Chang

Ethan Chang

09 Oct 2025

Validation of Various Normalization Methods for Brain Tumor Segmentation: Can Federated Learning Overcome This Heterogeneity?

212

08 Oct 2025

Reproducing and Extending Causal Insights Into Term Frequency Computation in Neural Rankers

Cile van Marken

Roxana Petcu

CML

172

08 Oct 2025

Mechanistic Interpretability as Statistical Estimation: A Variance Analysis of EAP-IG

Maxime Méloux

François Portet

Maxime Peyrard

166

01 Oct 2025

Query Circuits: Explaining How Language Models Answer User Prompts

Tung-Yu Wu

Fazl Barez

ReLM LRM

154

29 Sep 2025

Toward Preference-aligned Large Language Models via Residual-based Model Steering

Lucio La Cava

Andrea Tagarelli

LLMSV

162

28 Sep 2025

What Matters More For In-Context Learning under Matched Compute Budgets: Pretraining on Natural Text or Incorporating Targeted Synthetic Examples?

Mohammed Sabry

Anya Belz

26 Sep 2025

Can Large Language Models Develop Gambling Addiction?

240

26 Sep 2025

How Persuasive is Your Context?

Tu Nguyen

Kevin Du

Alexander Miserlis Hoyle

Ryan Cotterell

112

22 Sep 2025

V-SEAM: Visual Semantic Editing and Attention Modulating for Causal Interpretability of Vision-Language Models

Qidong Wang

Junjie Hu

Ming Jiang

101

18 Sep 2025

Statistical Methods in Generative AI

Edgar Dobriban

289

08 Sep 2025

A Review of Developmental Interpretability in Large Language Models

Ihor Kendiukhov

ELM

204

19 Aug 2025

How Causal Abstraction Underpins Computational Explanation

Atticus Geiger

Jacqueline Harding

Thomas Icard

141

15 Aug 2025

Understanding and Mitigating Political Stance Cross-topic Generalization in Large Language Models

208

04 Aug 2025

Unveiling the Influence of Amplifying Language-Specific Neurons

Inaya Rahmanisa

Lyzander Marciano Andrylie

Mahardika Krisna Ihsani

Alfan Farizki Wicaksono

Haryo Akbarianto Wibowo

Alham Fikri Aji

141

30 Jul 2025

Dissecting Persona-Driven Reasoning in Language Models via Activation Patching

Ansh Poonia

Maeghal Jain

211

28 Jul 2025

Latent Concept Disentanglement in Transformer-based Language Models

336

20 Jun 2025

Rethinking Explainability in the Era of Multimodal AI

Chirag Agarwal

230

16 Jun 2025

Universal Jailbreak Suffixes Are Strong Attention Hijackers

Matan Ben-Tov

Mor Geva

Mahmood Sharif

207

15 Jun 2025

Path Channels and Plan Extension Kernels: a Mechanistic Description of Planning in a Sokoban RNN

291

11 Jun 2025

Learning Distribution-Wise Control in Representation Space for Language Models

Chunyuan Deng

Ruidi Chang

Hanjie Chen

267

07 Jun 2025

Dissecting Bias in LLMs: A Mechanistic Interpretability Perspective

Bhavik Chandna

Zubair Bashir

Procheta Sen

280

05 Jun 2025

Establishing Trustworthy LLM Evaluation via Shortcut Neuron Analysis

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

223

04 Jun 2025

Can LLMs Reason Abstractly Over Math Word Problems Without CoT? Disentangling Abstract Formulation From Arithmetic Computation

Jackie Chi Kit Cheung

LRM

353

29 May 2025

An Empirical Study of the Anchoring Effect in LLMs: Existence, Mechanism, and Potential Mitigations

225

21 May 2025

Explaining Neural Networks with Reasons

Levin Hornischer

Hannes Leitgeb

FAtt AAML MILM

319

20 May 2025

Language Models Are Capable of Metacognitive Monitoring and Control of Their Internal Activations

354

19 May 2025

SPIRIT: Patching Speech Language Models against Jailbreak Attacks

297

18 May 2025

Rethinking Circuit Completeness in Language Models: AND, OR, and ADDER Gates

334

15 May 2025

Interpreting Multilingual and Document-Length Sensitive Relevance Computations in Neural Retrieval Models through Axiomatic Causal Interventions

Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2025

191

04 May 2025

Self-Ablating Transformers: More Interpretability, Less Sparsity

Jeremias Ferrao

Luhan Mikaelson

Keenan Pepper

Natalia Perez-Campanero Antolin

MILM

270

01 May 2025

Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition

302

29 Apr 2025

Functional Abstraction of Knowledge Recall in Large Language Models

Zijian Wang

Chang Xu

KELM

247

20 Apr 2025

MIB: A Mechanistic Interpretability Benchmark

...

673

17 Apr 2025