ResearchTrend.AI

Towards Best Practices of Activation Patching in Language Models: Metrics and Methods

International Conference on Learning Representations (ICLR), 2023
27 September 2023
Fred Zhang
Neel Nanda
    LLMSV

Papers citing "Towards Best Practices of Activation Patching in Language Models: Metrics and Methods"

50 / 127 papers shown
No Free Lunch in Language Model Bias Mitigation? Targeted Bias Reduction Can Exacerbate Unmitigated LLM Biases
Shireen Chand, Faith Baca, Emilio Ferrara
23 Nov 2025
Understanding Counting Mechanisms in Large Language and Vision-Language Models
Hosein Hasani, Amirmohammad Izadi, Fatemeh Askari, Mobin Bagherian, Sadegh Mohammadian, Mohammad Izadi, M. Baghshah
21 Nov 2025
Anatomy of an Idiom: Tracing Non-Compositionality in Language Models
Andrew Gomes
20 Nov 2025
BlockCert: Certified Blockwise Extraction of Transformer Mechanisms
Sandro Andric
20 Nov 2025
Training Language Models to Explain Their Own Computations
Belinda Z. Li, Zifan Carl Guo, Vincent Huang, Jacob Steinhardt, Jacob Andreas
LRM
11 Nov 2025
APP: Accelerated Path Patching with Task-Specific Pruning
Frauke Andersen, William Rudman, Ruochen Zhang, Carsten Eickhoff
07 Nov 2025
Addressing divergent representations from causal interventions on neural networks
Satchel Grant, Simon Jerome Han, Alexa R. Tartaglini, Christopher Potts
CML
06 Nov 2025
LLMs Process Lists With General Filter Heads
Arnab Sen Sharma, Giordano Rogers, Natalie Shapira, David Bau
30 Oct 2025
Explainability of Large Language Models: Opportunities and Challenges toward Generating Trustworthy Explanations
Shahin Atakishiyev, H. Babiker, Jiayi Dai, Nawshad Farruque, Teruaki Hayashi, ..., Md Abed Rahman, Iain Smith, Mi-Young Kim, Osmar R. Zaïane, Randy Goebel
LRM
20 Oct 2025
How role-play shapes relevance judgment in zero-shot LLM rankers
Yumeng Wang, Jirui Qi, Catherine Chen, Panagiotis Eustratiadis, Suzan Verberne
20 Oct 2025
DARTS-GT: Differentiable Architecture Search for Graph Transformers with Quantifiable Instance-Specific Interpretability Analysis
Shruti Sarika Chakraborty, Peter Minary
16 Oct 2025
Analysing Moral Bias in Finetuned LLMs through Mechanistic Interpretability
Bianca Raimondi, Daniela Dalbagno, Maurizio Gabbrielli
AI4CE
14 Oct 2025
The Curious Case of Factual (Mis)Alignment between LLMs' Short- and Long-Form Answers
Saad Obaid ul Islam, Anne Lauscher, Goran Glavaš
HILM
13 Oct 2025
Medical Interpretability and Knowledge Maps of Large Language Models
Razvan Marinescu, Victoria-Elisabeth Gruber, Diego Fajardo
FAtt, AI4MH
13 Oct 2025
Discursive Circuits: How Do Language Models Understand Discourse Relations?
Yisong Miao, Min-Yen Kan
13 Oct 2025
Causality ≠ Decodability, and Vice Versa: Lessons from Interpreting Counting ViTs
Lianghuan Huang, Yingshan Chang
CML
10 Oct 2025
Inverse-Free Wilson Loops for Transformers: A Practical Diagnostic for Invariance and Order Sensitivity
Edward Y. Chang, Ethan Chang
09 Oct 2025
Validation of Various Normalization Methods for Brain Tumor Segmentation: Can Federated Learning Overcome This Heterogeneity?
Jan Fiszer, Dominika Ciupek, Maciej Malawski
FedML
08 Oct 2025
Reproducing and Extending Causal Insights Into Term Frequency Computation in Neural Rankers
Cile van Marken, Roxana Petcu
CML
08 Oct 2025
Mechanistic Interpretability as Statistical Estimation: A Variance Analysis of EAP-IG
Maxime Méloux, François Portet, Maxime Peyrard
01 Oct 2025
Query Circuits: Explaining How Language Models Answer User Prompts
Tung-Yu Wu, Fazl Barez
ReLM, LRM
29 Sep 2025
Toward Preference-aligned Large Language Models via Residual-based Model Steering
Lucio La Cava, Andrea Tagarelli
LLMSV
28 Sep 2025
What Matters More For In-Context Learning under Matched Compute Budgets: Pretraining on Natural Text or Incorporating Targeted Synthetic Examples?
Mohammed Sabry, Anya Belz
26 Sep 2025
Can Large Language Models Develop Gambling Addiction?
Seungpil Lee, Donghyeon Shin, Yunjeong Lee, Sundong Kim
AIFin, AI4CE
26 Sep 2025
How Persuasive is Your Context?
Tu Nguyen, Kevin Du, Alexander Miserlis Hoyle, Ryan Cotterell
22 Sep 2025
V-SEAM: Visual Semantic Editing and Attention Modulating for Causal Interpretability of Vision-Language Models
Qidong Wang, Junjie Hu, Ming Jiang
18 Sep 2025
Statistical Methods in Generative AI
Edgar Dobriban
08 Sep 2025
A Review of Developmental Interpretability in Large Language Models
Ihor Kendiukhov
ELM
19 Aug 2025
How Causal Abstraction Underpins Computational Explanation
Atticus Geiger, Jacqueline Harding, Thomas Icard
15 Aug 2025
Understanding and Mitigating Political Stance Cross-topic Generalization in Large Language Models
J. Zhang, Shu Yang, Junchao Wu, Yang Li, Haiyan Zhao
04 Aug 2025
Unveiling the Influence of Amplifying Language-Specific Neurons
Inaya Rahmanisa, Lyzander Marciano Andrylie, Mahardika Krisna Ihsani, Alfan Farizki Wicaksono, Haryo Akbarianto Wibowo, Alham Fikri Aji
30 Jul 2025
Dissecting Persona-Driven Reasoning in Language Models via Activation Patching
Ansh Poonia, Maeghal Jain
28 Jul 2025
Latent Concept Disentanglement in Transformer-based Language Models
Guan Zhe Hong, Bhavya Vasudeva, Willie Neiswanger, Cyrus Rashtchian, Prabhakar Raghavan, Rina Panigrahy
ReLM, LRM
20 Jun 2025
Rethinking Explainability in the Era of Multimodal AI
Chirag Agarwal
16 Jun 2025
Universal Jailbreak Suffixes Are Strong Attention Hijackers
Matan Ben-Tov, Mor Geva, Mahmood Sharif
15 Jun 2025
Path Channels and Plan Extension Kernels: a Mechanistic Description of Planning in a Sokoban RNN
Mohammad Taufeeque, Aaron David Tucker, Adam Gleave, Adrià Garriga-Alonso
11 Jun 2025
Learning Distribution-Wise Control in Representation Space for Language Models
Chunyuan Deng, Ruidi Chang, Hanjie Chen
07 Jun 2025
Establishing Trustworthy LLM Evaluation via Shortcut Neuron Analysis
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Kejian Zhu, Shangqing Tu, Zhuoran Jin, Lei Hou, Juanzi Li, Jun Zhao
KELM
04 Jun 2025
Can LLMs Reason Abstractly Over Math Word Problems Without CoT? Disentangling Abstract Formulation From Arithmetic Computation
Ziling Cheng, Meng Cao, Leila Pishdad, Yanshuai Cao, Jackie Chi Kit Cheung
LRM
29 May 2025
An Empirical Study of the Anchoring Effect in LLMs: Existence, Mechanism, and Potential Mitigations
Yiming Huang, Biquan Bie, Zuqiu Na, Weilin Ruan, Songxin Lei, Yutao Yue, Xinlei He
21 May 2025
Explaining Neural Networks with Reasons
Levin Hornischer, Hannes Leitgeb
FAtt, AAML, MILM
20 May 2025
Language Models Are Capable of Metacognitive Monitoring and Control of Their Internal Activations
Li Ji-An, Hua-Dong Xiong, Robert C. Wilson, Marcelo G. Mattar, M. Benna
19 May 2025
SPIRIT: Patching Speech Language Models against Jailbreak Attacks
Amirbek Djanibekov, Nurdaulet Mukhituly, Kentaro Inui, Hanan Aldarmaki, Nils Lukas
AAML
18 May 2025
Rethinking Circuit Completeness in Language Models: AND, OR, and ADDER Gates
Hang Chen, Jiaying Zhu, Xinyu Yang, Wenya Wang
LRM
15 May 2025
Interpreting Multilingual and Document-Length Sensitive Relevance Computations in Neural Retrieval Models through Axiomatic Causal Interventions
Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2025
Oliver Savolainen, Dur e Najaf Amjad, Roxana Petcu
AAML
04 May 2025
Self-Ablating Transformers: More Interpretability, Less Sparsity
Jeremias Ferrao, Luhan Mikaelson, Keenan Pepper, Natalia Perez-Campanero Antolin
MILM
01 May 2025
Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition
Zhengfu He, Jiadong Wang, Rui Lin, Xuyang Ge, Wentao Shu, Qiong Tang, J.N. Zhang, Xipeng Qiu
29 Apr 2025
Functional Abstraction of Knowledge Recall in Large Language Models
Zijian Wang, Chang Xu
KELM
20 Apr 2025
MIB: A Mechanistic Interpretability Benchmark
Aaron Mueller, Atticus Geiger, Sarah Wiegreffe, Dana Arad, Iván Arcuschin, ..., Alessandro Stolfo, Martin Tutek, Amir Zur, David Bau, Yonatan Belinkov
17 Apr 2025
How do Large Language Models Understand Relevance? A Mechanistic Interpretability Perspective
Qi Liu, Jiaxin Mao, Ji-Rong Wen
LRM
10 Apr 2025