arXiv: 2309.16042

Towards Best Practices of Activation Patching in Language Models: Metrics and Methods

International Conference on Learning Representations (ICLR), 2024
27 September 2023
Fred Zhang
Neel Nanda
    LLMSV

Papers citing "Towards Best Practices of Activation Patching in Language Models: Metrics and Methods"

27 / 127 papers shown
No Two Devils Alike: Unveiling Distinct Mechanisms of Fine-tuning Attacks
Chak Tou Leong
Yi Cheng
Kaishuai Xu
Jian Wang
Hanlin Wang
Wenjie Li
AAML
332
28
0
25 May 2024
Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization
Boshi Wang
Xiang Yue
Yu-Chuan Su
Huan Sun
LRM
328
72
0
23 May 2024
Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models
Charles O'Neill
Thang Bui
178
12
0
21 May 2024
A Philosophical Introduction to Language Models - Part II: The Way Forward
Raphael Milliere
Cameron Buckner
LRM
238
24
0
06 May 2024
How to use and interpret activation patching
Stefan Heimersheim
Neel Nanda
208
93
0
23 Apr 2024
Mechanistic Interpretability for AI Safety -- A Review
Leonard Bereska
E. Gavves
AI4CE
316
288
0
22 Apr 2024
Decomposing and Editing Predictions by Modeling Model Computation
Harshay Shah
Andrew Ilyas
Aleksander Madry
KELM
270
23
0
17 Apr 2024
Finding Visual Task Vectors
Alberto Hojel
Yutong Bai
Trevor Darrell
Amir Globerson
Amir Bar
228
14
0
08 Apr 2024
Locating and Editing Factual Associations in Mamba
Arnab Sen Sharma
David Atkinson
David Bau
KELM
210
37
0
04 Apr 2024
Unveiling LLMs: The Evolution of Latent Representations in a Temporal Knowledge Graph
Marco Bronzini
Carlo Nicolini
Bruno Lepri
Jacopo Staiano
Baptiste Caramiaux
KELM
164
0
0
04 Apr 2024
On Large Language Models' Hallucination with Regard to Known Facts
Che Jiang
Biqing Qi
Xiangyu Hong
Dayuan Fu
Yang Cheng
Fandong Meng
Mo Yu
Bowen Zhou
Jie Zhou
HILM, LRM
236
42
0
29 Mar 2024
Localizing Paragraph Memorization in Language Models
Niklas Stoehr
Mitchell Gordon
Chiyuan Zhang
Owen Lewis
MU
179
24
0
28 Mar 2024
Interpreting Key Mechanisms of Factual Recall in Transformer-Based Language Models
Ang Lv
Yuhan Chen
Kaiyi Zhang
Yulong Wang
Lifeng Liu
Ji-Rong Wen
Jian Xie
Rui Yan
KELM
273
23
0
28 Mar 2024
Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms
Michael Hanna
Sandro Pezzelle
Yonatan Belinkov
264
76
0
26 Mar 2024
Monotonic Representation of Numeric Properties in Language Models
Benjamin Heinzerling
Kentaro Inui
KELM, MILM
204
12
0
15 Mar 2024
Diffusion Lens: Interpreting Text Encoders in Text-to-Image Pipelines
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Michael Toker
Hadas Orgad
Mor Ventura
Dana Arad
Yonatan Belinkov
DiffM
253
20
0
09 Mar 2024
The Heuristic Core: Understanding Subnetwork Generalization in Pretrained Language Models
Adithya Bhaskar
Dan Friedman
Danqi Chen
337
9
0
06 Mar 2024
How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning
Subhabrata Dutta
Joykirat Singh
Soumen Chakrabarti
Tanmoy Chakraborty
LRM
177
47
0
28 Feb 2024
Cutting Off the Head Ends the Conflict: A Mechanism for Interpreting and Mitigating Knowledge Conflicts in Language Models
Zhuoran Jin
Pengfei Cao
Hongbang Yuan
Yubo Chen
Jiexin Xu
Huaijun Li
Xiaojian Jiang
Kang Liu
Jun Zhao
497
68
0
28 Feb 2024
Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic Interpretability: A Case Study on Othello-GPT
Zhengfu He
Xuyang Ge
Qiong Tang
Tianxiang Sun
Qinyuan Cheng
Xipeng Qiu
199
25
0
19 Feb 2024
Learning Interpretable Concepts: Unifying Causal Representation Learning and Foundation Models
Goutham Rajendran
Simon Buchholz
Bryon Aragam
Bernhard Schölkopf
Pradeep Ravikumar
AI4CE
374
29
0
14 Feb 2024
Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models
International Conference on Machine Learning (ICML), 2024
Asma Ghandeharioun
Avi Caciularu
Adam Pearce
Lucas Dixon
Mor Geva
599
156
0
11 Jan 2024
Neuron-Level Knowledge Attribution in Large Language Models
Zeping Yu
Sophia Ananiadou
FAtt, KELM
250
28
0
19 Dec 2023
Forbidden Facts: An Investigation of Competing Objectives in Llama-2
Tony T. Wang
Miles Wang
Kaivu Hariharan
Nir Shavit
139
2
0
14 Dec 2023
An Adversarial Example for Direct Logit Attribution: Memory Management in gelu-4l
BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (BlackboxNLP), 2023
James Dao
Yeu-Tong Lau
Can Rager
Jett Janiak
315
5
0
11 Oct 2023
Polysemanticity and Capacity in Neural Networks
Adam Scherlis
Kshitij Sachan
Adam Jermyn
Joe Benton
Buck Shlegeris
MILM
528
48
0
04 Oct 2022
Discovering the Compositional Structure of Vector Representations with Role Learning Networks
BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (BlackboxNLP), 2019
Paul Soulos
R. Thomas McCoy
Tal Linzen
P. Smolensky
CoGe
331
46
0
21 Oct 2019