ResearchTrend.AI
© 2025 ResearchTrend.AI, All rights reserved.

arXiv:2304.05969 · Cited By
Localizing Model Behavior with Path Patching
12 April 2023
Nicholas W. Goldowsky-Dill, Chris MacLeod, L. Sato, Aryaman Arora

Papers citing "Localizing Model Behavior with Path Patching"

25 / 75 papers shown
Transformer Mechanisms Mimic Frontostriatal Gating Operations When Trained on Human Working Memory Tasks
Aaron Traylor, Jack Merullo, Michael J. Frank, Ellie Pavlick
13 Feb 2024

Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models
Asma Ghandeharioun, Avi Caciularu, Adam Pearce, Lucas Dixon, Mor Geva
11 Jan 2024

Observable Propagation: Uncovering Feature Vectors in Transformers
Jacob Dunefsky, Arman Cohan
26 Dec 2023

Neuron-Level Knowledge Attribution in Large Language Models
Zeping Yu, Sophia Ananiadou
Topics: FAtt, KELM
19 Dec 2023

Successor Heads: Recurring, Interpretable Attention Heads In The Wild
Rhys Gould, Euan Ong, George Ogden, Arthur Conmy
Topics: LRM
14 Dec 2023

Look Before You Leap: A Universal Emergent Decomposition of Retrieval Tasks in Language Models
Alexandre Variengien, Eric Winsor
Topics: LRM, ReLM
13 Dec 2023

Interpretability Illusions in the Generalization of Simplified Models
Dan Friedman, Andrew Kyle Lampinen, Lucas Dixon, Danqi Chen, Asma Ghandeharioun
06 Dec 2023

Do Localization Methods Actually Localize Memorized Data in LLMs? A Tale of Two Benchmarks
Ting-Yun Chang, Jesse Thomason, Robin Jia
15 Nov 2023

Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models
Michael Lan, Phillip H. S. Torr, Fazl Barez
Topics: LRM
07 Nov 2023

Circuit Component Reuse Across Tasks in Transformer Language Models
Jack Merullo, Carsten Eickhoff, Ellie Pavlick
12 Oct 2023

An Adversarial Example for Direct Logit Attribution: Memory Management in gelu-4l
James Dao, Yeu-Tong Lau, Can Rager, Jett Janiak
11 Oct 2023

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
Samuel Marks, Max Tegmark
Topics: HILM
10 Oct 2023

Copy Suppression: Comprehensively Understanding an Attention Head
Callum McDougall, Arthur Conmy, Cody Rushing, Thomas McGrath, Neel Nanda
Topics: MILM
06 Oct 2023

Discovering Knowledge-Critical Subnetworks in Pretrained Language Models
Deniz Bayazit, Negar Foroutan, Zeming Chen, Gail Weiss, Antoine Bosselut
Topics: KELM
04 Oct 2023

Towards Best Practices of Activation Patching in Language Models: Metrics and Methods
Fred Zhang, Neel Nanda
Topics: LLMSV
27 Sep 2023

Circuit Breaking: Removing Model Behaviors with Targeted Ablation
Maximilian Li, Xander Davies, Max Nadeau
Topics: KELM, MU
12 Sep 2023

Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP
Vedant Palit, Rohan Pandey, Aryaman Arora, Paul Pu Liang
27 Aug 2023

Finding Neurons in a Haystack: Case Studies with Sparse Probing
Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, Dimitris Bertsimas
Topics: MILM
02 May 2023

How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model
Michael Hanna, Ollie Liu, Alexandre Variengien
Topics: LRM
30 Apr 2023

Towards Automated Circuit Discovery for Mechanistic Interpretability
Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, Adrià Garriga-Alonso
28 Apr 2023

Computational modeling of semantic change
Nina Tahmasebi, Haim Dubossarsky
13 Apr 2023

Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations
Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas F. Icard, Noah D. Goodman
Topics: CML
05 Mar 2023

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt
01 Nov 2022

Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers
Yi Tay, Mostafa Dehghani, J. Rao, W. Fedus, Samira Abnar, Hyung Won Chung, Sharan Narang, Dani Yogatama, Ashish Vaswani, Donald Metzler
22 Sep 2021

Shortformer: Better Language Modeling using Shorter Inputs
Ofir Press, Noah A. Smith, M. Lewis
31 Dec 2020