Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2311.04354
Cited By
Uncovering Intermediate Variables in Transformers using Circuit Probing
7 November 2023
Michael A. Lepori
Thomas Serre
Ellie Pavlick
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Uncovering Intermediate Variables in Transformers using Circuit Probing"
12 / 12 papers shown
Title
The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability
Aaron Mueller
Jannik Brinkmann
Millicent Li
Samuel Marks
Koyena Pal
...
Arnab Sen Sharma
Jiuding Sun
Eric Todd
David Bau
Yonatan Belinkov
CML
25
18
0
02 Aug 2024
Mechanistically Interpreting a Transformer-based 2-SAT Solver: An Axiomatic Approach
Nils Palumbo
Ravi Mangal
Zifan Wang
Saranya Vijayakumar
Corina S. Pasareanu
Somesh Jha
30
1
0
18 Jul 2024
Position: An Inner Interpretability Framework for AI Inspired by Lessons from Cognitive Neuroscience
Martina G. Vilas
Federico Adolfi
David Poeppel
Gemma Roig
23
5
0
03 Jun 2024
pyvene: A Library for Understanding and Improving PyTorch Models via Interventions
Zhengxuan Wu
Atticus Geiger
Aryaman Arora
Jing-ling Huang
Zheng Wang
Noah D. Goodman
Christopher D. Manning
Christopher Potts
MU
32
25
0
12 Mar 2024
Observable Propagation: Uncovering Feature Vectors in Transformers
Jacob Dunefsky
Arman Cohan
16
1
0
26 Dec 2023
How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model
Michael Hanna
Ollie Liu
Alexandre Variengien
LRM
167
116
0
30 Apr 2023
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations
Atticus Geiger
Zhengxuan Wu
Christopher Potts
Thomas F. Icard
Noah D. Goodman
CML
73
98
0
05 Mar 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang
Alexandre Variengien
Arthur Conmy
Buck Shlegeris
Jacob Steinhardt
205
486
0
01 Nov 2022
In-context Learning and Induction Heads
Catherine Olsson
Nelson Elhage
Neel Nanda
Nicholas Joseph
Nova Dassarma
...
Tom B. Brown
Jack Clark
Jared Kaplan
Sam McCandlish
C. Olah
232
453
0
24 Sep 2022
Linear Adversarial Concept Erasure
Shauli Ravfogel
Michael Twiton
Yoav Goldberg
Ryan Cotterell
KELM
62
56
0
28 Jan 2022
Quantifying Local Specialization in Deep Neural Networks
Shlomi Hod
Daniel Filan
Stephen Casper
Andrew Critch
Stuart J. Russell
37
10
0
13 Oct 2021
Probing Classifiers: Promises, Shortcomings, and Advances
Yonatan Belinkov
216
291
0
24 Feb 2021
1