Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2407.08734
Cited By
Transformer Circuit Faithfulness Metrics are not Robust
11 July 2024
Joseph Miller
Bilal Chughtai
William Saunders
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Transformer Circuit Faithfulness Metrics are not Robust"
11 / 11 papers shown
Title
Scaling sparse feature circuit finding for in-context learning
Dmitrii Kharlapenko
S. Kamath S
Fazl Barez
Arthur Conmy
Neel Nanda
23
0
0
18 Apr 2025
Are formal and functional linguistic mechanisms dissociated in language models?
Michael Hanna
Sandro Pezzelle
Yonatan Belinkov
41
0
0
14 Mar 2025
Mechanistic Unveiling of Transformer Circuits: Self-Influence as a Key to Model Reasoning
L. Zhang
Lijie Hu
Di Wang
LRM
83
0
0
17 Feb 2025
The Representation and Recall of Interwoven Structured Knowledge in LLMs: A Geometric and Layered Analysis
Ge Lei
Samuel J. Cooper
KELM
40
0
0
15 Feb 2025
The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability
Aaron Mueller
Jannik Brinkmann
Millicent Li
Samuel Marks
Koyena Pal
...
Arnab Sen Sharma
Jiuding Sun
Eric Todd
David Bau
Yonatan Belinkov
CML
35
18
0
02 Aug 2024
Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms
Michael Hanna
Sandro Pezzelle
Yonatan Belinkov
51
33
0
26 Mar 2024
AtP*: An efficient and scalable method for localizing LLM behaviour to components
János Kramár
Tom Lieberum
Rohin Shah
Neel Nanda
KELM
41
42
0
01 Mar 2024
How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model
Michael Hanna
Ollie Liu
Alexandre Variengien
LRM
178
116
0
30 Apr 2023
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations
Atticus Geiger
Zhengxuan Wu
Christopher Potts
Thomas F. Icard
Noah D. Goodman
CML
73
98
0
05 Mar 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang
Alexandre Variengien
Arthur Conmy
Buck Shlegeris
Jacob Steinhardt
210
486
0
01 Nov 2022
In-context Learning and Induction Heads
Catherine Olsson
Nelson Elhage
Neel Nanda
Nicholas Joseph
Nova Dassarma
...
Tom B. Brown
Jack Clark
Jared Kaplan
Sam McCandlish
C. Olah
237
453
0
24 Sep 2022
1