Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2302.03025
Cited By
A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations
6 February 2023
Bilal Chughtai
Lawrence Chan
Neel Nanda
Re-assign community
ArXiv
PDF
HTML
Papers citing
"A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations"
15 / 15 papers shown
Title
Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability -- The Strange Science Part I.ii
Kola Ayonrinde
Louis Jaburi
XAI
67
1
0
02 May 2025
A Mathematical Philosophy of Explanations in Mechanistic Interpretability -- The Strange Science Part I.i
Kola Ayonrinde
Louis Jaburi
MILM
76
1
0
01 May 2025
Shared Global and Local Geometry of Language Model Embeddings
Andrew Lee
Melanie Weber
F. Viégas
Martin Wattenberg
FedML
71
1
0
27 Mar 2025
Implicit Reasoning in Transformers is Reasoning through Shortcuts
Tianhe Lin
Jian Xie
Siyu Yuan
Deqing Yang
ReLM
LRM
64
2
0
10 Mar 2025
Relative Representations: Topological and Geometric Perspectives
Alejandro García-Castellanos
G. Marchetti
Danica Kragic
Martina Scolamiero
48
0
0
17 Sep 2024
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
Daking Rai
Yilun Zhou
Shi Feng
Abulhair Saparov
Ziyu Yao
70
18
0
02 Jul 2024
What Do VLMs NOTICE? A Mechanistic Interpretability Pipeline for Gaussian-Noise-free Text-Image Corruption and Evaluation
Michal Golovanevsky
William Rudman
Vedant Palit
Ritambhara Singh
Carsten Eickhoff
31
1
0
24 Jun 2024
Survival of the Fittest Representation: A Case Study with Modular Addition
Xiaoman Delores Ding
Zifan Carl Guo
Eric J. Michaud
Ziming Liu
Max Tegmark
29
3
0
27 May 2024
Compositional Capabilities of Autoregressive Transformers: A Study on Synthetic, Interpretable Tasks
Rahul Ramesh
Ekdeep Singh Lubana
Mikail Khona
Robert P. Dick
Hidenori Tanaka
CoGe
22
6
0
21 Nov 2023
Uncovering Intermediate Variables in Transformers using Circuit Probing
Michael A. Lepori
Thomas Serre
Ellie Pavlick
70
7
0
07 Nov 2023
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods
Fred Zhang
Neel Nanda
LLMSV
21
95
0
27 Sep 2023
Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP
Vedant Palit
Rohan Pandey
Aryaman Arora
Paul Pu Liang
16
20
0
27 Aug 2023
Localizing Model Behavior with Path Patching
Nicholas W. Goldowsky-Dill
Chris MacLeod
L. Sato
Aryaman Arora
8
84
0
12 Apr 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang
Alexandre Variengien
Arthur Conmy
Buck Shlegeris
Jacob Steinhardt
210
486
0
01 Nov 2022
In-context Learning and Induction Heads
Catherine Olsson
Nelson Elhage
Neel Nanda
Nicholas Joseph
Nova Dassarma
...
Tom B. Brown
Jack Clark
Jared Kaplan
Sam McCandlish
C. Olah
240
453
0
24 Sep 2022
1