A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations

6 February 2023

Papers citing "A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations"

Title | Authors | Tags | Counts | Date
Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability -- The Strange Science Part I.ii | Kola Ayonrinde, Louis Jaburi | XAI | 67 1 0 | 02 May 2025
A Mathematical Philosophy of Explanations in Mechanistic Interpretability -- The Strange Science Part I.i | Kola Ayonrinde, Louis Jaburi | MILM | 76 1 0 | 01 May 2025
Shared Global and Local Geometry of Language Model Embeddings | Andrew Lee, Melanie Weber, F. Viégas, Martin Wattenberg | FedML | 71 1 0 | 27 Mar 2025
Implicit Reasoning in Transformers is Reasoning through Shortcuts | Tianhe Lin, Jian Xie, Siyu Yuan, Deqing Yang | ReLM, LRM | 64 2 0 | 10 Mar 2025
Relative Representations: Topological and Geometric Perspectives | Alejandro García-Castellanos, G. Marchetti, Danica Kragic, Martina Scolamiero | | 48 0 0 | 17 Sep 2024
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models | Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov, Ziyu Yao | | 70 18 0 | 02 Jul 2024
What Do VLMs NOTICE? A Mechanistic Interpretability Pipeline for Gaussian-Noise-free Text-Image Corruption and Evaluation | Michal Golovanevsky, William Rudman, Vedant Palit, Ritambhara Singh, Carsten Eickhoff | | 31 1 0 | 24 Jun 2024
Survival of the Fittest Representation: A Case Study with Modular Addition | Xiaoman Delores Ding, Zifan Carl Guo, Eric J. Michaud, Ziming Liu, Max Tegmark | | 29 3 0 | 27 May 2024
Compositional Capabilities of Autoregressive Transformers: A Study on Synthetic, Interpretable Tasks | Rahul Ramesh, Ekdeep Singh Lubana, Mikail Khona, Robert P. Dick, Hidenori Tanaka | CoGe | 22 6 0 | 21 Nov 2023
Uncovering Intermediate Variables in Transformers using Circuit Probing | Michael A. Lepori, Thomas Serre, Ellie Pavlick | | 70 7 0 | 07 Nov 2023
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods | Fred Zhang, Neel Nanda | LLMSV | 21 95 0 | 27 Sep 2023
Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP | Vedant Palit, Rohan Pandey, Aryaman Arora, Paul Pu Liang | | 16 20 0 | 27 Aug 2023
Localizing Model Behavior with Path Patching | Nicholas W. Goldowsky-Dill, Chris MacLeod, L. Sato, Aryaman Arora | | 8 84 0 | 12 Apr 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small | Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt | | 210 486 0 | 01 Nov 2022
In-context Learning and Induction Heads | Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova Dassarma, ..., Tom B. Brown, Jack Clark, Jared Kaplan, Sam McCandlish, C. Olah | | 240 453 0 | 24 Sep 2022