Tracr: Compiled Transformers as a Laboratory for Interpretability

12 January 2023

Sebastian Farquhar

Vladimir Mikulik

Papers citing "Tracr: Compiled Transformers as a Laboratory for Interpretability"

Title | Authors | Tags and counts | Date
Looped ReLU MLPs May Be All You Need as Practical Programmable Computers | Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao-quan Song, Yufa Zhou | 89 18 0 | 21 Feb 2025
On the Role of Attention Heads in Large Language Model Safety | Z. Zhou, Haiyang Yu, Xinghua Zhang, Rongwu Xu, Fei Huang, Kun Wang, Yang Liu, Junfeng Fang, Yongbin Li | 57 5 0 | 17 Oct 2024
Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models | Philipp Mondorf, Sondre Wold, Barbara Plank | 29 0 0 | 02 Oct 2024
Representing Rule-based Chatbots with Transformers | Dan Friedman, Abhishek Panigrahi, Danqi Chen | 59 1 0 | 15 Jul 2024
Transformer Circuit Faithfulness Metrics are not Robust | Joseph Miller, Bilal Chughtai, William Saunders | 45 7 0 | 11 Jul 2024
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models | Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov, Ziyu Yao | 75 18 0 | 02 Jul 2024
Finding Transformer Circuits with Edge Pruning | Adithya Bhaskar, Alexander Wettig, Dan Friedman, Danqi Chen | 58 14 0 | 24 Jun 2024
Separations in the Representational Capabilities of Transformers and Recurrent Architectures | S. Bhattamishra, Michael Hahn, Phil Blunsom, Varun Kanade | GNN 28 9 0 | 13 Jun 2024
Counting Like Transformers: Compiling Temporal Counting Logic Into Softmax Transformers | Andy Yang, David Chiang | 31 7 0 | 05 Apr 2024
Opening the AI black box: program synthesis via mechanistic interpretability | Eric J. Michaud, Isaac Liao, Vedang Lad, Ziming Liu, Anish Mudide, Chloe Loughridge, Zifan Carl Guo, Tara Rezaei Kheirkhah, Mateja Vukelić, Max Tegmark | 23 12 0 | 07 Feb 2024
3VL: Using Trees to Improve Vision-Language Models' Interpretability | Nir Yellinek, Leonid Karlinsky, Raja Giryes | CoGe VLM 49 4 0 | 28 Dec 2023
Localizing Model Behavior with Path Patching | Nicholas W. Goldowsky-Dill, Chris MacLeod, L. Sato, Aryaman Arora | 8 85 0 | 12 Apr 2023
Tighter Bounds on the Expressivity of Transformer Encoders | David Chiang, Peter A. Cholak, A. Pillay | 24 53 0 | 25 Jan 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small | Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt | 210 491 0 | 01 Nov 2022
Polysemanticity and Capacity in Neural Networks | Adam Scherlis, Kshitij Sachan, Adam Jermyn, Joe Benton, Buck Shlegeris | MILM 133 25 0 | 04 Oct 2022
In-context Learning and Induction Heads | Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, ..., Tom B. Brown, Jack Clark, Jared Kaplan, Sam McCandlish, C. Olah | 240 456 0 | 24 Sep 2022
Toy Models of Superposition | Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, T. Henighan, ..., Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, C. Olah | AAML MILM 120 316 0 | 21 Sep 2022
Probing Classifiers: Promises, Shortcomings, and Advances | Yonatan Belinkov | 221 402 0 | 24 Feb 2021