Uncovering Intermediate Variables in Transformers using Circuit Probing

7 November 2023

Papers citing "Uncovering Intermediate Variables in Transformers using Circuit Probing"

Title | Authors | Tag | Counts | Date
The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability | Aaron Mueller, Jannik Brinkmann, Millicent Li, Samuel Marks, Koyena Pal, ..., Arnab Sen Sharma, Jiuding Sun, Eric Todd, David Bau, Yonatan Belinkov | CML | 25 18 0 | 02 Aug 2024
Mechanistically Interpreting a Transformer-based 2-SAT Solver: An Axiomatic Approach | Nils Palumbo, Ravi Mangal, Zifan Wang, Saranya Vijayakumar, Corina S. Pasareanu, Somesh Jha | | 30 1 0 | 18 Jul 2024
Position: An Inner Interpretability Framework for AI Inspired by Lessons from Cognitive Neuroscience | Martina G. Vilas, Federico Adolfi, David Poeppel, Gemma Roig | | 23 5 0 | 03 Jun 2024
pyvene: A Library for Understanding and Improving PyTorch Models via Interventions | Zhengxuan Wu, Atticus Geiger, Aryaman Arora, Jing-ling Huang, Zheng Wang, Noah D. Goodman, Christopher D. Manning, Christopher Potts | MU | 32 25 0 | 12 Mar 2024
Observable Propagation: Uncovering Feature Vectors in Transformers | Jacob Dunefsky, Arman Cohan | | 16 1 0 | 26 Dec 2023
How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model | Michael Hanna, Ollie Liu, Alexandre Variengien | LRM | 167 116 0 | 30 Apr 2023
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations | Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas F. Icard, Noah D. Goodman | CML | 73 98 0 | 05 Mar 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small | Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt | | 205 486 0 | 01 Nov 2022
In-context Learning and Induction Heads | Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova Dassarma, ..., Tom B. Brown, Jack Clark, Jared Kaplan, Sam McCandlish, C. Olah | | 232 453 0 | 24 Sep 2022
Linear Adversarial Concept Erasure | Shauli Ravfogel, Michael Twiton, Yoav Goldberg, Ryan Cotterell | KELM | 62 56 0 | 28 Jan 2022
Quantifying Local Specialization in Deep Neural Networks | Shlomi Hod, Daniel Filan, Stephen Casper, Andrew Critch, Stuart J. Russell | | 37 10 0 | 13 Oct 2021
Probing Classifiers: Promises, Shortcomings, and Advances | Yonatan Belinkov | | 216 291 0 | 24 Feb 2021