LLM Circuit Analyses Are Consistent Across Training and Scale
Curt Tigges, Michael Hanna, Qinan Yu, Stella Biderman
15 July 2024 · arXiv:2407.10827
Papers citing "LLM Circuit Analyses Are Consistent Across Training and Scale" (16 papers):
Bigram Subnetworks: Mapping to Next Tokens in Transformer Language Models
Tyler A. Chang, Benjamin Bergen (21 Apr 2025)

Geometric Signatures of Compositionality Across a Language Model's Lifetime
Jin Hwa Lee, Thomas Jiralerspong, Lei Yu, Yoshua Bengio, Emily Cheng (02 Oct 2024)

A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov, Ziyu Yao (02 Jul 2024)

What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation
Aaditya K. Singh, Ted Moskovitz, Felix Hill, Stephanie C. Y. Chan, Andrew M. Saxe (10 Apr 2024)

Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms
Michael Hanna, Sandro Pezzelle, Yonatan Belinkov (26 Mar 2024)

AtP*: An efficient and scalable method for localizing LLM behaviour to components
János Kramár, Tom Lieberum, Rohin Shah, Neel Nanda (01 Mar 2024)

Information Flow Routes: Automatically Interpreting Language Models at Scale
Javier Ferrando, Elena Voita (27 Feb 2024)

OLMo: Accelerating the Science of Language Models
Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Michael Kinney, ..., Jesse Dodge, Kyle Lo, Luca Soldaini, Noah A. Smith, Hanna Hajishirzi (01 Feb 2024)

Attribution Patching Outperforms Automated Circuit Discovery
Aaquib Syed, Can Rager, Arthur Conmy (16 Oct 2023)

Finding Neurons in a Haystack: Case Studies with Sparse Probing
Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, Dimitris Bertsimas (02 May 2023)

How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model
Michael Hanna, Ollie Liu, Alexandre Variengien (30 Apr 2023)

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt (01 Nov 2022)

In-context Learning and Induction Heads
Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova Dassarma, ..., Tom B. Brown, Jack Clark, Jared Kaplan, Sam McCandlish, C. Olah (24 Sep 2022)

Word Acquisition in Neural Language Models
Tyler A. Chang, Benjamin Bergen (05 Oct 2021)

Probing Classifiers: Promises, Shortcomings, and Advances
Yonatan Belinkov (24 Feb 2021)

Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, T. Henighan, Tom B. Brown, B. Chess, R. Child, Scott Gray, Alec Radford, Jeff Wu, Dario Amodei (23 Jan 2020)