Progress measures for grokking via mechanistic interpretability

12 January 2023

Papers citing "Progress measures for grokking via mechanistic interpretability"

50 / 63 papers shown

Understanding In-context Learning of Addition via Activation Subspaces Xinyan Hu Kayo Yin Michael I. Jordan Jacob Steinhardt Lijie Chen 49 0 0 08 May 2025
Quiet Feature Learning in Algorithmic Tasks Prudhviraj Naidu Zixian Wang Leon Bergen R. Paturi VLM 52 0 0 06 May 2025
Contextures: Representations from Contexts Runtian Zhai Kai Yang Che-Ping Tsai Burak Varici Zico Kolter Pradeep Ravikumar 60 0 0 02 May 2025
A Mathematical Philosophy of Explanations in Mechanistic Interpretability -- The Strange Science Part I.i Kola Ayonrinde Louis Jaburi MILM 82 1 0 01 May 2025
Jekyll-and-Hyde Tipping Point in an AI's Behavior Neil F. Johnson Frank Yingjie Huo 46 0 0 29 Apr 2025
Grokking in the Wild: Data Augmentation for Real-World Multi-Hop Reasoning with Transformers Roman Abramov Felix Steinbauer Gjergji Kasneci 78 0 0 29 Apr 2025
Reinforcement Learning for Reasoning in Large Language Models with One Training Example Yiping Wang Qing Yang Zhiyuan Zeng Liliang Ren L. Liu ... Jianfeng Gao Weizhu Chen S. Wang Simon S. Du Yelong Shen OffRL ReLM LRM 110 2 0 29 Apr 2025
Prisma: An Open Source Toolkit for Mechanistic Interpretability in Vision and Video Sonia Joseph Praneet Suresh Lorenz Hufe Edward Stevinson Robert Graham Yash Vadi Danilo Bzdok Sebastian Lapuschkin Lee Sharkey Blake A. Richards 72 0 0 28 Apr 2025
Studying Small Language Models with Susceptibilities Garrett Baker George Wang Jesse Hoogland Daniel Murfet AAML 73 1 0 25 Apr 2025
MIB: A Mechanistic Interpretability Benchmark Aaron Mueller Atticus Geiger Sarah Wiegreffe Dana Arad Iván Arcuschin ... Alessandro Stolfo Martin Tutek Amir Zur David Bau Yonatan Belinkov 41 1 0 17 Apr 2025
Towards Combinatorial Interpretability of Neural Computation Micah Adler Dan Alistarh Nir Shavit FAtt 87 1 0 10 Apr 2025
Low Rank and Sparse Fourier Structure in Recurrent Networks Trained on Modular Addition Akshay Rangamani 40 0 0 28 Mar 2025
Implicit Reasoning in Transformers is Reasoning through Shortcuts Tianhe Lin Jian Xie Siyu Yuan Deqing Yang ReLM LRM 66 2 0 10 Mar 2025
Early Stopping Against Label Noise Without Validation Data Suqin Yuan Lei Feng Tongliang Liu NoLa 93 14 0 11 Feb 2025
Modular Training of Neural Networks aids Interpretability Satvik Golechha Maheep Chaudhary Joan Velja Alessandro Abate Nandi Schoots 74 0 0 04 Feb 2025
Physics of Skill Learning Ziming Liu Yizhou Liu Eric J. Michaud Jeff Gore Max Tegmark 41 0 0 21 Jan 2025
Episodic memory in AI agents poses risks that should be studied and mitigated Chad DeChant 57 1 0 20 Jan 2025
Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words Gouki Minegishi Hiroki Furuta Yusuke Iwasawa Y. Matsuo 49 1 0 09 Jan 2025
How to explain grokking S. V. Kozyrev AI4CE 24 0 0 03 Jan 2025
Out-of-distribution generalization via composition: a lens through induction heads in Transformers Jiajun Song Zhuoyan Xu Yiqiao Zhong 78 4 0 31 Dec 2024
Tracking the Feature Dynamics in LLM Training: A Mechanistic Study Yang Xu Y. Wang Hao Wang 77 1 0 23 Dec 2024
Interacting Large Language Model Agents. Interpretable Models and Social Learning Adit Jain Vikram Krishnamurthy LLMAG 28 0 0 02 Nov 2024
Analyzing (In)Abilities of SAEs via Formal Languages Abhinav Menon Manish Shrivastava David M. Krueger Ekdeep Singh Lubana 42 7 0 15 Oct 2024
Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models Guobin Shen Dongcheng Zhao Yiting Dong Xiang-Yu He Yi Zeng AAML 45 0 0 03 Oct 2024
Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models Philipp Mondorf Sondre Wold Barbara Plank 29 0 0 02 Oct 2024
Exploring Information-Theoretic Metrics Associated with Neural Collapse in Supervised Training Kun Song Zhiquan Tan Bochao Zou Jiansheng Chen Huimin Ma Weiran Huang 37 0 0 25 Sep 2024
Language Models "Grok" to Copy Ang Lv Ruobing Xie Xingwu Sun Zhanhui Kang Rui Yan LLMAG 41 1 0 14 Sep 2024
Information-Theoretic Progress Measures reveal Grokking is an Emergent Phase Transition Kenzo Clauw S. Stramaglia Daniele Marinazzo 50 3 0 16 Aug 2024
A Mechanistic Interpretation of Syllogistic Reasoning in Auto-Regressive Language Models Geonhee Kim Marco Valentino André Freitas LRM AI4CE 28 7 0 16 Aug 2024
Probabilistic Parameter Estimators and Calibration Metrics for Pose Estimation from Image Features Romeo Valentin Sydney M. Katz Joonghyun Lee Don Walker Matthew Sorgenfrei Mykel J. Kochenderfer 24 0 0 23 Jul 2024
Representing Rule-based Chatbots with Transformers Dan Friedman Abhishek Panigrahi Danqi Chen 59 1 0 15 Jul 2024
Transformer Circuit Faithfulness Metrics are not Robust Joseph Miller Bilal Chughtai William Saunders 45 7 0 11 Jul 2024
Frequency and Generalisation of Periodic Activation Functions in Reinforcement Learning Augustine N. Mavor-Parker Matthew J. Sargent Caswell Barry Lewis D. Griffin Clare Lyle 37 2 0 09 Jul 2024
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models Daking Rai Yilun Zhou Shi Feng Abulhair Saparov Ziyu Yao 75 18 0 02 Jul 2024
Does ChatGPT Have a Mind? Simon Goldstein B. Levinstein AI4MH LRM 24 5 0 27 Jun 2024
Interpreting the Second-Order Effects of Neurons in CLIP Yossi Gandelsman Alexei A. Efros Jacob Steinhardt MILM 48 16 0 06 Jun 2024
LOLAMEME: Logic, Language, Memory, Mechanistic Framework Jay Desai Xiaobo Guo Srinivasan H. Sengamedu 14 0 0 31 May 2024
Standards for Belief Representations in LLMs Daniel A. Herrmann B. Levinstein 34 6 0 31 May 2024
Survival of the Fittest Representation: A Case Study with Modular Addition Xiaoman Delores Ding Zifan Carl Guo Eric J. Michaud Ziming Liu Max Tegmark 34 3 0 27 May 2024
Bayesian RG Flow in Neural Network Field Theories Jessica N. Howard Marc S. Klinger Anindita Maiti A. G. Stapleton 60 1 0 27 May 2024
KAN: Kolmogorov-Arnold Networks Ziming Liu Yixuan Wang Sachin Vaidya Fabian Ruehle James Halverson Marin Soljacic Thomas Y. Hou Max Tegmark 72 460 0 30 Apr 2024
Opening the AI black box: program synthesis via mechanistic interpretability Eric J. Michaud Isaac Liao Vedang Lad Ziming Liu Anish Mudide Chloe Loughridge Zifan Carl Guo Tara Rezaei Kheirkhah Mateja Vukelić Max Tegmark 23 12 0 07 Feb 2024
Black-Box Access is Insufficient for Rigorous AI Audits Stephen Casper Carson Ezell Charlotte Siegmann Noam Kolt Taylor Lynn Curtis ... Michael Gerovitch David Bau Max Tegmark David M. Krueger Dylan Hadfield-Menell AAML 13 76 0 25 Jan 2024
Carrying over algorithm in transformers J. Kruthoff 19 0 0 15 Jan 2024
ALMANACS: A Simulatability Benchmark for Language Model Explainability Edmund Mills Shiye Su Stuart J. Russell Scott Emmons 43 7 0 20 Dec 2023
Forbidden Facts: An Investigation of Competing Objectives in Llama-2 Tony T. Wang Miles Wang Kaivu Hariharan Nir Shavit 16 2 0 14 Dec 2023
FlexModel: A Framework for Interpretability of Distributed Large Language Models Matthew Choi Muhammad Adil Asif John Willes David Emerson AI4CE ALM 14 1 0 05 Dec 2023
A Glitch in the Matrix? Locating and Detecting Language Model Grounding with Fakepedia Giovanni Monea Maxime Peyrard Martin Josifoski Vishrav Chaudhary Jason Eisner Emre Kiciman Hamid Palangi Barun Patra Robert West KELM 47 12 0 04 Dec 2023
Labeling Neural Representations with Inverse Recognition Kirill Bykov Laura Kopf Shinichi Nakajima Marius Kloft Marina M.-C. Höhne BDL 19 15 0 22 Nov 2023
Compositional Capabilities of Autoregressive Transformers: A Study on Synthetic, Interpretable Tasks Rahul Ramesh Ekdeep Singh Lubana Mikail Khona Robert P. Dick Hidenori Tanaka CoGe 27 6 0 21 Nov 2023