Finding Transformer Circuits with Edge Pruning
Adithya Bhaskar, Alexander Wettig, Dan Friedman, Danqi Chen
24 June 2024 · arXiv:2406.16778

Papers citing "Finding Transformer Circuits with Edge Pruning" (19 papers)

Scaling sparse feature circuit finding for in-context learning
Dmitrii Kharlapenko, S. Kamath S, Fazl Barez, Arthur Conmy, Neel Nanda
18 Apr 2025

Are formal and functional linguistic mechanisms dissociated in language models?
Michael Hanna, Sandro Pezzelle, Yonatan Belinkov
14 Mar 2025

An explainable transformer circuit for compositional generalization
Cheng Tang, Brenden Lake, Mehrdad Jazayeri
19 Feb 2025 · LRM

Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis
X. Wang, Yan Hu, Wenyu Du, Reynold Cheng, Benyou Wang, Difan Zou
17 Feb 2025

The Validation Gap: A Mechanistic Analysis of How Language Models Compute Arithmetic but Fail to Validate It
Leonardo Bertolazzi, Philipp Mondorf, Barbara Plank, Raffaella Bernardi
17 Feb 2025 · AIFin, LRM

EfficientLLM: Scalable Pruning-Aware Pretraining for Architecture-Agnostic Edge Language Models
Xingrun Xing, Zheng Liu, Shitao Xiao, Boyan Gao, Yiming Liang, Wanpeng Zhang, Haokun Lin, Guoqi Li, Jiajun Zhang
10 Feb 2025 · LRM

Extracting Interpretable Task-Specific Circuits from Large Language Models for Faster Inference
Jorge García-Carrasco, A. Maté, Juan Trujillo
20 Dec 2024

Activation Scaling for Steering and Interpreting Language Models
Niklas Stoehr, Kevin Du, Vésteinn Snæbjarnarson, Robert West, Ryan Cotterell, Aaron Schein
07 Oct 2024 · LLMSV, LRM

Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models
Philipp Mondorf, Sondre Wold, Barbara Plank
02 Oct 2024

Optimal ablation for interpretability
Maximilian Li, Lucas Janson
16 Sep 2024 · FAtt

A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov, Ziyu Yao
02 Jul 2024

Knowledge Circuits in Pretrained Transformers
Yunzhi Yao, Ningyu Zhang, Zekun Xi, Meng Wang, Ziwen Xu, Shumin Deng, Huajun Chen
28 May 2024 · KELM

Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms
Michael Hanna, Sandro Pezzelle, Yonatan Belinkov
26 Mar 2024

AtP*: An efficient and scalable method for localizing LLM behaviour to components
János Kramár, Tom Lieberum, Rohin Shah, Neel Nanda
01 Mar 2024 · KELM

Attention Lens: A Tool for Mechanistically Interpreting the Attention Head Information Retrieval Mechanism
Mansi Sakarvadia, Arham Khan, Aswathy Ajith, Daniel Grzenda, Nathaniel Hudson, André Bauer, Kyle Chard, Ian T. Foster
25 Oct 2023

Attribution Patching Outperforms Automated Circuit Discovery
Aaquib Syed, Can Rager, Arthur Conmy
16 Oct 2023

How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model
Michael Hanna, Ollie Liu, Alexandre Variengien
30 Apr 2023 · LRM

Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations
Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas F. Icard, Noah D. Goodman
05 Mar 2023 · CML

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt
01 Nov 2022