Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2407.02646
Cited By
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
2 July 2024
Daking Rai
Yilun Zhou
Shi Feng
Abulhair Saparov
Ziyu Yao
Re-assign community
ArXiv
PDF
HTML
Papers citing
"A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models"
30 / 30 papers shown
Title
In-Context Learning can distort the relationship between sequence likelihoods and biological fitness
Pranav Kantroo
Günter P. Wagner
Benjamin B. Machta
25
0
0
23 Apr 2025
Understanding the Skill Gap in Recurrent Language Models: The Role of the Gather-and-Aggregate Mechanism
Aviv Bick
Eric P. Xing
Albert Gu
RALM
81
0
0
22 Apr 2025
Layers at Similar Depths Generate Similar Activations Across LLM Architectures
Christopher Wolfram
Aaron Schein
18
0
0
03 Apr 2025
Reasoning about Affordances: Causal and Compositional Reasoning in LLMs
Magnus F. Gjerde
Vanessa Cheung
David Lagnado
ReLM
LRM
39
0
0
23 Feb 2025
SAE-V: Interpreting Multimodal Models for Enhanced Alignment
Hantao Lou
Changye Li
Jiaming Ji
Yaodong Yang
34
0
0
22 Feb 2025
An explainable transformer circuit for compositional generalization
Cheng Tang
Brenden Lake
Mehrdad Jazayeri
LRM
33
0
0
19 Feb 2025
Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis
X. Wang
Yan Hu
Wenyu Du
Reynold Cheng
Benyou Wang
Difan Zou
48
0
0
17 Feb 2025
Deciphering Functions of Neurons in Vision-Language Models
Jiaqi Xu
Cuiling Lan
Xuejin Chen
Yan Lu
VLM
61
0
0
10 Feb 2025
Interpretable Language Modeling via Induction-head Ngram Models
Eunji Kim
Sriya Mantena
Weiwei Yang
Chandan Singh
Sungroh Yoon
Jianfeng Gao
29
0
0
31 Oct 2024
Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders
Viacheslav Surkov
Chris Wendler
Mikhail Terekhov
Justin Deschenaux
Robert West
Çağlar Gülçehre
VLM
21
12
0
28 Oct 2024
Enforcing Interpretability in Time Series Transformers: A Concept Bottleneck Framework
Angela van Sprang
Erman Acar
Willem Zuidema
AI4TS
22
1
0
08 Oct 2024
System 2 Reasoning Capabilities Are Nigh
Scott C. Lowe
VLM
LRM
19
0
0
04 Oct 2024
Listening to the Wise Few: Select-and-Copy Attention Heads for Multiple-Choice QA
Eduard Tulchinskii
Laida Kushnareva
Kristian Kuznetsov
Anastasia Voznyuk
Andrei Andriiainen
Irina Piontkovskaya
Evgeny Burnaev
Serguei Barannikov
51
1
0
03 Oct 2024
Locating and Editing Factual Associations in Mamba
Arnab Sen Sharma
David Atkinson
David Bau
KELM
62
16
0
04 Apr 2024
AtP*: An efficient and scalable method for localizing LLM behaviour to components
János Kramár
Tom Lieberum
Rohin Shah
Neel Nanda
KELM
28
40
0
01 Mar 2024
RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations
Jing-ling Huang
Zhengxuan Wu
Christopher Potts
Mor Geva
Atticus Geiger
38
24
0
27 Feb 2024
Unified View of Grokking, Double Descent and Emergent Abilities: A Perspective from Circuits Competition
Yufei Huang
Shengding Hu
Xu Han
Zhiyuan Liu
Maosong Sun
50
6
0
23 Feb 2024
Increasing Trust in Language Models through the Reuse of Verified Circuits
Philip Quirke
Clement Neo
Fazl Barez
KELM
LRM
23
3
0
04 Feb 2024
Universal Neurons in GPT2 Language Models
Wes Gurnee
Theo Horsley
Zifan Carl Guo
Tara Rezaei Kheirkhah
Qinyi Sun
Will Hathaway
Neel Nanda
Dimitris Bertsimas
MILM
75
37
0
22 Jan 2024
Attribution Patching Outperforms Automated Circuit Discovery
Aaquib Syed
Can Rager
Arthur Conmy
39
18
0
16 Oct 2023
Finding Neurons in a Haystack: Case Studies with Sparse Probing
Wes Gurnee
Neel Nanda
Matthew Pauly
Katherine Harvey
Dmitrii Troitskii
Dimitris Bertsimas
MILM
150
170
0
02 May 2023
Dissecting Recall of Factual Associations in Auto-Regressive Language Models
Mor Geva
Jasmijn Bastings
Katja Filippova
Amir Globerson
KELM
180
152
0
28 Apr 2023
Sparks of Artificial General Intelligence: Early experiments with GPT-4
Sébastien Bubeck
Varun Chandrasekaran
Ronen Eldan
J. Gehrke
Eric Horvitz
...
Scott M. Lundberg
Harsha Nori
Hamid Palangi
Marco Tulio Ribeiro
Yi Zhang
ELM
AI4MH
AI4CE
ALM
197
2,232
0
22 Mar 2023
Crawling the Internal Knowledge-Base of Language Models
Roi Cohen
Mor Geva
Jonathan Berant
Amir Globerson
162
74
0
30 Jan 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang
Alexandre Variengien
Arthur Conmy
Buck Shlegeris
Jacob Steinhardt
205
486
0
01 Nov 2022
In-context Learning and Induction Heads
Catherine Olsson
Nelson Elhage
Neel Nanda
Nicholas Joseph
Nova Dassarma
...
Tom B. Brown
Jack Clark
Jared Kaplan
Sam McCandlish
C. Olah
232
453
0
24 Sep 2022
Text and Patterns: For Effective Chain of Thought, It Takes Two to Tango
Aman Madaan
Amir Yazdanbakhsh
LRM
130
94
0
16 Sep 2022
Language Models as Knowledge Bases?
Fabio Petroni
Tim Rocktaschel
Patrick Lewis
A. Bakhtin
Yuxiang Wu
Alexander H. Miller
Sebastian Riedel
KELM
AI4MH
386
2,216
0
03 Sep 2019
What you can cram into a single vector: Probing sentence embeddings for linguistic properties
Alexis Conneau
Germán Kruszewski
Guillaume Lample
Loïc Barrault
Marco Baroni
196
876
0
03 May 2018
Towards A Rigorous Science of Interpretable Machine Learning
Finale Doshi-Velez
Been Kim
XAI
FaML
219
2,098
0
28 Feb 2017
1