ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2407.02646
  4. Cited By
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models

A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models

2 July 2024
Daking Rai
Yilun Zhou
Shi Feng
Abulhair Saparov
Ziyu Yao
ArXivPDFHTML

Papers citing "A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models"

30 / 30 papers shown
Title
In-Context Learning can distort the relationship between sequence likelihoods and biological fitness
In-Context Learning can distort the relationship between sequence likelihoods and biological fitness
Pranav Kantroo
Günter P. Wagner
Benjamin B. Machta
25
0
0
23 Apr 2025
Understanding the Skill Gap in Recurrent Language Models: The Role of the Gather-and-Aggregate Mechanism
Understanding the Skill Gap in Recurrent Language Models: The Role of the Gather-and-Aggregate Mechanism
Aviv Bick
Eric P. Xing
Albert Gu
RALM
81
0
0
22 Apr 2025
Layers at Similar Depths Generate Similar Activations Across LLM Architectures
Layers at Similar Depths Generate Similar Activations Across LLM Architectures
Christopher Wolfram
Aaron Schein
18
0
0
03 Apr 2025
Reasoning about Affordances: Causal and Compositional Reasoning in LLMs
Reasoning about Affordances: Causal and Compositional Reasoning in LLMs
Magnus F. Gjerde
Vanessa Cheung
David Lagnado
ReLM
LRM
39
0
0
23 Feb 2025
SAE-V: Interpreting Multimodal Models for Enhanced Alignment
SAE-V: Interpreting Multimodal Models for Enhanced Alignment
Hantao Lou
Changye Li
Jiaming Ji
Yaodong Yang
34
0
0
22 Feb 2025
An explainable transformer circuit for compositional generalization
An explainable transformer circuit for compositional generalization
Cheng Tang
Brenden Lake
Mehrdad Jazayeri
LRM
33
0
0
19 Feb 2025
Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis
Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis
X. Wang
Yan Hu
Wenyu Du
Reynold Cheng
Benyou Wang
Difan Zou
48
0
0
17 Feb 2025
Deciphering Functions of Neurons in Vision-Language Models
Deciphering Functions of Neurons in Vision-Language Models
Jiaqi Xu
Cuiling Lan
Xuejin Chen
Yan Lu
VLM
61
0
0
10 Feb 2025
Interpretable Language Modeling via Induction-head Ngram Models
Interpretable Language Modeling via Induction-head Ngram Models
Eunji Kim
Sriya Mantena
Weiwei Yang
Chandan Singh
Sungroh Yoon
Jianfeng Gao
29
0
0
31 Oct 2024
Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse
  Autoencoders
Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders
Viacheslav Surkov
Chris Wendler
Mikhail Terekhov
Justin Deschenaux
Robert West
Çağlar Gülçehre
VLM
21
12
0
28 Oct 2024
Enforcing Interpretability in Time Series Transformers: A Concept
  Bottleneck Framework
Enforcing Interpretability in Time Series Transformers: A Concept Bottleneck Framework
Angela van Sprang
Erman Acar
Willem Zuidema
AI4TS
22
1
0
08 Oct 2024
System 2 Reasoning Capabilities Are Nigh
System 2 Reasoning Capabilities Are Nigh
Scott C. Lowe
VLM
LRM
19
0
0
04 Oct 2024
Listening to the Wise Few: Select-and-Copy Attention Heads for
  Multiple-Choice QA
Listening to the Wise Few: Select-and-Copy Attention Heads for Multiple-Choice QA
Eduard Tulchinskii
Laida Kushnareva
Kristian Kuznetsov
Anastasia Voznyuk
Andrei Andriiainen
Irina Piontkovskaya
Evgeny Burnaev
Serguei Barannikov
51
1
0
03 Oct 2024
Locating and Editing Factual Associations in Mamba
Locating and Editing Factual Associations in Mamba
Arnab Sen Sharma
David Atkinson
David Bau
KELM
62
16
0
04 Apr 2024
AtP*: An efficient and scalable method for localizing LLM behaviour to
  components
AtP*: An efficient and scalable method for localizing LLM behaviour to components
János Kramár
Tom Lieberum
Rohin Shah
Neel Nanda
KELM
28
40
0
01 Mar 2024
RAVEL: Evaluating Interpretability Methods on Disentangling Language
  Model Representations
RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations
Jing-ling Huang
Zhengxuan Wu
Christopher Potts
Mor Geva
Atticus Geiger
38
24
0
27 Feb 2024
Unified View of Grokking, Double Descent and Emergent Abilities: A
  Perspective from Circuits Competition
Unified View of Grokking, Double Descent and Emergent Abilities: A Perspective from Circuits Competition
Yufei Huang
Shengding Hu
Xu Han
Zhiyuan Liu
Maosong Sun
50
6
0
23 Feb 2024
Increasing Trust in Language Models through the Reuse of Verified
  Circuits
Increasing Trust in Language Models through the Reuse of Verified Circuits
Philip Quirke
Clement Neo
Fazl Barez
KELM
LRM
23
3
0
04 Feb 2024
Universal Neurons in GPT2 Language Models
Universal Neurons in GPT2 Language Models
Wes Gurnee
Theo Horsley
Zifan Carl Guo
Tara Rezaei Kheirkhah
Qinyi Sun
Will Hathaway
Neel Nanda
Dimitris Bertsimas
MILM
75
37
0
22 Jan 2024
Attribution Patching Outperforms Automated Circuit Discovery
Attribution Patching Outperforms Automated Circuit Discovery
Aaquib Syed
Can Rager
Arthur Conmy
39
18
0
16 Oct 2023
Finding Neurons in a Haystack: Case Studies with Sparse Probing
Finding Neurons in a Haystack: Case Studies with Sparse Probing
Wes Gurnee
Neel Nanda
Matthew Pauly
Katherine Harvey
Dmitrii Troitskii
Dimitris Bertsimas
MILM
150
170
0
02 May 2023
Dissecting Recall of Factual Associations in Auto-Regressive Language
  Models
Dissecting Recall of Factual Associations in Auto-Regressive Language Models
Mor Geva
Jasmijn Bastings
Katja Filippova
Amir Globerson
KELM
180
152
0
28 Apr 2023
Sparks of Artificial General Intelligence: Early experiments with GPT-4
Sparks of Artificial General Intelligence: Early experiments with GPT-4
Sébastien Bubeck
Varun Chandrasekaran
Ronen Eldan
J. Gehrke
Eric Horvitz
...
Scott M. Lundberg
Harsha Nori
Hamid Palangi
Marco Tulio Ribeiro
Yi Zhang
ELM
AI4MH
AI4CE
ALM
197
2,232
0
22 Mar 2023
Crawling the Internal Knowledge-Base of Language Models
Crawling the Internal Knowledge-Base of Language Models
Roi Cohen
Mor Geva
Jonathan Berant
Amir Globerson
162
74
0
30 Jan 2023
Interpretability in the Wild: a Circuit for Indirect Object
  Identification in GPT-2 small
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang
Alexandre Variengien
Arthur Conmy
Buck Shlegeris
Jacob Steinhardt
205
486
0
01 Nov 2022
In-context Learning and Induction Heads
In-context Learning and Induction Heads
Catherine Olsson
Nelson Elhage
Neel Nanda
Nicholas Joseph
Nova Dassarma
...
Tom B. Brown
Jack Clark
Jared Kaplan
Sam McCandlish
C. Olah
232
453
0
24 Sep 2022
Text and Patterns: For Effective Chain of Thought, It Takes Two to Tango
Text and Patterns: For Effective Chain of Thought, It Takes Two to Tango
Aman Madaan
Amir Yazdanbakhsh
LRM
130
94
0
16 Sep 2022
Language Models as Knowledge Bases?
Language Models as Knowledge Bases?
Fabio Petroni
Tim Rocktaschel
Patrick Lewis
A. Bakhtin
Yuxiang Wu
Alexander H. Miller
Sebastian Riedel
KELM
AI4MH
386
2,216
0
03 Sep 2019
What you can cram into a single vector: Probing sentence embeddings for
  linguistic properties
What you can cram into a single vector: Probing sentence embeddings for linguistic properties
Alexis Conneau
Germán Kruszewski
Guillaume Lample
Loïc Barrault
Marco Baroni
196
876
0
03 May 2018
Towards A Rigorous Science of Interpretable Machine Learning
Towards A Rigorous Science of Interpretable Machine Learning
Finale Doshi-Velez
Been Kim
XAI
FaML
219
2,098
0
28 Feb 2017
1