Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2403.00824
Cited By
Information Flow Routes: Automatically Interpreting Language Models at Scale
27 February 2024
Javier Ferrando
Elena Voita
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Information Flow Routes: Automatically Interpreting Language Models at Scale"
9 / 9 papers shown
Title
HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks
Jiuding Sun
Jing Huang
Sidharth Baskaran
Karel DÓosterlinck
Christopher Potts
Michael Sklar
Atticus Geiger
AI4CE
55
0
0
13 Mar 2025
ReDeEP: Detecting Hallucination in Retrieval-Augmented Generation via Mechanistic Interpretability
ZhongXiang Sun
Xiaoxue Zang
Kai Zheng
Yang Song
Jun Xu
Xiao Zhang
Weijie Yu
Yang Song
Han Li
35
6
0
15 Oct 2024
From Tokens to Words: On the Inner Lexicon of LLMs
Guy Kaplan
Matanel Oren
Yuval Reif
Roy Schwartz
32
12
0
08 Oct 2024
Knowledge Circuits in Pretrained Transformers
Yunzhi Yao
Ningyu Zhang
Zekun Xi
Meng Wang
Ziwen Xu
Shumin Deng
Huajun Chen
KELM
45
19
0
28 May 2024
Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms
Michael Hanna
Sandro Pezzelle
Yonatan Belinkov
51
33
0
26 Mar 2024
AtP*: An efficient and scalable method for localizing LLM behaviour to components
János Kramár
Tom Lieberum
Rohin Shah
Neel Nanda
KELM
36
40
0
01 Mar 2024
Spectral Filters, Dark Signals, and Attention Sinks
Nicola Cancedda
48
5
0
14 Feb 2024
How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model
Michael Hanna
Ollie Liu
Alexandre Variengien
LRM
173
116
0
30 Apr 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang
Alexandre Variengien
Arthur Conmy
Buck Shlegeris
Jacob Steinhardt
205
486
0
01 Nov 2022
1