arXiv:2103.15949
Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors
29 March 2021
Zeyu Yun
Yubei Chen
Bruno A. Olshausen
Yann LeCun
Papers citing "Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors"
19 / 19 papers shown
UNet with Axial Transformer : A Neural Weather Model for Precipitation Nowcasting
Maitreya Sonawane
Sumit Mamtani
65
0
0
28 Apr 2025
The Complexity of Learning Sparse Superposed Features with Feedback
Akash Kumar
158
0
0
08 Feb 2025
Out-of-distribution generalization via composition: a lens through induction heads in Transformers
Jiajun Song
Zhuoyan Xu
Yiqiao Zhong
85
4
0
31 Dec 2024
Beyond Label Attention: Transparency in Language Models for Automated Medical Coding via Dictionary Learning
John Wu
David Wu
Jimeng Sun
52
1
0
31 Oct 2024
Focus On This, Not That! Steering LLMs With Adaptive Feature Specification
Tom A. Lamb
Adam Davies
Alasdair Paren
Philip H. S. Torr
Francesco Pinto
47
0
0
30 Oct 2024
Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering
Yu Zhao
Alessio Devoto
Giwon Hong
Xiaotang Du
Aryo Pradipta Gema
Hongru Wang
Xuanli He
Kam-Fai Wong
Pasquale Minervini
KELM
LLMSV
36
16
0
21 Oct 2024
The Geometry of Concepts: Sparse Autoencoder Feature Structure
Yuxiao Li
Eric J. Michaud
David D. Baek
Joshua Engels
Xiaoqing Sun
Max Tegmark
52
7
0
10 Oct 2024
Residual Stream Analysis with Multi-Layer SAEs
Tim Lawson
Lucy Farnik
Conor Houghton
Laurence Aitchison
26
3
0
06 Sep 2024
Understanding Generative AI Content with Embedding Models
Max Vargas
Reilly Cannon
A. Engel
Anand D. Sarwate
Tony Chiang
52
3
0
19 Aug 2024
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
Daking Rai
Yilun Zhou
Shi Feng
Abulhair Saparov
Ziyu Yao
82
19
0
02 Jul 2024
Codebook Features: Sparse and Discrete Interpretability for Neural Networks
Alex Tamkin
Mohammad Taufeeque
Noah D. Goodman
32
27
0
26 Oct 2023
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods
Fred Zhang
Neel Nanda
LLMSV
33
97
0
27 Sep 2023
Sparse Autoencoders Find Highly Interpretable Features in Language Models
Hoagy Cunningham
Aidan Ewart
Logan Riggs
R. Huben
Lee Sharkey
MILM
33
333
0
15 Sep 2023
Explaining black box text modules in natural language with language models
Chandan Singh
Aliyah R. Hsu
Richard Antonello
Shailee Jain
Alexander G. Huth
Bin-Xia Yu
Jianfeng Gao
MILM
31
46
0
17 May 2023
Minimalistic Unsupervised Learning with the Sparse Manifold Transform
Yubei Chen
Zeyu Yun
Y. Ma
Bruno A. Olshausen
Yann LeCun
52
8
0
30 Sep 2022
How to Dissect a Muppet: The Structure of Transformer Embedding Spaces
Timothee Mickus
Denis Paperno
Mathieu Constant
19
19
0
07 Jun 2022
Explainable Patterns for Distinction and Prediction of Moral Judgement on Reddit
Ion Stagkos Efstathiadis
Guilherme Paulino-Passos
Francesca Toni
24
8
0
26 Jan 2022
Translation Error Detection as Rationale Extraction
M. Fomicheva
Lucia Specia
Nikolaos Aletras
13
23
0
27 Aug 2021
Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers
Hila Chefer
Shir Gur
Lior Wolf
ViT
22
303
0
29 Mar 2021