Transformer visualization via dictionary learning: contextualized
embedding as a linear superposition of transformer factors

29 March 2021

Bruno A. Olshausen

Papers citing "Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors"

19 / 19 papers shown

Title | Authors | Tags | Counts | Date
UNet with Axial Transformer : A Neural Weather Model for Precipitation Nowcasting | Maitreya Sonawane, Sumit Mamtani | - | 65 0 0 | 28 Apr 2025
The Complexity of Learning Sparse Superposed Features with Feedback | Akash Kumar | - | 158 0 0 | 08 Feb 2025
Out-of-distribution generalization via composition: a lens through induction heads in Transformers | Jiajun Song, Zhuoyan Xu, Yiqiao Zhong | - | 85 4 0 | 31 Dec 2024
Beyond Label Attention: Transparency in Language Models for Automated Medical Coding via Dictionary Learning | John Wu, David Wu, Jimeng Sun | - | 52 1 0 | 31 Oct 2024
Focus On This, Not That! Steering LLMs With Adaptive Feature Specification | Tom A. Lamb, Adam Davies, Alasdair Paren, Philip H. S. Torr, Francesco Pinto | - | 47 0 0 | 30 Oct 2024
Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering | Yu Zhao, Alessio Devoto, Giwon Hong, Xiaotang Du, Aryo Pradipta Gema, Hongru Wang, Xuanli He, Kam-Fai Wong, Pasquale Minervini | KELM, LLMSV | 36 16 0 | 21 Oct 2024
The Geometry of Concepts: Sparse Autoencoder Feature Structure | Yuxiao Li, Eric J. Michaud, David D. Baek, Joshua Engels, Xiaoqing Sun, Max Tegmark | - | 52 7 0 | 10 Oct 2024
Residual Stream Analysis with Multi-Layer SAEs | Tim Lawson, Lucy Farnik, Conor Houghton, Laurence Aitchison | - | 26 3 0 | 06 Sep 2024
Understanding Generative AI Content with Embedding Models | Max Vargas, Reilly Cannon, A. Engel, Anand D. Sarwate, Tony Chiang | - | 52 3 0 | 19 Aug 2024
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models | Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov, Ziyu Yao | - | 82 19 0 | 02 Jul 2024
Codebook Features: Sparse and Discrete Interpretability for Neural Networks | Alex Tamkin, Mohammad Taufeeque, Noah D. Goodman | - | 32 27 0 | 26 Oct 2023
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods | Fred Zhang, Neel Nanda | LLMSV | 33 97 0 | 27 Sep 2023
Sparse Autoencoders Find Highly Interpretable Features in Language Models | Hoagy Cunningham, Aidan Ewart, Logan Riggs, R. Huben, Lee Sharkey | MILM | 33 333 0 | 15 Sep 2023
Explaining black box text modules in natural language with language models | Chandan Singh, Aliyah R. Hsu, Richard Antonello, Shailee Jain, Alexander G. Huth, Bin-Xia Yu, Jianfeng Gao | MILM | 31 46 0 | 17 May 2023
Minimalistic Unsupervised Learning with the Sparse Manifold Transform | Yubei Chen, Zeyu Yun, Y. Ma, Bruno A. Olshausen, Yann LeCun | - | 52 8 0 | 30 Sep 2022
How to Dissect a Muppet: The Structure of Transformer Embedding Spaces | Timothee Mickus, Denis Paperno, Mathieu Constant | - | 19 19 0 | 07 Jun 2022
Explainable Patterns for Distinction and Prediction of Moral Judgement on Reddit | Ion Stagkos Efstathiadis, Guilherme Paulino-Passos, Francesca Toni | - | 24 8 0 | 26 Jan 2022
Translation Error Detection as Rationale Extraction | M. Fomicheva, Lucia Specia, Nikolaos Aletras | - | 13 23 0 | 27 Aug 2021
Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers | Hila Chefer, Shir Gur, Lior Wolf | ViT | 22 303 0 | 29 Mar 2021