Transformer Dissection: A Unified Understanding of Transformer's Attention via the Lens of Kernel
Yao-Hung Hubert Tsai, Shaojie Bai, M. Yamada, Louis-Philippe Morency, Ruslan Salakhutdinov
30 August 2019 · arXiv:1908.11775

Papers citing "Transformer Dissection: A Unified Understanding of Transformer's Attention via the Lens of Kernel"

39 papers shown

A Reproduction Study: The Kernel PCA Interpretation of Self-Attention Fails Under Scrutiny
Karahan Sarıtaş, Çağatay Yıldız
12 May 2025

Transformer Meets Twicing: Harnessing Unattended Residual Information
Laziz U. Abdullaev, Tan M. Nguyen
02 Mar 2025

Video Latent Flow Matching: Optimal Polynomial Projections for Video Interpolation and Extrapolation
Yang Cao, Zhao-quan Song, Chiwun Yang
01 Feb 2025

Tensor Product Attention Is All You Need
Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Zhen Qin, Yang Yuan, Q. Gu, Andrew Chi-Chih Yao
11 Jan 2025

Key-value memory in the brain
Samuel J. Gershman, Ila Fiete, Kazuki Irie
06 Jan 2025

Fast Gradient Computation for RoPE Attention in Almost Linear Time
Yifang Chen, Jiayan Huo, Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao-quan Song
03 Jan 2025

Generative Adapter: Contextualizing Language Models in Parameters with A Single Forward Pass
Tong Chen, Hao Fang, Patrick Xia, Xiaodong Liu, Benjamin Van Durme, Luke Zettlemoyer, Jianfeng Gao, Hao Cheng
08 Nov 2024

Context-Scaling versus Task-Scaling in In-Context Learning
Amirhesam Abedsoltan, Adityanarayanan Radhakrishnan, Jingfeng Wu, M. Belkin
16 Oct 2024

How Effective are State Space Models for Machine Translation?
Hugo Pitorro, Pavlo Vasylenko, Marcos Vinícius Treviso, André F. T. Martins
07 Jul 2024

DiJiang: Efficient Large Language Models through Compact Kernelization
Hanting Chen, Zhicheng Liu, Xutao Wang, Yuchuan Tian, Yunhe Wang
29 Mar 2024

Data-free Weight Compress and Denoise for Large Language Models
Runyu Peng, Yunhua Zhou, Qipeng Guo, Yang Gao, Hang Yan, Xipeng Qiu, Dahua Lin
26 Feb 2024

Breaking Symmetry When Training Transformers
Chunsheng Zuo, Michael Guerzhoy
06 Feb 2024

DF2: Distribution-Free Decision-Focused Learning
Lingkai Kong, Wenhao Mu, Jiaming Cui, Yuchen Zhuang, B. Prakash, Bo Dai, Chao Zhang
11 Aug 2023

Inductive biases in deep learning models for weather prediction
Jannik Thümmel, Matthias Karlbauer, S. Otte, C. Zarfl, Georg Martius, ..., Thomas Scholten, Ulrich Friedrich, V. Wulfmeyer, B. Goswami, Martin Volker Butz
06 Apr 2023

Learning a Fourier Transform for Linear Relative Positional Encodings in Transformers
K. Choromanski, Shanda Li, Valerii Likhosherstov, Kumar Avinava Dubey, Shengjie Luo, Di He, Yiming Yang, Tamás Sarlós, Thomas Weingarten, Adrian Weller
03 Feb 2023

An Analysis of Attention via the Lens of Exchangeability and Latent Variable Models
Yufeng Zhang, Boyi Liu, Qi Cai, Lingxiao Wang, Zhaoran Wang
30 Dec 2022

HigeNet: A Highly Efficient Modeling for Long Sequence Time Series Prediction in AIOps
Jiajia Li, Feng Tan, Cheng He, Zikai Wang, Haitao Song, Lingfei Wu, Pengwei Hu
13 Nov 2022

Features Fusion Framework for Multimodal Irregular Time-series Events
Peiwang Tang, Xianchao Zhang
05 Sep 2022

Momentum Transformer: Closing the Performance Gap Between Self-attention and Its Linearization
T. Nguyen, Richard G. Baraniuk, Robert M. Kirby, Stanley J. Osher, Bao Wang
01 Aug 2022

KERPLE: Kernelized Relative Positional Embedding for Length Extrapolation
Ta-Chung Chi, Ting-Han Fan, Peter J. Ramadge, Alexander I. Rudnicky
20 May 2022

Approximating Permutations with Neural Network Components for Travelling Photographer Problem
S. Chong
30 Apr 2022

A Call for Clarity in Beam Search: How It Works and When It Stops
Jungo Kasai, Keisuke Sakaguchi, Ronan Le Bras, Dragomir R. Radev, Yejin Choi, Noah A. Smith
11 Apr 2022

Wasserstein Adversarial Transformer for Cloud Workload Prediction
Shivani Arbat, V. Jayakumar, Jaewoo Lee, Wei Wang, I. Kim
12 Mar 2022

cosFormer: Rethinking Softmax in Attention
Zhen Qin, Weixuan Sun, Huicai Deng, Dongxu Li, Yunshen Wei, Baohong Lv, Junjie Yan, Lingpeng Kong, Yiran Zhong
17 Feb 2022

Learning Operators with Coupled Attention
Georgios Kissas, Jacob H. Seidman, Leonardo Ferreira Guilhoto, V. Preciado, George J. Pappas, P. Perdikaris
04 Jan 2022

Trading with the Momentum Transformer: An Intelligent and Interpretable Architecture
Kieran Wood, Sven Giegerich, Stephen J. Roberts, S. Zohren
16 Dec 2021

Transformers for prompt-level EMA non-response prediction
Supriya Nagesh, Alexander Moreno, Stephanie M Carpenter, Jamie Yap, Soujanya Chatterjee, ..., Santosh Kumar, Cho Lam, D. Wetter, Inbal Nahum-Shani, James M. Rehg
01 Nov 2021

Ultra-high Resolution Image Segmentation via Locality-aware Context Fusion and Alternating Local Enhancement
Wenxi Liu, Qi Li, Xin Lin, Weixiang Yang, Shengfeng He, Yuanlong Yu
06 Sep 2021

GraphiT: Encoding Graph Structure in Transformers
Grégoire Mialon, Dexiong Chen, Margot Selosse, Julien Mairal
10 Jun 2021

CoAtNet: Marrying Convolution and Attention for All Data Sizes
Zihang Dai, Hanxiao Liu, Quoc V. Le, Mingxing Tan
09 Jun 2021

A Survey of Transformers
Tianyang Lin, Yuxin Wang, Xiangyang Liu, Xipeng Qiu
08 Jun 2021

Choose a Transformer: Fourier or Galerkin
Shuhao Cao
31 May 2021

Relative Positional Encoding for Transformers with Linear Complexity
Antoine Liutkus, Ondřej Cífka, Shih-Lun Wu, Umut Simsekli, Yi-Hsuan Yang, Gaël Richard
18 May 2021

Linear Transformers Are Secretly Fast Weight Programmers
Imanol Schlag, Kazuki Irie, Jürgen Schmidhuber
22 Feb 2021

Rethinking Attention with Performers
K. Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, ..., Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy J. Colwell, Adrian Weller
30 Sep 2020

On the Computational Power of Transformers and its Implications in Sequence Modeling
S. Bhattamishra, Arkil Patel, Navin Goyal
16 Jun 2020

The Lipschitz Constant of Self-Attention
Hyunjik Kim, George Papamakarios, A. Mnih
08 Jun 2020

Kernel Self-Attention in Deep Multiple Instance Learning
Dawid Rymarczyk, Adriana Borowa, Jacek Tabor, Bartosz Zieliński
25 May 2020

Classical Structured Prediction Losses for Sequence to Sequence Learning
Sergey Edunov, Myle Ott, Michael Auli, David Grangier, Marc'Aurelio Ranzato
14 Nov 2017