Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth

Yihe Dong, Jean-Baptiste Cordonnier, Andreas Loukas
5 March 2021 · arXiv:2103.03404 (PDF, HTML)
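
The paper's central claim, that a stack of self-attention layers without skip connections or MLPs converges towards a rank-1 matrix at a doubly exponential rate in depth, can be observed numerically. Below is a minimal sketch (an illustration under assumed random Gaussian weights and single-head attention, not the authors' code; the helper names `softmax` and `pure_attention` are my own) that stacks residual-free attention layers and tracks the ratio of the second to the largest singular value of the token matrix. Under these assumptions the ratio typically shrinks rapidly as layers are added, which is the rank-collapse behaviour the title describes.

```python
# Minimal sketch (assumptions: random Gaussian weights, single-head attention,
# no residual connections, no MLPs) of rank collapse in pure attention stacks.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, depth = 32, 64, 12   # sequence length, width, number of layers

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # stabilise before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def pure_attention(X, Wq, Wk, Wv):
    # softmax(X Wq (X Wk)^T / sqrt(d)) X Wv, with no residual path
    logits = (X @ Wq) @ (X @ Wk).T / np.sqrt(d_model)
    return softmax(logits) @ X @ Wv

X = rng.standard_normal((n_tokens, d_model))
for layer in range(1, depth + 1):
    Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                  for _ in range(3))
    X = pure_attention(X, Wq, Wk, Wv)
    s = np.linalg.svd(X, compute_uv=False)
    # ratio of second to largest singular value: near 0 means (numerically) rank 1
    print(f"layer {layer:2d}  sigma2/sigma1 = {s[1] / s[0]:.2e}")
```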

Papers citing "Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth" (50 of 238 shown)
• Characterizing Large Language Model Geometry Helps Solve Toxicity Detection and Generation (04 Dec 2023). Randall Balestriero, Romain Cosentino, Sarath Shekkizhar.
• Bootstrapping Interactive Image-Text Alignment for Remote Sensing Image Captioning (02 Dec 2023). Cong Yang, Zuchao Li, Lefei Zhang.
• Pointer Networks Trained Better via Evolutionary Algorithms (02 Dec 2023). Muyao Zhong, Shengcai Liu, Bingdong Li, Haobo Fu, Ke Tang, Peng Yang.
• Mitigating Over-smoothing in Transformers via Regularized Nonlocal Functionals (01 Dec 2023). Tam Nguyen, Tan-Minh Nguyen, Richard G. Baraniuk.
• SCHEME: Scalable Channel Mixer for Vision Transformers (01 Dec 2023). Deepak Sridhar, Yunsheng Li, Nuno Vasconcelos.
• Probabilistic Transformer: A Probabilistic Dependency Model for Contextual Word Representation (26 Nov 2023). Haoyi Wu, Kewei Tu.
• p-Laplacian Transformer (06 Nov 2023). Tuan Nguyen, Tam Nguyen, Vinh-Tiep Nguyen, Tan-Minh Nguyen.
• Simplifying Transformer Blocks (03 Nov 2023). Bobby He, Thomas Hofmann.
• Sliceformer: Make Multi-head Attention as Simple as Sorting in Discriminative Tasks (26 Oct 2023). Shen Yuan, Hongteng Xu.
• Circuit as Set of Points (26 Oct 2023). Jialv Zou, Xinggang Wang, Jiahao Guo, Wenyu Liu, Qian Zhang, Chang Huang. Tags: GNN, 3DV, 3DPC.
• Unraveling Feature Extraction Mechanisms in Neural Networks (25 Oct 2023). Xiaobing Sun, Jiaxi Li, Wei Lu.
• PartialFormer: Modeling Part Instead of Whole for Machine Translation (23 Oct 2023). Tong Zheng, Bei Li, Huiwen Bao, Jiale Wang, Weiqiao Shan, Tong Xiao, Jingbo Zhu. Tags: MoE, AI4CE.
• Eureka-Moments in Transformers: Multi-Step Tasks Reveal Softmax Induced Optimization Problems (19 Oct 2023). David T. Hoffmann, Simon Schrodi, Jelena Bratulić, Nadine Behrmann, Volker Fischer, Thomas Brox.
• On the Optimization and Generalization of Multi-head Attention (19 Oct 2023). Puneesh Deora, Rouzbeh Ghaderi, Hossein Taheri, Christos Thrampoulidis. Tags: MLT.
• Language Models are Universal Embedders (12 Oct 2023). Xin Zhang, Zehan Li, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Meishan Zhang, Min Zhang. Tags: KELM, ELM.
• Towards Training Without Depth Limits: Batch Normalization Without Gradient Explosion (03 Oct 2023). Alexandru Meterez, Amir Joudaki, Francesco Orabona, Alexander Immer, Gunnar Rätsch, Hadi Daneshmand.
• Transformers are efficient hierarchical chemical graph learners (02 Oct 2023). Zihan Pengmei, Zimu Li, Chih-chan Tien, Risi Kondor, Aaron R Dinner. Tags: GNN.
• Symmetry Induces Structure and Constraint of Learning (29 Sep 2023). Liu Ziyin.
• RBFormer: Improve Adversarial Robustness of Transformer by Robust Bias (23 Sep 2023). Hao Cheng, Jinhao Duan, Hui Li, Lyutianyang Zhang, Jiahang Cao, Ping Wang, Jize Zhang, Kaidi Xu, Renjing Xu. Tags: AAML.
• Attention-Only Transformers and Implementing MLPs with Attention Heads (15 Sep 2023). R. Huben, Valerie Morris.
• Temporal Action Localization with Enhanced Instant Discriminability (11 Sep 2023). Ding Shi, Qiong Cao, Yujie Zhong, Shan An, Jian Cheng, Haogang Zhu, Dacheng Tao.
• Transformers as Support Vector Machines (31 Aug 2023). Davoud Ataee Tarzanagh, Yingcong Li, Christos Thrampoulidis, Samet Oymak.
• Rank Collapse Causes Over-Smoothing and Over-Correlation in Graph Neural Networks (31 Aug 2023). Andreas Roth, Thomas Liebig.
• Self-Feedback DETR for Temporal Action Detection (21 Aug 2023). Jihwan Kim, Miso Lee, Jae-Pil Heo.
• The Costly Dilemma: Generalization, Evaluation and Cost-Optimal Deployment of Large Language Models (15 Aug 2023). Abi Aryan, Aakash Kumar Nain, Andrew McMahon, Lucas Augusto Meyer, Harpreet Sahota.
• SCSC: Spatial Cross-scale Convolution Module to Strengthen both CNNs and Transformers (14 Aug 2023). Xijun Wang, Xiaojie Chu, Chunrui Han, Xiangyu Zhang. Tags: ViT.
• LEST: Large-scale LiDAR Semantic Segmentation with Transformer (14 Jul 2023). Chuanyu Luo, Nuo Cheng, Sikun Ma, Han Li, Xiaohan Li, Shengguang Lei, Pu Li. Tags: 3DPC, ViT.
• The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit (30 Jun 2023). Lorenzo Noci, Chuning Li, Mufan Bill Li, Bobby He, Thomas Hofmann, Chris J. Maddison, Daniel M. Roy.
• A generic self-supervised learning (SSL) framework for representation learning from spectra-spatial feature of unlabeled remote sensing imagery (27 Jun 2023). Xin Zhang, Liangxiu Han. Tags: SSL.
• Max-Margin Token Selection in Attention Mechanism (23 Jun 2023). Davoud Ataee Tarzanagh, Yingcong Li, Xuechen Zhang, Samet Oymak.
• On the Role of Attention in Prompt-tuning (06 Jun 2023). Samet Oymak, A. S. Rawat, Mahdi Soltanolkotabi, Christos Thrampoulidis. Tags: MLT, LRM.
• Towards Deep Attention in Graph Neural Networks: Problems and Remedies (04 Jun 2023). Soo Yong Lee, Fanchen Bu, Jaemin Yoo, Kijung Shin. Tags: GNN.
• Memorization Capacity of Multi-Head Attention in Transformers (03 Jun 2023). Sadegh Mahdavi, Renjie Liao, Christos Thrampoulidis.
• Universality and Limitations of Prompt Tuning (30 May 2023). Yihan Wang, Jatin Chauhan, Wei Wang, Cho-Jui Hsieh.
• On the impact of activation and normalization in obtaining isometric embeddings at initialization (28 May 2023). Amir Joudaki, Hadi Daneshmand, Francis R. Bach.
• Scalable Transformer for PDE Surrogate Modeling (27 May 2023). Zijie Li, Dule Shu, A. Farimani.
• Investigating the Role of Feed-Forward Networks in Transformers Using Parallel Attention and Feed-Forward Net Design (22 May 2023). Shashank Sonkar, Richard G. Baraniuk.
• The emergence of clusters in self-attention dynamics (09 May 2023). Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, Philippe Rigollet.
• Medical SAM Adapter: Adapting Segment Anything Model for Medical Image Segmentation (25 Apr 2023). Junde Wu, Rao Fu, Yuanpei Liu, Huihui Fang, Zhao-Yang Wang, Yanwu Xu, Yueming Jin. Tags: VLM, MedIm.
• Causal Decision Transformer for Recommender Systems via Offline Reinforcement Learning (17 Apr 2023). Siyu Wang, Xiaocong Chen, Dietmar Jannach, Lina Yao. Tags: CML, OffRL.
• Decoder-Only or Encoder-Decoder? Interpreting Language Model as a Regularized Encoder-Decoder (08 Apr 2023). Z. Fu, W. Lam, Qian Yu, Anthony Man-Cho So, Shengding Hu, Zhiyuan Liu, Nigel Collier. Tags: AuLLM.
• TriDet: Temporal Action Detection with Relative Boundary Modeling (13 Mar 2023). Ding Shi, Yujie Zhong, Qiong Cao, Lin Ma, Jia Li, Dacheng Tao. Tags: ViT.
• Stabilizing Transformer Training by Preventing Attention Entropy Collapse (11 Mar 2023). Shuangfei Zhai, Tatiana Likhomanenko, Etai Littwin, Dan Busbridge, Jason Ramapuram, Yizhe Zhang, Jiatao Gu, J. Susskind. Tags: AAML.
• A Message Passing Perspective on Learning Dynamics of Contrastive Learning (08 Mar 2023). Yifei Wang, Qi Zhang, Tianqi Du, Jiansheng Yang, Zhouchen Lin, Yisen Wang. Tags: SSL.
• Are More Layers Beneficial to Graph Transformers? (01 Mar 2023). Haiteng Zhao, Shuming Ma, Dongdong Zhang, Zhi-Hong Deng, Furu Wei.
• Multi-Layer Attention-Based Explainability via Transformers for Tabular Data (28 Feb 2023). Andrea Trevino Gavito, Diego Klabjan, J. Utke. Tags: LMTD.
• A Brief Survey on the Approximation Theory for Sequence Modelling (27 Feb 2023). Hao Jiang, Qianxiao Li, Zhong Li, Shida Wang. Tags: AI4TS.
• Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation (20 Feb 2023). Bobby He, James Martens, Guodong Zhang, Aleksandar Botev, Andy Brock, Samuel L. Smith, Yee Whye Teh.
• Hyneter: Hybrid Network Transformer for Object Detection (18 Feb 2023). Dong Chen, Duoqian Miao, Xuepeng Zhao. Tags: ViT.
• A Theoretical Understanding of Shallow Vision Transformers: Learning, Generalization, and Sample Complexity (12 Feb 2023). Hongkang Li, M. Wang, Sijia Liu, Pin-Yu Chen. Tags: ViT, MLT.