Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth
arXiv 2103.03404 · 5 March 2021
Yihe Dong
Jean-Baptiste Cordonnier
Andreas Loukas
Papers citing "Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth" (showing 50 of 238)
Characterizing Large Language Model Geometry Helps Solve Toxicity Detection and Generation
Randall Balestriero
Romain Cosentino
Sarath Shekkizhar
28
2
0
04 Dec 2023
Bootstrapping Interactive Image-Text Alignment for Remote Sensing Image Captioning
Cong Yang
Zuchao Li
Lefei Zhang
29
23
0
02 Dec 2023
Pointer Networks Trained Better via Evolutionary Algorithms
Muyao Zhong
Shengcai Liu
Bingdong Li
Haobo Fu
Ke Tang
Peng Yang
23
0
0
02 Dec 2023
Mitigating Over-smoothing in Transformers via Regularized Nonlocal Functionals
Tam Nguyen
Tan-Minh Nguyen
Richard G. Baraniuk
21
8
0
01 Dec 2023
SCHEME: Scalable Channel Mixer for Vision Transformers
Deepak Sridhar
Yunsheng Li
Nuno Vasconcelos
18
0
0
01 Dec 2023
Probabilistic Transformer: A Probabilistic Dependency Model for Contextual Word Representation
Haoyi Wu
Kewei Tu
99
3
0
26 Nov 2023
p-Laplacian Transformer
Tuan Nguyen
Tam Nguyen
Vinh-Tiep Nguyen
Tan-Minh Nguyen
69
0
0
06 Nov 2023
Simplifying Transformer Blocks
Bobby He
Thomas Hofmann
19
30
0
03 Nov 2023
Sliceformer: Make Multi-head Attention as Simple as Sorting in Discriminative Tasks
Shen Yuan
Hongteng Xu
16
0
0
26 Oct 2023
Circuit as Set of Points
Jialv Zou
Xinggang Wang
Jiahao Guo
Wenyu Liu
Qian Zhang
Chang Huang
GNN
3DV
3DPC
23
0
0
26 Oct 2023
Unraveling Feature Extraction Mechanisms in Neural Networks
Xiaobing Sun
Jiaxi Li
Wei Lu
18
0
0
25 Oct 2023
PartialFormer: Modeling Part Instead of Whole for Machine Translation
Tong Zheng
Bei Li
Huiwen Bao
Jiale Wang
Weiqiao Shan
Tong Xiao
Jingbo Zhu
MoE
AI4CE
11
0
0
23 Oct 2023
Eureka-Moments in Transformers: Multi-Step Tasks Reveal Softmax Induced Optimization Problems
David T. Hoffmann
Simon Schrodi
Jelena Bratulić
Nadine Behrmann
Volker Fischer
Thomas Brox
30
5
0
19 Oct 2023
On the Optimization and Generalization of Multi-head Attention
Puneesh Deora
Rouzbeh Ghaderi
Hossein Taheri
Christos Thrampoulidis
MLT
39
33
0
19 Oct 2023
Language Models are Universal Embedders
Xin Zhang
Zehan Li
Yanzhao Zhang
Dingkun Long
Pengjun Xie
Meishan Zhang
Min Zhang
KELM
ELM
35
6
0
12 Oct 2023
Towards Training Without Depth Limits: Batch Normalization Without Gradient Explosion
Alexandru Meterez
Amir Joudaki
Francesco Orabona
Alexander Immer
Gunnar Rätsch
Hadi Daneshmand
24
8
0
03 Oct 2023
Transformers are efficient hierarchical chemical graph learners
Zihan Pengmei
Zimu Li
Chih-chan Tien
Risi Kondor
Aaron R Dinner
GNN
21
1
0
02 Oct 2023
Symmetry Induces Structure and Constraint of Learning
Liu Ziyin
26
10
0
29 Sep 2023
RBFormer: Improve Adversarial Robustness of Transformer by Robust Bias
Hao Cheng
Jinhao Duan
Hui Li
Lyutianyang Zhang
Jiahang Cao
Ping Wang
Jize Zhang
Kaidi Xu
Renjing Xu
AAML
21
3
0
23 Sep 2023
Attention-Only Transformers and Implementing MLPs with Attention Heads
R. Huben
Valerie Morris
11
0
0
15 Sep 2023
Temporal Action Localization with Enhanced Instant Discriminability
Ding Shi
Qiong Cao
Yujie Zhong
Shan An
Jian Cheng
Haogang Zhu
Dacheng Tao
27
9
0
11 Sep 2023
Transformers as Support Vector Machines
Davoud Ataee Tarzanagh
Yingcong Li
Christos Thrampoulidis
Samet Oymak
35
43
0
31 Aug 2023
Rank Collapse Causes Over-Smoothing and Over-Correlation in Graph Neural Networks
Andreas Roth
Thomas Liebig
29
11
0
31 Aug 2023
Self-Feedback DETR for Temporal Action Detection
Jihwan Kim
Miso Lee
Jae-Pil Heo
37
17
0
21 Aug 2023
The Costly Dilemma: Generalization, Evaluation and Cost-Optimal Deployment of Large Language Models
Abi Aryan
Aakash Kumar Nain
Andrew McMahon
Lucas Augusto Meyer
Harpreet Sahota
22
6
0
15 Aug 2023
SCSC: Spatial Cross-scale Convolution Module to Strengthen both CNNs and Transformers
Xijun Wang
Xiaojie Chu
Chunrui Han
Xiangyu Zhang
ViT
18
1
0
14 Aug 2023
LEST: Large-scale LiDAR Semantic Segmentation with Transformer
Chuanyu Luo
Nuo Cheng
Sikun Ma
Han Li
Xiaohan Li
Shengguang Lei
Pu Li
3DPC
ViT
17
2
0
14 Jul 2023
The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit
Lorenzo Noci
Chuning Li
Mufan Bill Li
Bobby He
Thomas Hofmann
Chris J. Maddison
Daniel M. Roy
30
29
0
30 Jun 2023
A generic self-supervised learning (SSL) framework for representation learning from spectra-spatial feature of unlabeled remote sensing imagery
Xin Zhang
Liangxiu Han
SSL
16
2
0
27 Jun 2023
Max-Margin Token Selection in Attention Mechanism
Davoud Ataee Tarzanagh
Yingcong Li
Xuechen Zhang
Samet Oymak
32
38
0
23 Jun 2023
On the Role of Attention in Prompt-tuning
Samet Oymak
A. S. Rawat
Mahdi Soltanolkotabi
Christos Thrampoulidis
MLT
LRM
20
41
0
06 Jun 2023
Towards Deep Attention in Graph Neural Networks: Problems and Remedies
Soo Yong Lee
Fanchen Bu
Jaemin Yoo
Kijung Shin
GNN
11
30
0
04 Jun 2023
Memorization Capacity of Multi-Head Attention in Transformers
Sadegh Mahdavi
Renjie Liao
Christos Thrampoulidis
22
22
0
03 Jun 2023
Universality and Limitations of Prompt Tuning
Yihan Wang
Jatin Chauhan
Wei Wang
Cho-Jui Hsieh
37
17
0
30 May 2023
On the impact of activation and normalization in obtaining isometric embeddings at initialization
Amir Joudaki
Hadi Daneshmand
Francis R. Bach
11
9
0
28 May 2023
Scalable Transformer for PDE Surrogate Modeling
Zijie Li
Dule Shu
A. Farimani
24
63
0
27 May 2023
Investigating the Role of Feed-Forward Networks in Transformers Using Parallel Attention and Feed-Forward Net Design
Shashank Sonkar
Richard G. Baraniuk
11
2
0
22 May 2023
The emergence of clusters in self-attention dynamics
Borjan Geshkovski
Cyril Letrouit
Yury Polyanskiy
Philippe Rigollet
22
46
0
09 May 2023
Medical SAM Adapter: Adapting Segment Anything Model for Medical Image Segmentation
Junde Wu
Rao Fu
Yuanpei Liu
Huihui Fang
Zhao-Yang Wang
Yanwu Xu
Yueming Jin
VLM
MedIm
39
464
0
25 Apr 2023
Causal Decision Transformer for Recommender Systems via Offline Reinforcement Learning
Siyu Wang
Xiaocong Chen
Dietmar Jannach
Lina Yao
CML
OffRL
11
27
0
17 Apr 2023
Decoder-Only or Encoder-Decoder? Interpreting Language Model as a Regularized Encoder-Decoder
Z. Fu
W. Lam
Qian Yu
Anthony Man-Cho So
Shengding Hu
Zhiyuan Liu
Nigel Collier
AuLLM
28
41
0
08 Apr 2023
TriDet: Temporal Action Detection with Relative Boundary Modeling
Ding Shi
Yujie Zhong
Qiong Cao
Lin Ma
Jia Li
Dacheng Tao
ViT
20
126
0
13 Mar 2023
Stabilizing Transformer Training by Preventing Attention Entropy Collapse
Shuangfei Zhai
Tatiana Likhomanenko
Etai Littwin
Dan Busbridge
Jason Ramapuram
Yizhe Zhang
Jiatao Gu
J. Susskind
AAML
38
64
0
11 Mar 2023
A Message Passing Perspective on Learning Dynamics of Contrastive Learning
Yifei Wang
Qi Zhang
Tianqi Du
Jiansheng Yang
Zhouchen Lin
Yisen Wang
SSL
24
18
0
08 Mar 2023
Are More Layers Beneficial to Graph Transformers?
Haiteng Zhao
Shuming Ma
Dongdong Zhang
Zhi-Hong Deng
Furu Wei
27
12
0
01 Mar 2023
Multi-Layer Attention-Based Explainability via Transformers for Tabular Data
Andrea Trevino Gavito
Diego Klabjan
J. Utke
LMTD
15
3
0
28 Feb 2023
A Brief Survey on the Approximation Theory for Sequence Modelling
Hao Jiang
Qianxiao Li
Zhong Li
Shida Wang
AI4TS
13
12
0
27 Feb 2023
Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation
Bobby He
James Martens
Guodong Zhang
Aleksandar Botev
Andy Brock
Samuel L. Smith
Yee Whye Teh
17
30
0
20 Feb 2023
Hyneter: Hybrid Network Transformer for Object Detection
Dong Chen
Duoqian Miao
Xuepeng Zhao
ViT
27
3
0
18 Feb 2023
A Theoretical Understanding of Shallow Vision Transformers: Learning, Generalization, and Sample Complexity
Hongkang Li
M. Wang
Sijia Liu
Pin-Yu Chen
ViT
MLT
35
56
0
12 Feb 2023