Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth

5 March 2021 · arXiv:2103.03404
Yihe Dong, Jean-Baptiste Cordonnier, Andreas Loukas

Papers citing "Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth"

Showing 50 of 238 citing papers.
• Representation Deficiency in Masked Language Modeling (04 Feb 2023)
  Yu Meng, Jitin Krishnan, Sinong Wang, Qifan Wang, Yuning Mao, Han Fang, Marjan Ghazvininejad, Jiawei Han, Luke Zettlemoyer

• When Layers Play the Lottery, all Tickets Win at Initialization (25 Jan 2023)
  Artur Jordão, George Correa de Araujo, H. Maia, Hélio Pedrini

• A Close Look at Spatial Modeling: From Attention to Convolution (23 Dec 2022) [ViT, 3DPC]
  Xu Ma, Huan Wang, Can Qin, Kunpeng Li, Xing Zhao, Jie Fu, Yun Fu

• EIT: Enhanced Interactive Transformer (20 Dec 2022)
  Tong Zheng, Bei Li, Huiwen Bao, Tong Xiao, Jingbo Zhu

• Non-equispaced Fourier Neural Solvers for PDEs (09 Dec 2022)
  Haitao Lin, Lirong Wu, Yongjie Xu, Yufei Huang, Siyuan Li, Guojiang Zhao, Stan Z. Li

• A K-variate Time Series Is Worth K Words: Evolution of the Vanilla Transformer Architecture for Long-term Multivariate Time Series Forecasting (06 Dec 2022) [AI4TS]
  Zanwei Zhou, Rui-Ming Zhong, Chen Yang, Yan Wang, Xiaokang Yang, Wei Shen

• Spatial-Spectral Transformer for Hyperspectral Image Denoising (25 Nov 2022)
  Miaoyu Li, Ying Fu, Yulun Zhang

• Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers (21 Nov 2022)
  Sifan Long, Z. Zhao, Jimin Pi, Sheng-sheng Wang, Jingdong Wang

• Convexifying Transformers: Improving optimization and understanding of transformer networks (20 Nov 2022) [MLT]
  Tolga Ergen, Behnam Neyshabur, Harsh Mehta

• Finding Skill Neurons in Pre-trained Transformer-based Language Models (14 Nov 2022) [MILM, MoE]
  Xiaozhi Wang, Kaiyue Wen, Zhengyan Zhang, Lei Hou, Zhiyuan Liu, Juanzi Li

• AD-DROP: Attribution-Driven Dropout for Robust Language Model Fine-Tuning (12 Oct 2022)
  Tao Yang, Jinghao Deng, Xiaojun Quan, Qifan Wang, Shaoliang Nie

• SML: Enhance the Network Smoothness with Skip Meta Logit for CTR Prediction (09 Oct 2022)
  Wenlong Deng, Lang Lang, Z. Liu, B. Liu

• In-context Learning and Induction Heads (24 Sep 2022)
  Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova Dassarma, ..., Tom B. Brown, Jack Clark, Jared Kaplan, Sam McCandlish, C. Olah

• On The Computational Complexity of Self-Attention (11 Sep 2022)
  Feyza Duman Keles, Pruthuvi Maheshakya Wijewardena, C. Hegde

• Pre-Training a Graph Recurrent Network for Language Representation (08 Sep 2022) [GNN]
  Yile Wang, Linyi Yang, Zhiyang Teng, M. Zhou, Yue Zhang

• Addressing Token Uniformity in Transformers via Singular Value Transformation (24 Aug 2022)
  Hanqi Yan, Lin Gui, Wenjie Li, Yulan He

• Exploring Generative Neural Temporal Point Process (03 Aug 2022) [DiffM]
  Haitao Lin, Lirong Wu, Guojiang Zhao, Pai Liu, Stan Z. Li

• Neural Knowledge Bank for Pretrained Transformers (31 Jul 2022) [KELM]
  Damai Dai, Wen-Jie Jiang, Qingxiu Dong, Yajuan Lyu, Qiaoqiao She, Zhifang Sui

• EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm (19 Jun 2022) [ViT]
  Jiangning Zhang, Xiangtai Li, Yabiao Wang, Chengjie Wang, Yibo Yang, Yong Liu, Dacheng Tao

• Rank Diminishing in Deep Neural Networks (13 Jun 2022)
  Ruili Feng, Kecheng Zheng, Yukun Huang, Deli Zhao, Michael I. Jordan, Zhengjun Zha

• Signal Propagation in Transformers: Theoretical Perspectives and the Role of Rank Collapse (07 Jun 2022)
  Lorenzo Noci, Sotiris Anagnostidis, Luca Biggio, Antonio Orvieto, Sidak Pal Singh, Aurélien Lucchi

• Vision GNN: An Image is Worth Graph of Nodes (01 Jun 2022) [GNN, 3DH]
  Kai Han, Yunhe Wang, Jianyuan Guo, Yehui Tang, Enhua Wu

• Universal Deep GNNs: Rethinking Residual Connection in GNNs from a Path Decomposition Perspective for Preventing the Over-smoothing (30 May 2022)
  Jie Chen, Weiqi Liu, Zhizhong Huang, Junbin Gao, Junping Zhang, Jian Pu

• Learning Locality and Isotropy in Dialogue Modeling (29 May 2022)
  Han Wu, Hao Hao Tan, Mingjie Zhan, Gangming Zhao, Shaoqing Lu, Ding Liang, Linqi Song

• AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition (26 May 2022)
  Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, Ping Luo

• Your Transformer May Not be as Powerful as You Expect (26 May 2022)
  Shengjie Luo, Shanda Li, Shuxin Zheng, Tie-Yan Liu, Liwei Wang, Di He

• On Bridging the Gap between Mean Field and Finite Width in Deep Random Neural Networks with Batch Normalization (25 May 2022) [AI4CE]
  Amir Joudaki, Hadi Daneshmand, Francis R. Bach

• Outliers Dimensions that Disrupt Transformers Are Driven by Frequency (23 May 2022)
  Giovanni Puccetti, Anna Rogers, Aleksandr Drozd, F. Dell’Orletta

• A Study on Transformer Configuration and Training Objective (21 May 2022)
  Fuzhao Xue, Jianghai Chen, Aixin Sun, Xiaozhe Ren, Zangwei Zheng, Xiaoxin He, Yongming Chen, Xin Jiang, Yang You

• Exploring Extreme Parameter Compression for Pre-trained Language Models (20 May 2022)
  Yuxin Ren, Benyou Wang, Lifeng Shang, Xin Jiang, Qun Liu

• Causal Transformer for Estimating Counterfactual Outcomes (14 Apr 2022) [CML]
  Valentyn Melnychuk, Dennis Frauen, Stefan Feuerriegel

• Exploiting Temporal Relations on Radar Perception for Autonomous Driving (03 Apr 2022)
  Peizhao Li, Puzuo Wang, K. Berntorp, Hongfu Liu

• Training-free Transformer Architecture Search (23 Mar 2022) [ViT]
  Qinqin Zhou, Kekai Sheng, Xiawu Zheng, Ke Li, Xing Sun, Yonghong Tian, Jie Chen, Rongrong Ji

• Unified Visual Transformer Compression (15 Mar 2022) [ViT]
  Shixing Yu, Tianlong Chen, Jiayi Shen, Huan Yuan, Jianchao Tan, Sen Yang, Ji Liu, Zhangyang Wang

• Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs (13 Mar 2022) [VLM]
  Xiaohan Ding, X. Zhang, Yi Zhou, Jungong Han, Guiguang Ding, Jian-jun Sun

• The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy (12 Mar 2022) [ViT]
  Tianlong Chen, Zhenyu (Allen) Zhang, Yu Cheng, Ahmed Hassan Awadallah, Zhangyang Wang

• Block-Recurrent Transformers (11 Mar 2022)
  DeLesley S. Hutchins, Imanol Schlag, Yuhuai Wu, Ethan Dyer, Behnam Neyshabur

• Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain Analysis: From Theory to Practice (09 Mar 2022) [ViT]
  Peihao Wang, Wenqing Zheng, Tianlong Chen, Zhangyang Wang

• Sky Computing: Accelerating Geo-distributed Computing in Federated Learning (24 Feb 2022) [FedML]
  Jie Zhu, Shenggui Li, Yang You

• Revisiting Over-smoothing in BERT from the Perspective of Graph (17 Feb 2022)
  Han Shi, Jiahui Gao, Hang Xu, Xiaodan Liang, Zhenguo Li, Lingpeng Kong, Stephen M. S. Lee, James T. Kwok

• The Quarks of Attention (15 Feb 2022) [GNN]
  Pierre Baldi, Roman Vershynin

• On the Origins of the Block Structure Phenomenon in Neural Network Representations (15 Feb 2022)
  Thao Nguyen, M. Raghu, Simon Kornblith

• Video Transformers: A Survey (16 Jan 2022) [ViT]
  Javier Selva, A. S. Johansen, Sergio Escalera, Kamal Nasrollahi, T. Moeslund, Albert Clapés

• A Survey of Controllable Text Generation using Transformer-based Pre-trained Language Models (14 Jan 2022)
  Hanqing Zhang, Haolin Song, Shaoyu Li, Ming Zhou, Dawei Song

• Miti-DETR: Object Detection based on Transformers with Mitigatory Self-Attention Convergence (26 Dec 2021) [ViT]
  Wenchi Ma, Tianxiao Zhang, Guanghui Wang

• Sketching as a Tool for Understanding and Accelerating Self-attention for Long Sequences (10 Dec 2021)
  Yifan Chen, Qi Zeng, Dilek Z. Hakkani-Tür, Di Jin, Heng Ji, Yun Yang

• Dynamic Graph Learning-Neural Network for Multivariate Time Series Modeling (06 Dec 2021) [AI4TS]
  Zhuoling Li, Gaowei Zhang, Lingyu Xu, Jie Yu

• Graph Conditioned Sparse-Attention for Improved Source Code Understanding (01 Dec 2021)
  Junyan Cheng, Iordanis Fostiropoulos, Barry W. Boehm

• Pruning Self-attentions into Convolutional Layers in Single Path (23 Nov 2021) [ViT]
  Haoyu He, Jianfei Cai, Jing Liu, Zizheng Pan, Jing Zhang, Dacheng Tao, Bohan Zhuang

• MetaFormer Is Actually What You Need for Vision (22 Nov 2021)
  Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, Shuicheng Yan