Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth

Yihe Dong, Jean-Baptiste Cordonnier, Andreas Loukas
5 March 2021 · arXiv:2103.03404 (PDF, HTML)
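
The paper's central claim, that a stack of self-attention layers without skip connections or MLPs converges towards a rank-1 matrix at a doubly exponential rate in depth, can be observed numerically. Below is a minimal sketch (an illustration under assumed random Gaussian weights and single-head attention, not the authors' code; the helper names `softmax` and `pure_attention` are my own) that stacks residual-free attention layers and tracks the ratio of the second to the largest singular value of the token matrix. Under these assumptions the ratio typically shrinks rapidly as layers are added, which is the rank-collapse behaviour the title describes.

```python
# Minimal sketch (assumptions: random Gaussian weights, single-head attention,
# no residual connections, no MLPs) of rank collapse in pure attention stacks.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, depth = 32, 64, 12   # sequence length, width, number of layers

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # stabilise before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def pure_attention(X, Wq, Wk, Wv):
    # softmax(X Wq (X Wk)^T / sqrt(d)) X Wv, with no residual path
    logits = (X @ Wq) @ (X @ Wk).T / np.sqrt(d_model)
    return softmax(logits) @ X @ Wv

X = rng.standard_normal((n_tokens, d_model))
for layer in range(1, depth + 1):
    Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                  for _ in range(3))
    X = pure_attention(X, Wq, Wk, Wv)
    s = np.linalg.svd(X, compute_uv=False)
    # ratio of second to largest singular value: near 0 means (numerically) rank 1
    print(f"layer {layer:2d}  sigma2/sigma1 = {s[1] / s[0]:.2e}")
```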

Papers citing "Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth" (50 of 238 shown)
• Characterizing Large Language Model Geometry Helps Solve Toxicity Detection and Generation (04 Dec 2023). Randall Balestriero, Romain Cosentino, Sarath Shekkizhar.
• Bootstrapping Interactive Image-Text Alignment for Remote Sensing Image Captioning (02 Dec 2023). Cong Yang, Zuchao Li, Lefei Zhang.
• Pointer Networks Trained Better via Evolutionary Algorithms (02 Dec 2023). Muyao Zhong, Shengcai Liu, Bingdong Li, Haobo Fu, Ke Tang, Peng Yang.
• Mitigating Over-smoothing in Transformers via Regularized Nonlocal Functionals (01 Dec 2023). Tam Nguyen, Tan-Minh Nguyen, Richard G. Baraniuk.
• SCHEME: Scalable Channel Mixer for Vision Transformers (01 Dec 2023). Deepak Sridhar, Yunsheng Li, Nuno Vasconcelos.
• Probabilistic Transformer: A Probabilistic Dependency Model for Contextual Word Representation (26 Nov 2023). Haoyi Wu, Kewei Tu.
• p-Laplacian Transformer (06 Nov 2023). Tuan Nguyen, Tam Nguyen, Vinh-Tiep Nguyen, Tan-Minh Nguyen.
• Simplifying Transformer Blocks (03 Nov 2023). Bobby He, Thomas Hofmann.
• Sliceformer: Make Multi-head Attention as Simple as Sorting in Discriminative Tasks (26 Oct 2023). Shen Yuan, Hongteng Xu.
• Circuit as Set of Points (26 Oct 2023). Jialv Zou, Xinggang Wang, Jiahao Guo, Wenyu Liu, Qian Zhang, Chang Huang. Tags: GNN, 3DV, 3DPC.
• Unraveling Feature Extraction Mechanisms in Neural Networks (25 Oct 2023). Xiaobing Sun, Jiaxi Li, Wei Lu.
• PartialFormer: Modeling Part Instead of Whole for Machine Translation (23 Oct 2023). Tong Zheng, Bei Li, Huiwen Bao, Jiale Wang, Weiqiao Shan, Tong Xiao, Jingbo Zhu. Tags: MoE, AI4CE.
• Eureka-Moments in Transformers: Multi-Step Tasks Reveal Softmax Induced Optimization Problems (19 Oct 2023). David T. Hoffmann, Simon Schrodi, Jelena Bratulić, Nadine Behrmann, Volker Fischer, Thomas Brox.
• On the Optimization and Generalization of Multi-head Attention (19 Oct 2023). Puneesh Deora, Rouzbeh Ghaderi, Hossein Taheri, Christos Thrampoulidis. Tags: MLT.
• Language Models are Universal Embedders (12 Oct 2023). Xin Zhang, Zehan Li, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Meishan Zhang, Min Zhang. Tags: KELM, ELM.
• Towards Training Without Depth Limits: Batch Normalization Without Gradient Explosion (03 Oct 2023). Alexandru Meterez, Amir Joudaki, Francesco Orabona, Alexander Immer, Gunnar Rätsch, Hadi Daneshmand.
• Transformers are efficient hierarchical chemical graph learners (02 Oct 2023). Zihan Pengmei, Zimu Li, Chih-chan Tien, Risi Kondor, Aaron R Dinner. Tags: GNN.
• Symmetry Induces Structure and Constraint of Learning (29 Sep 2023). Liu Ziyin.
• RBFormer: Improve Adversarial Robustness of Transformer by Robust Bias (23 Sep 2023). Hao Cheng, Jinhao Duan, Hui Li, Lyutianyang Zhang, Jiahang Cao, Ping Wang, Jize Zhang, Kaidi Xu, Renjing Xu. Tags: AAML.
• Attention-Only Transformers and Implementing MLPs with Attention Heads (15 Sep 2023). R. Huben, Valerie Morris.
• Temporal Action Localization with Enhanced Instant Discriminability (11 Sep 2023). Ding Shi, Qiong Cao, Yujie Zhong, Shan An, Jian Cheng, Haogang Zhu, Dacheng Tao.
• Transformers as Support Vector Machines (31 Aug 2023). Davoud Ataee Tarzanagh, Yingcong Li, Christos Thrampoulidis, Samet Oymak.
• Rank Collapse Causes Over-Smoothing and Over-Correlation in Graph Neural Networks (31 Aug 2023). Andreas Roth, Thomas Liebig.
• Self-Feedback DETR for Temporal Action Detection (21 Aug 2023). Jihwan Kim, Miso Lee, Jae-Pil Heo.
• The Costly Dilemma: Generalization, Evaluation and Cost-Optimal Deployment of Large Language Models (15 Aug 2023). Abi Aryan, Aakash Kumar Nain, Andrew McMahon, Lucas Augusto Meyer, Harpreet Sahota.
• SCSC: Spatial Cross-scale Convolution Module to Strengthen both CNNs and Transformers (14 Aug 2023). Xijun Wang, Xiaojie Chu, Chunrui Han, Xiangyu Zhang. Tags: ViT.
• LEST: Large-scale LiDAR Semantic Segmentation with Transformer (14 Jul 2023). Chuanyu Luo, Nuo Cheng, Sikun Ma, Han Li, Xiaohan Li, Shengguang Lei, Pu Li. Tags: 3DPC, ViT.
• The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit (30 Jun 2023). Lorenzo Noci, Chuning Li, Mufan Bill Li, Bobby He, Thomas Hofmann, Chris J. Maddison, Daniel M. Roy.
• A generic self-supervised learning (SSL) framework for representation learning from spectra-spatial feature of unlabeled remote sensing imagery (27 Jun 2023). Xin Zhang, Liangxiu Han. Tags: SSL.
• Max-Margin Token Selection in Attention Mechanism (23 Jun 2023). Davoud Ataee Tarzanagh, Yingcong Li, Xuechen Zhang, Samet Oymak.
• On the Role of Attention in Prompt-tuning (06 Jun 2023). Samet Oymak, A. S. Rawat, Mahdi Soltanolkotabi, Christos Thrampoulidis. Tags: MLT, LRM.
• Towards Deep Attention in Graph Neural Networks: Problems and Remedies (04 Jun 2023). Soo Yong Lee, Fanchen Bu, Jaemin Yoo, Kijung Shin. Tags: GNN.
• Memorization Capacity of Multi-Head Attention in Transformers (03 Jun 2023). Sadegh Mahdavi, Renjie Liao, Christos Thrampoulidis.
• Universality and Limitations of Prompt Tuning (30 May 2023). Yihan Wang, Jatin Chauhan, Wei Wang, Cho-Jui Hsieh.
• On the impact of activation and normalization in obtaining isometric embeddings at initialization (28 May 2023). Amir Joudaki, Hadi Daneshmand, Francis R. Bach.
• Scalable Transformer for PDE Surrogate Modeling (27 May 2023). Zijie Li, Dule Shu, A. Farimani.
• Investigating the Role of Feed-Forward Networks in Transformers Using Parallel Attention and Feed-Forward Net Design (22 May 2023). Shashank Sonkar, Richard G. Baraniuk.
• The emergence of clusters in self-attention dynamics (09 May 2023). Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, Philippe Rigollet.
• Medical SAM Adapter: Adapting Segment Anything Model for Medical Image Segmentation (25 Apr 2023). Junde Wu, Rao Fu, Yuanpei Liu, Huihui Fang, Zhao-Yang Wang, Yanwu Xu, Yueming Jin. Tags: VLM, MedIm.
• Causal Decision Transformer for Recommender Systems via Offline Reinforcement Learning (17 Apr 2023). Siyu Wang, Xiaocong Chen, Dietmar Jannach, Lina Yao. Tags: CML, OffRL.
• Decoder-Only or Encoder-Decoder? Interpreting Language Model as a Regularized Encoder-Decoder (08 Apr 2023). Z. Fu, W. Lam, Qian Yu, Anthony Man-Cho So, Shengding Hu, Zhiyuan Liu, Nigel Collier. Tags: AuLLM.
• TriDet: Temporal Action Detection with Relative Boundary Modeling (13 Mar 2023). Ding Shi, Yujie Zhong, Qiong Cao, Lin Ma, Jia Li, Dacheng Tao. Tags: ViT.
• Stabilizing Transformer Training by Preventing Attention Entropy Collapse (11 Mar 2023). Shuangfei Zhai, Tatiana Likhomanenko, Etai Littwin, Dan Busbridge, Jason Ramapuram, Yizhe Zhang, Jiatao Gu, J. Susskind. Tags: AAML.
• A Message Passing Perspective on Learning Dynamics of Contrastive Learning (08 Mar 2023). Yifei Wang, Qi Zhang, Tianqi Du, Jiansheng Yang, Zhouchen Lin, Yisen Wang. Tags: SSL.
• Are More Layers Beneficial to Graph Transformers? (01 Mar 2023). Haiteng Zhao, Shuming Ma, Dongdong Zhang, Zhi-Hong Deng, Furu Wei.
• Multi-Layer Attention-Based Explainability via Transformers for Tabular Data (28 Feb 2023). Andrea Trevino Gavito, Diego Klabjan, J. Utke. Tags: LMTD.
• A Brief Survey on the Approximation Theory for Sequence Modelling (27 Feb 2023). Hao Jiang, Qianxiao Li, Zhong Li, Shida Wang. Tags: AI4TS.
• Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation (20 Feb 2023). Bobby He, James Martens, Guodong Zhang, Aleksandar Botev, Andy Brock, Samuel L. Smith, Yee Whye Teh.
• Hyneter: Hybrid Network Transformer for Object Detection (18 Feb 2023). Dong Chen, Duoqian Miao, Xuepeng Zhao. Tags: ViT.
• A Theoretical Understanding of Shallow Vision Transformers: Learning, Generalization, and Sample Complexity (12 Feb 2023). Hongkang Li, M. Wang, Sijia Liu, Pin-Yu Chen. Tags: ViT, MLT.