Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth
Yihe Dong, Jean-Baptiste Cordonnier, Andreas Loukas
5 March 2021 (arXiv:2103.03404)

Papers citing "Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth"

50 of 238 citing papers shown:
Always Skip Attention
Yiping Ji, Hemanth Saratchandran, Peyman Moghaddam, Simon Lucey
04 May 2025

LENSLLM: Unveiling Fine-Tuning Dynamics for LLM Selection
Xinyue Zeng, Haohui Wang, Junhong Lin, Jun Wu, Tyler Cody, Dawei Zhou
01 May 2025

Revisiting Transformers through the Lens of Low Entropy and Dynamic Sparsity
Ruifeng Ren, Yong Liu
26 Apr 2025

MOOSComp: Improving Lightweight Long-Context Compressor via Mitigating Over-Smoothing and Incorporating Outlier Scores
Fengwei Zhou, Jiafei Song, Wenjin Jason Li, Gengjian Xue, Zhikang Zhao, Yichao Lu, Bailin Na
23 Apr 2025

Quantum Doubly Stochastic Transformers
Jannis Born, Filip Skogh, Kahn Rhrissorrakrai, Filippo Utro, Nico Wagner, Aleksandros Sobczyk
22 Apr 2025

Transformers Can Overcome the Curse of Dimensionality: A Theoretical Study from an Approximation Perspective
Yuling Jiao, Yanming Lai, Yang Wang, Bokai Yan
18 Apr 2025

Defending Against Frequency-Based Attacks with Diffusion Models
Fatemeh Amerehi, Patrick Healy
AAML
15 Apr 2025

DyDiT++: Dynamic Diffusion Transformers for Efficient Visual Generation
Wangbo Zhao, Yizeng Han, Jiasheng Tang, Kai Wang, Hao Luo, Yibing Song, Gao Huang, Fan Wang, Yang You
09 Apr 2025

Fourier Feature Attribution: A New Efficiency Attribution Method
Zechen Liu, Feiyang Zhang, Wei Song, X. Li, Wei Wei
FAtt
02 Apr 2025

Filtering with Time-frequency Analysis: An Adaptive and Lightweight Model for Sequential Recommender Systems Based on Discrete Wavelet Transform
Sheng Lu, Mingxi Ge, Jiuyi Zhang, Wanli Zhu, Guanjin Li, Fangming Gu
AI4TS
30 Mar 2025

Coeff-Tuning: A Graph Filter Subspace View for Tuning Attention-Based Large Models
Zichen Miao, Wei Chen, Qiang Qiu
24 Mar 2025

Temporal Action Detection Model Compression by Progressive Block Drop
Xiaoyong Chen, Yong Guo, Jiaming Liang, Sitong Zhuang, Runhao Zeng, Xiping Hu
21 Mar 2025

Towards Understanding Multi-Round Large Language Model Reasoning: Approximability, Learnability and Generalizability
Chenhui Xu, Dancheng Liu, Jiajie Li, Amir Nassereldine, Zhaohui Li, Jinjun Xiong
LRM
05 Mar 2025

Transformer Meets Twicing: Harnessing Unattended Residual Information
Laziz U. Abdullaev, Tan M. Nguyen
02 Mar 2025

Position: Graph Learning Will Lose Relevance Due To Poor Benchmarks
Maya Bechler-Speicher, Ben Finkelshtein, Fabrizio Frasca, Luis Muller, Jan Tonshoff, ..., Michael M. Bronstein, Mathias Niepert, Bryan Perozzi, Mikhail Galkin, Christopher Morris
OOD
21 Feb 2025

Hyperspherical Energy Transformer with Recurrent Depth
Yunzhe Hu, Difan Zou, Dong Xu
17 Feb 2025

Pre-train and Fine-tune: Recommenders as Large Models
Zhenhao Jiang, C. L. P. Chen, Hao Feng, Yu Yang, Jin Liu, Jie Zhang, Jia Jia, Ning Hu
24 Jan 2025

Approximation Rate of the Transformer Architecture for Sequence Modeling
Hao Jiang, Qianxiao Li
03 Jan 2025

Hadamard Attention Recurrent Transformer: A Strong Baseline for Stereo Matching Transformer
Ziyang Chen, Yongjun Zhang, Wenting Li, Bingshu Wang, Yabo Wu, Yong Zhao, C. L. P. Chen
02 Jan 2025

PointVoxelFormer -- Reviving point cloud networks for 3D medical imaging
Mattias Paul Heinrich
3DPC
23 Dec 2024

Enhancing Multi-Text Long Video Generation Consistency without Tuning: Time-Frequency Analysis, Prompt Alignment, and Theory
Xingyao Li, Fengzhuo Zhang, Jiachun Pan, Yunlong Hou, Vincent Y. F. Tan, Zhuoran Yang
DiffM, VGen
23 Dec 2024

Content-aware Balanced Spectrum Encoding in Masked Modeling for Time Series Classification
Yudong Han, Haocong Wang, Yupeng Hu, Yongshun Gong, Xuemeng Song, Weili Guan
AI4TS
17 Dec 2024

AsymRnR: Video Diffusion Transformers Acceleration with Asymmetric Reduction and Restoration
Wenhao Sun, Rong-Cheng Tu, Jingyi Liao, Zhao Jin, Dacheng Tao
VGen
16 Dec 2024

The Asymptotic Behavior of Attention in Transformers
Álvaro Rodríguez Abella, João Pedro Silvestre, Paulo Tabuada
03 Dec 2024

Enhancing Parameter-Efficient Fine-Tuning of Vision Transformers through Frequency-Based Adaptation
S. Ly, Hien Nguyen
28 Nov 2024

Layer Pruning with Consensus: A Triple-Win Solution
Leandro Giusti Mugnaini, Carolina Tavares Duarte, Anna H. Reali Costa, Artur Jordao
21 Nov 2024

A Theory for Compressibility of Graph Transformers for Transductive Learning
Hamed Shirzad, Honghao Lin, A. Velingker, B. Venkatachalam, David P. Woodruff, Danica J. Sutherland
20 Nov 2024

Selective Attention: Enhancing Transformer through Principled Context Control
Xuechen Zhang, Xiangyu Chang, Mingchen Li, A. Roy-Chowdhury, J. Chen, Samet Oymak
19 Nov 2024

Clustering in Causal Attention Masking
Nikita Karagodin, Yury Polyanskiy, Philippe Rigollet
07 Nov 2024

Rethinking Decoders for Transformer-based Semantic Segmentation: A Compression Perspective
Qishuai Wen, Chun-Guang Li
ViT
05 Nov 2024

Activating Self-Attention for Multi-Scene Absolute Pose Regression
Miso Lee, Jihwan Kim, Jae-Pil Heo
ViT
03 Nov 2024

RAM: Replace Attention with MLP for Efficient Multivariate Time Series Forecasting
Suhan Guo, Jiahong Deng, Yi Wei, Hui Dou, F. Shen, Jian Zhao
AI4TS
31 Oct 2024

LSEAttention is All You Need for Time Series Forecasting
Dizhen Liang
AI4TS
31 Oct 2024

Preserving Pre-trained Representation Space: On Effectiveness of Prefix-tuning for Large Multi-modal Models
Donghoon Kim, Gusang Lee, Kyuhong Shim, B. Shim
29 Oct 2024

Provable optimal transport with transformers: The essence of depth and prompt engineering
Hadi Daneshmand
OT
25 Oct 2024

DiP-GO: A Diffusion Pruner via Few-step Gradient Optimization
Haowei Zhu, Dehua Tang, Ji Liu, Mingjie Lu, Jintu Zheng, ..., Spandan Tiwari, Ashish Sirasao, Jun-Hai Yong, Bin Wang, E. Barsoum
DiffM
22 Oct 2024

Generalized Probabilistic Attention Mechanism in Transformers
DongNyeong Heo, Heeyoul Choi
21 Oct 2024

Towards Better Multi-head Attention via Channel-wise Sample Permutation
Shen Yuan, Hongteng Xu
14 Oct 2024

Lambda-Skip Connections: the architectural component that prevents Rank Collapse
Federico Arangath Joseph, Jerome Sieber, M. Zeilinger, Carmen Amo Alonso
14 Oct 2024

t-READi: Transformer-Powered Robust and Efficient Multimodal Inference for Autonomous Driving
Pengfei Hu, Yuhang Qian, Tianyue Zheng, Ang Li, Zhe Chen, Yue Gao, Xiuzhen Cheng, Jun-Jie Luo
13 Oct 2024

Pretraining Graph Transformers with Atom-in-a-Molecule Quantum Properties for Improved ADMET Modeling
Alessio Fallani, Ramil I. Nugmanov, Jose A. Arjona-Medina, Jörg Kurt Wegner, Alexandre Tkatchenko, Kostiantyn Chernichenko
MedIm, AI4CE
10 Oct 2024

LaMP: Language-Motion Pretraining for Motion Generation, Retrieval, and Captioning
Zhe Li, Weihao Yuan, Yisheng He, Lingteng Qiu, Shenhao Zhu, Xiaodong Gu, Weichao Shen, Yuan Dong, Zilong Dong, Laurence T. Yang
09 Oct 2024

Does RoBERTa Perform Better than BERT in Continual Learning: An Attention Sink Perspective
Xueying Bai, Yifan Sun, Niranjan Balasubramanian
CLL
08 Oct 2024

Dynamic Diffusion Transformer
Wangbo Zhao, Yizeng Han, Jiasheng Tang, Kai Wang, Yibing Song, Gao Huang, Fan Wang, Yang You
04 Oct 2024

MacDiff: Unified Skeleton Modeling with Masked Conditional Diffusion
Lehong Wu, Lilang Lin, Jiahang Zhang, Y. Ma, Jiaying Liu
DiffM
16 Sep 2024

Increasing transformer token length with a Maximum Entropy Principle Method
R. I. Cukier
17 Aug 2024

Attention Is All You Need But You Don't Need All Of It For Inference of Large Language Models
Georgy Tyukin, G. Dovonon, Jean Kaddour, Pasquale Minervini
LRM
22 Jul 2024

Beyond Prompt Learning: Continual Adapter for Efficient Rehearsal-Free Continual Learning
Xinyuan Gao, Songlin Dong, Yuhang He, Qiang Wang, Yihong Gong
CLL
14 Jul 2024

Adaptive Parametric Activation
Konstantinos Panagiotis Alexandridis, Jiankang Deng, Anh Nguyen, Shan Luo
11 Jul 2024

Reasoning in Large Language Models: A Geometric Perspective
Romain Cosentino, Sarath Shekkizhar
LRM
02 Jul 2024