Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth
Yihe Dong, Jean-Baptiste Cordonnier, Andreas Loukas
arXiv:2103.03404, 5 March 2021
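
The paper's headline claim is that stacking self-attention alone, with no skip connections and no MLPs, collapses the token representations toward a rank-one matrix, with the residual of the best rank-one approximation shrinking doubly exponentially with depth. The sketch below is a hypothetical NumPy illustration of that effect, not the authors' code; the token count, width, depth, and the 0.02 weight scale are arbitrary choices made only so the collapse is visible within a few layers.

```python
# Minimal sketch (assumed setup, not the authors' code): stacking pure softmax
# self-attention layers, with no skip connections and no MLPs, drives the token
# matrix toward rank one.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, depth = 32, 64, 8
weight_std = 0.02  # small random init; an arbitrary choice for this demo

def self_attention(X, Wq, Wk, Wv):
    """Single-head softmax self-attention on a token matrix X (n_tokens x d_model)."""
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d_model)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)              # row-stochastic attention weights
    return A @ (X @ Wv)

def rank1_residual(X):
    """Relative energy of X outside its best rank-1 approximation (0 = exactly rank 1)."""
    s = np.linalg.svd(X, compute_uv=False)
    return np.sqrt((s[1:] ** 2).sum() / (s ** 2).sum())

X = rng.standard_normal((n_tokens, d_model))
print(f"input  : rank-1 residual = {rank1_residual(X):.3e}")
for layer in range(1, depth + 1):
    Wq, Wk, Wv = (weight_std * rng.standard_normal((d_model, d_model)) for _ in range(3))
    X = self_attention(X, Wq, Wk, Wv)
    print(f"layer {layer}: rank-1 residual = {rank1_residual(X):.3e}")
```

With these settings the printed rank-one residual starts near 1 for random inputs and drops by several orders of magnitude within a handful of layers; per the paper, skip connections and MLPs are what counteract this collapse in real Transformers.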

Papers citing "Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth"

Showing 50 of 238 citing papers.
• Encourage or Inhibit Monosemanticity? Revisit Monosemanticity from a Feature Decorrelation Perspective
  Hanqi Yan, Yanzheng Xiang, Guangyi Chen, Yifei Wang, Lin Gui, Yulan He (25 Jun 2024)
• Transformer Normalisation Layers and the Independence of Semantic Subspaces
  S. Menary, Samuel Kaski, Andre Freitas (25 Jun 2024)
• Unveiling the Hidden Structure of Self-Attention via Kernel Principal Component Analysis
  R. Teo, Tan M. Nguyen (19 Jun 2024)
• Residual Connections and Normalization Can Provably Prevent Oversmoothing in GNNs
  Michael Scholkemper, Xinyi Wu, Ali Jadbabaie, Michael T. Schaub (05 Jun 2024)
• TAIA: Large Language Models are Out-of-Distribution Data Learners
  Shuyang Jiang, Yusheng Liao, Ya-Qin Zhang, Yu Wang, Yanfeng Wang (30 May 2024)
• Understanding and Minimising Outlier Features in Neural Network Training
  Bobby He, Lorenzo Noci, Daniele Paliotta, Imanol Schlag, Thomas Hofmann (29 May 2024)
• On the Role of Attention Masks and LayerNorm in Transformers
  Xinyi Wu, A. Ajorlou, Yifei Wang, Stefanie Jegelka, Ali Jadbabaie (29 May 2024)
• Effective Layer Pruning Through Similarity Metric Perspective
  Ian Pons, Bruno Yamamoto, Anna H. Reali Costa, Artur Jordao (27 May 2024)
• Infinite Limits of Multi-head Transformer Dynamics
  Blake Bordelon, Hamza Tahir Chaudhry, C. Pehlevan (24 May 2024) [AI4CE]
• Stereo-Knowledge Distillation from dpMV to Dual Pixels for Light Field Video Reconstruction
  Aryan Garg, Raghav Mallampali, Akshat Joshi, Shrisudhan Govindarajan, Kaushik Mitra (20 May 2024)
• KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization
  Tianyi Zhang, Jonah Yi, Zhaozhuo Xu, Anshumali Shrivastava (07 May 2024) [MQ]
• EdgeFusion: On-Device Text-to-Image Generation
  Thibault Castells, Hyoung-Kyu Song, Tairen Piao, Shinkook Choi, Bo-Kyeong Kim, Hanyoung Yim, Changgwun Lee, Jae Gon Kim, Tae-Ho Kim (18 Apr 2024) [VLM]
• Adapting LLaMA Decoder to Vision Transformer
  Jiahao Wang, Wenqi Shao, Mengzhao Chen, Chengyue Wu, Yong Liu, Taiqiang Wu, Kaipeng Zhang, Songyang Zhang, Kai-xiang Chen, Ping Luo (10 Apr 2024) [MLLM]
• T-DEED: Temporal-Discriminability Enhancer Encoder-Decoder for Precise Event Spotting in Sports Videos
  Artur Xarles, Sergio Escalera, T. Moeslund, Albert Clapés (08 Apr 2024)
• Generative Retrieval as Multi-Vector Dense Retrieval
  Shiguang Wu, Wenda Wei, Mengqi Zhang, Zhumin Chen, Jun Ma, Zhaochun Ren, Maarten de Rijke, Pengjie Ren (31 Mar 2024) [3DV]
• Not All Attention is Needed: Parameter and Computation Efficient Transfer Learning for Multi-modal Large Language Models
  Qiong Wu, Weihao Ye, Yiyi Zhou, Xiaoshuai Sun, Rongrong Ji (22 Mar 2024) [MoE]
• TRELM: Towards Robust and Efficient Pre-training for Knowledge-Enhanced Language Models
  Junbing Yan, Chengyu Wang, Taolin Zhang, Xiao-Mei He, Junyuan Huang, Longtao Huang, Hui Xue, Wei Zhang (17 Mar 2024) [VLM, KELM]
• Mechanics of Next Token Prediction with Self-Attention
  Yingcong Li, Yixiao Huang, M. E. Ildiz, A. S. Rawat, Samet Oymak (12 Mar 2024)
• Attacking Transformers with Feature Diversity Adversarial Perturbation
  Chenxing Gao, Hang Zhou, Junqing Yu, Yuteng Ye, Jiale Cai, Junle Wang, Wei Yang (10 Mar 2024) [AAML]
• Geometric Dynamics of Signal Propagation Predict Trainability of Transformers
  Aditya Cowsik, Tamra M. Nebabu, Xiao-Liang Qi, Surya Ganguli (05 Mar 2024)
• A Comprehensive Survey on Process-Oriented Automatic Text Summarization with Exploration of LLM-Based Methods
  Hanlei Jin, Yang Zhang, Dan Meng, Jun Wang, Jinghua Tan (05 Mar 2024)
• NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free Attention
  Tianyi Zhang, Jonah Yi, Bowen Yao, Zhaozhuo Xu, Anshumali Shrivastava (02 Mar 2024) [MQ]
• Why Transformers Need Adam: A Hessian Perspective
  Yushun Zhang, Congliang Chen, Tian Ding, Ziniu Li, Ruoyu Sun, Zhimin Luo (26 Feb 2024)
• Data-free Weight Compress and Denoise for Large Language Models
  Runyu Peng, Yunhua Zhou, Qipeng Guo, Yang Gao, Hang Yan, Xipeng Qiu, Dahua Lin (26 Feb 2024)
• PIDformer: Transformer Meets Control Theory
  Tam Nguyen, César A. Uribe, Tan-Minh Nguyen, Richard G. Baraniuk (25 Feb 2024)
• The Impact of LoRA on the Emergence of Clusters in Transformers
  Hugo Koubbi, Matthieu Boussard, Louis Hernandez (23 Feb 2024)
• Prompting a Pretrained Transformer Can Be a Universal Approximator
  Aleksandar Petrov, Philip H. S. Torr, Adel Bibi (22 Feb 2024)
• SAMformer: Unlocking the Potential of Transformers in Time Series Forecasting with Sharpness-Aware Minimization and Channel-Wise Attention
  Romain Ilbert, Ambroise Odonnat, Vasilii Feofanov, Aladin Virmaux, Giuseppe Paolo, Themis Palpanas, I. Redko (15 Feb 2024) [AI4TS]
• Bidirectional Generative Pre-training for Improving Time Series Representation Learning
  Ziyang Song, Qincheng Lu, He Zhu, Yue Li (14 Feb 2024) [AI4TS]
• Efficient Stagewise Pretraining via Progressive Subnetworks
  Abhishek Panigrahi, Nikunj Saunshi, Kaifeng Lyu, Sobhan Miryoosefi, Sashank J. Reddi, Satyen Kale, Sanjiv Kumar (08 Feb 2024)
• Attention as Robust Representation for Time Series Forecasting
  Peisong Niu, Tian Zhou, Xue Wang, Liang Sun, Rong Jin (08 Feb 2024) [AI4TS]
• Implicit Bias and Fast Convergence Rates for Self-attention
  Bhavya Vasudeva, Puneesh Deora, Christos Thrampoulidis (08 Feb 2024)
• Breaking Symmetry When Training Transformers
  Chunsheng Zuo, Michael Guerzhoy (06 Feb 2024)
• Shortened LLaMA: Depth Pruning for Large Language Models with Comparison of Retraining Methods
  Bo-Kyeong Kim, Geonmin Kim, Tae-Ho Kim, Thibault Castells, Shinkook Choi, Junho Shin, Hyoung-Kyu Song (05 Feb 2024)
• Self-attention Networks Localize When QK-eigenspectrum Concentrates
  Han Bao, Ryuichiro Hataya, Ryo Karakida (03 Feb 2024)
• LIR: A Lightweight Baseline for Image Restoration
  Dongqi Fan, Ting Yue, Xin Zhao, Renjing Xu, Liang Chang (02 Feb 2024)
• MVSFormer++: Revealing the Devil in Transformer's Details for Multi-View Stereo
  Chenjie Cao, Xinlin Ren, Yanwei Fu (22 Jan 2024)
• AAT: Adapting Audio Transformer for Various Acoustics Recognition Tasks
  Yun Liang, Hai Lin, Shaojian Qiu, Yihang Zhang (19 Jan 2024)
• When Large Language Models Meet Evolutionary Algorithms: Potential Enhancements and Challenges
  Wang Chao, Jiaxuan Zhao, Licheng Jiao, Lingling Li, Fang Liu, Shuyuan Yang (19 Jan 2024)
• UPDP: A Unified Progressive Depth Pruner for CNN and Vision Transformer
  Ji Liu, Dehua Tang, Yuanxian Huang, Li Lyna Zhang, Xiaocheng Zeng, ..., Jinzhang Peng, Yu-Chiang Frank Wang, Fan Jiang, Lu Tian, Ashish Sirasao (12 Jan 2024) [ViT]
• Setting the Record Straight on Transformer Oversmoothing
  G. Dovonon, M. Bronstein, Matt J. Kusner (09 Jan 2024)
• PanGu-π: Enhancing Language Model Architectures via Nonlinearity Compensation
  Yunhe Wang, Hanting Chen, Yehui Tang, Tianyu Guo, Kai Han, ..., Qinghua Xu, Qun Liu, Jun Yao, Chao Xu, Dacheng Tao (27 Dec 2023)
• Generating and Reweighting Dense Contrastive Patterns for Unsupervised Anomaly Detection
  Songmin Dai, Yifan Wu, Xiaoqiang Li, Xiangyang Xue (26 Dec 2023)
• Pixel-to-Abundance Translation: Conditional Generative Adversarial Networks Based on Patch Transformer for Hyperspectral Unmixing
  Li Wang, Xiaohua Zhang, Longfei Li, Hong-yun Meng, Xianghai Cao (20 Dec 2023)
• A mathematical perspective on Transformers
  Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, Philippe Rigollet (17 Dec 2023) [EDL, AI4CE]
• An Attentive Inductive Bias for Sequential Recommendation beyond the Self-Attention
  Yehjin Shin, Jeongwhan Choi, Hyowon Wi, Noseong Park (16 Dec 2023)
• Auto-Prox: Training-Free Vision Transformer Architecture Search via Automatic Proxy Discovery
  Zimian Wei, Lujun Li, Peijie Dong, Zheng Hui, Anggeng Li, Menglong Lu, H. Pan, Zhiliang Tian, Dongsheng Li (14 Dec 2023) [ViT]
• Polynomial-based Self-Attention for Table Representation learning
  Jayoung Kim, Yehjin Shin, Jeongwhan Choi, Hyowon Wi, Noseong Park (12 Dec 2023) [LMTD]
• Why "classic" Transformers are shallow and how to make them go deep
  Yueyao Yu, Yin Zhang (11 Dec 2023) [ViT]
• Graph Convolutions Enrich the Self-Attention in Transformers!
  Jeongwhan Choi, Hyowon Wi, Jayoung Kim, Yehjin Shin, Kookjin Lee, Nathaniel Trask, Noseong Park (07 Dec 2023)