arXiv:2103.03404
Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth
5 March 2021
Yihe Dong, Jean-Baptiste Cordonnier, Andreas Loukas
Papers citing "Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth" (50 of 238 papers shown):
Encourage or Inhibit Monosemanticity? Revisit Monosemanticity from a Feature Decorrelation Perspective. Hanqi Yan, Yanzheng Xiang, Guangyi Chen, Yifei Wang, Lin Gui, Yulan He. 25 Jun 2024.
Transformer Normalisation Layers and the Independence of Semantic Subspaces. S. Menary, Samuel Kaski, Andre Freitas. 25 Jun 2024.
Unveiling the Hidden Structure of Self-Attention via Kernel Principal Component Analysis. R. Teo, Tan M. Nguyen. 19 Jun 2024.
Residual Connections and Normalization Can Provably Prevent Oversmoothing in GNNs. Michael Scholkemper, Xinyi Wu, Ali Jadbabaie, Michael T. Schaub. 05 Jun 2024.
TAIA: Large Language Models are Out-of-Distribution Data Learners. Shuyang Jiang, Yusheng Liao, Ya-Qin Zhang, Yu Wang, Yanfeng Wang. 30 May 2024.
Understanding and Minimising Outlier Features in Neural Network Training. Bobby He, Lorenzo Noci, Daniele Paliotta, Imanol Schlag, Thomas Hofmann. 29 May 2024.
On the Role of Attention Masks and LayerNorm in Transformers. Xinyi Wu, A. Ajorlou, Yifei Wang, Stefanie Jegelka, Ali Jadbabaie. 29 May 2024.
Effective Layer Pruning Through Similarity Metric Perspective. Ian Pons, Bruno Yamamoto, Anna H. Reali Costa, Artur Jordao. 27 May 2024.
Infinite Limits of Multi-head Transformer Dynamics. Blake Bordelon, Hamza Tahir Chaudhry, C. Pehlevan. 24 May 2024. [AI4CE]
Stereo-Knowledge Distillation from dpMV to Dual Pixels for Light Field Video Reconstruction. Aryan Garg, Raghav Mallampali, Akshat Joshi, Shrisudhan Govindarajan, Kaushik Mitra. 20 May 2024.
KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization. Tianyi Zhang, Jonah Yi, Zhaozhuo Xu, Anshumali Shrivastava. 07 May 2024. [MQ]
EdgeFusion: On-Device Text-to-Image Generation. Thibault Castells, Hyoung-Kyu Song, Tairen Piao, Shinkook Choi, Bo-Kyeong Kim, Hanyoung Yim, Changgwun Lee, Jae Gon Kim, Tae-Ho Kim. 18 Apr 2024. [VLM]
Adapting LLaMA Decoder to Vision Transformer. Jiahao Wang, Wenqi Shao, Mengzhao Chen, Chengyue Wu, Yong Liu, Taiqiang Wu, Kaipeng Zhang, Songyang Zhang, Kai-xiang Chen, Ping Luo. 10 Apr 2024. [MLLM]
T-DEED: Temporal-Discriminability Enhancer Encoder-Decoder for Precise Event Spotting in Sports Videos. Artur Xarles, Sergio Escalera, T. Moeslund, Albert Clapés. 08 Apr 2024.
Generative Retrieval as Multi-Vector Dense Retrieval. Shiguang Wu, Wenda Wei, Mengqi Zhang, Zhumin Chen, Jun Ma, Zhaochun Ren, Maarten de Rijke, Pengjie Ren. 31 Mar 2024. [3DV]
Not All Attention is Needed: Parameter and Computation Efficient Transfer Learning for Multi-modal Large Language Models. Qiong Wu, Weihao Ye, Yiyi Zhou, Xiaoshuai Sun, Rongrong Ji. 22 Mar 2024. [MoE]
TRELM: Towards Robust and Efficient Pre-training for Knowledge-Enhanced Language Models. Junbing Yan, Chengyu Wang, Taolin Zhang, Xiao-Mei He, Junyuan Huang, Longtao Huang, Hui Xue, Wei Zhang. 17 Mar 2024. [VLM, KELM]
Mechanics of Next Token Prediction with Self-Attention. Yingcong Li, Yixiao Huang, M. E. Ildiz, A. S. Rawat, Samet Oymak. 12 Mar 2024.
Attacking Transformers with Feature Diversity Adversarial Perturbation. Chenxing Gao, Hang Zhou, Junqing Yu, Yuteng Ye, Jiale Cai, Junle Wang, Wei Yang. 10 Mar 2024. [AAML]
Geometric Dynamics of Signal Propagation Predict Trainability of Transformers. Aditya Cowsik, Tamra M. Nebabu, Xiao-Liang Qi, Surya Ganguli. 05 Mar 2024.
A Comprehensive Survey on Process-Oriented Automatic Text Summarization with Exploration of LLM-Based Methods. Hanlei Jin, Yang Zhang, Dan Meng, Jun Wang, Jinghua Tan. 05 Mar 2024.
NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free Attention. Tianyi Zhang, Jonah Yi, Bowen Yao, Zhaozhuo Xu, Anshumali Shrivastava. 02 Mar 2024. [MQ]
Why Transformers Need Adam: A Hessian Perspective. Yushun Zhang, Congliang Chen, Tian Ding, Ziniu Li, Ruoyu Sun, Zhimin Luo. 26 Feb 2024.
Data-free Weight Compress and Denoise for Large Language Models. Runyu Peng, Yunhua Zhou, Qipeng Guo, Yang Gao, Hang Yan, Xipeng Qiu, Dahua Lin. 26 Feb 2024.
PIDformer: Transformer Meets Control Theory. Tam Nguyen, César A. Uribe, Tan-Minh Nguyen, Richard G. Baraniuk. 25 Feb 2024.
The Impact of LoRA on the Emergence of Clusters in Transformers. Hugo Koubbi, Matthieu Boussard, Louis Hernandez. 23 Feb 2024.
Prompting a Pretrained Transformer Can Be a Universal Approximator. Aleksandar Petrov, Philip H. S. Torr, Adel Bibi. 22 Feb 2024.
SAMformer: Unlocking the Potential of Transformers in Time Series Forecasting with Sharpness-Aware Minimization and Channel-Wise Attention. Romain Ilbert, Ambroise Odonnat, Vasilii Feofanov, Aladin Virmaux, Giuseppe Paolo, Themis Palpanas, I. Redko. 15 Feb 2024. [AI4TS]
Bidirectional Generative Pre-training for Improving Time Series Representation Learning. Ziyang Song, Qincheng Lu, He Zhu, Yue Li. 14 Feb 2024. [AI4TS]
Efficient Stagewise Pretraining via Progressive Subnetworks. Abhishek Panigrahi, Nikunj Saunshi, Kaifeng Lyu, Sobhan Miryoosefi, Sashank J. Reddi, Satyen Kale, Sanjiv Kumar. 08 Feb 2024.
Attention as Robust Representation for Time Series Forecasting. Peisong Niu, Tian Zhou, Xue Wang, Liang Sun, Rong Jin. 08 Feb 2024. [AI4TS]
Implicit Bias and Fast Convergence Rates for Self-attention. Bhavya Vasudeva, Puneesh Deora, Christos Thrampoulidis. 08 Feb 2024.
Breaking Symmetry When Training Transformers. Chunsheng Zuo, Michael Guerzhoy. 06 Feb 2024.
Shortened LLaMA: Depth Pruning for Large Language Models with Comparison of Retraining Methods. Bo-Kyeong Kim, Geonmin Kim, Tae-Ho Kim, Thibault Castells, Shinkook Choi, Junho Shin, Hyoung-Kyu Song. 05 Feb 2024.
Self-attention Networks Localize When QK-eigenspectrum Concentrates. Han Bao, Ryuichiro Hataya, Ryo Karakida. 03 Feb 2024.
LIR: A Lightweight Baseline for Image Restoration. Dongqi Fan, Ting Yue, Xin Zhao, Renjing Xu, Liang Chang. 02 Feb 2024.
MVSFormer++: Revealing the Devil in Transformer's Details for Multi-View Stereo. Chenjie Cao, Xinlin Ren, Yanwei Fu. 22 Jan 2024.
AAT: Adapting Audio Transformer for Various Acoustics Recognition Tasks. Yun Liang, Hai Lin, Shaojian Qiu, Yihang Zhang. 19 Jan 2024.
When Large Language Models Meet Evolutionary Algorithms: Potential Enhancements and Challenges. Wang Chao, Jiaxuan Zhao, Licheng Jiao, Lingling Li, Fang Liu, Shuyuan Yang. 19 Jan 2024.
UPDP: A Unified Progressive Depth Pruner for CNN and Vision Transformer. Ji Liu, Dehua Tang, Yuanxian Huang, Li Lyna Zhang, Xiaocheng Zeng, ..., Jinzhang Peng, Yu-Chiang Frank Wang, Fan Jiang, Lu Tian, Ashish Sirasao. 12 Jan 2024. [ViT]
Setting the Record Straight on Transformer Oversmoothing. G. Dovonon, M. Bronstein, Matt J. Kusner. 09 Jan 2024.
PanGu-π: Enhancing Language Model Architectures via Nonlinearity Compensation. Yunhe Wang, Hanting Chen, Yehui Tang, Tianyu Guo, Kai Han, ..., Qinghua Xu, Qun Liu, Jun Yao, Chao Xu, Dacheng Tao. 27 Dec 2023.
Generating and Reweighting Dense Contrastive Patterns for Unsupervised Anomaly Detection. Songmin Dai, Yifan Wu, Xiaoqiang Li, Xiangyang Xue. 26 Dec 2023.
Pixel-to-Abundance Translation: Conditional Generative Adversarial Networks Based on Patch Transformer for Hyperspectral Unmixing. Li Wang, Xiaohua Zhang, Longfei Li, Hong-yun Meng, Xianghai Cao. 20 Dec 2023.
A mathematical perspective on Transformers. Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, Philippe Rigollet. 17 Dec 2023. [EDL, AI4CE]
An Attentive Inductive Bias for Sequential Recommendation beyond the Self-Attention. Yehjin Shin, Jeongwhan Choi, Hyowon Wi, Noseong Park. 16 Dec 2023.
Auto-Prox: Training-Free Vision Transformer Architecture Search via Automatic Proxy Discovery. Zimian Wei, Lujun Li, Peijie Dong, Zheng Hui, Anggeng Li, Menglong Lu, H. Pan, Zhiliang Tian, Dongsheng Li. 14 Dec 2023. [ViT]
Polynomial-based Self-Attention for Table Representation learning. Jayoung Kim, Yehjin Shin, Jeongwhan Choi, Hyowon Wi, Noseong Park. 12 Dec 2023. [LMTD]
Why "classic" Transformers are shallow and how to make them go deep. Yueyao Yu, Yin Zhang. 11 Dec 2023. [ViT]
Graph Convolutions Enrich the Self-Attention in Transformers! Jeongwhan Choi, Hyowon Wi, Jayoung Kim, Yehjin Shin, Kookjin Lee, Nathaniel Trask, Noseong Park. 07 Dec 2023.