Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation
arXiv:2302.10322 · 20 February 2023
Bobby He, James Martens, Guodong Zhang, Aleksandar Botev, Andy Brock, Samuel L. Smith, Yee Whye Teh

Papers citing "Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation" (30 of 30 papers shown)
- ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving (09 Jun 2025) [VLM]
  Yongkang Li, Kaixin Xiong, Xiangyu Guo, Fang Li, Sixu Yan, ..., Bing Wang, Guang Chen, Hangjun Ye, Wenyu Liu, Xinggang Wang
- Always Skip Attention (04 May 2025)
  Yiping Ji, Hemanth Saratchandran, Peyman Moghaddam, Simon Lucey
- Revisiting Transformers through the Lens of Low Entropy and Dynamic Sparsity (26 Apr 2025)
  Ruifeng Ren, Yong Liu
- Bridging the Dimensional Chasm: Uncover Layer-wise Dimensional Reduction in Transformers through Token Correlation (28 Mar 2025)
  Zhuo-Yang Song, Zeyu Li, Qing-Hong Cao, Ming-xing Luo, Hua Xing Zhu
- The Geometry of Tokens in Internal Representations of Large Language Models (17 Jan 2025) [AIFin]
  Karthik Viswanathan, Yuri Gardinazzi, Giada Panerai, Alberto Cazzaniga, Matteo Biagetti
- Generalized Probabilistic Attention Mechanism in Transformers (21 Oct 2024)
  DongNyeong Heo, Heeyoul Choi
- AERO: Softmax-Only LLMs for Efficient Private Inference (16 Oct 2024)
  N. Jha, Brandon Reagen
- Lambda-Skip Connections: the architectural component that prevents Rank Collapse (14 Oct 2024)
  Federico Arangath Joseph, Jerome Sieber, Melanie Zeilinger, Carmen Amo Alonso
- ReLU's Revival: On the Entropic Overload in Normalization-Free Large Language Models (12 Oct 2024) [OffRL, AI4CE]
  N. Jha, Brandon Reagen
- Theoretical Insights into Fine-Tuning Attention Mechanism: Generalization and Optimization (03 Oct 2024)
  Xinhao Yao, Hongjin Qian, Xiaolin Hu, Gengze Xu, Wei Liu, Jian Luan, Bin Wang, Teli Ma
- Attention layers provably solve single-location regression (02 Oct 2024)
  Pierre Marion, Raphael Berthier, Gérard Biau, Claire Boyer
- Attention is a smoothed cubic spline (19 Aug 2024)
  Zehua Lai, Lek-Heng Lim, Yucong Liu
- The Impact of Initialization on LoRA Finetuning Dynamics (12 Jun 2024) [AI4CE]
  Soufiane Hayou, Nikhil Ghosh, Bin Yu
- Understanding and Minimising Outlier Features in Neural Network Training (29 May 2024)
  Bobby He, Lorenzo Noci, Daniele Paliotta, Imanol Schlag, Thomas Hofmann
- On the Role of Attention Masks and LayerNorm in Transformers (29 May 2024)
  Xinyi Wu, A. Ajorlou, Yifei Wang, Stefanie Jegelka, Ali Jadbabaie
- Transformer tricks: Removing weights for skipless transformers (18 Apr 2024)
  Nils Graef
- Learning Unified Reference Representation for Unsupervised Multi-class Anomaly Detection (18 Mar 2024)
  Liren He, Zhengkai Jiang, Jinlong Peng, Liang Liu, Qiangang Du, Xiaobin Hu, Wenbing Zhu, Mingmin Chi, Yabiao Wang, Chengjie Wang
- Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models (29 Feb 2024)
  Frederik Kunstner, Robin Yadav, Alan Milligan, Mark Schmidt, Alberto Bietti
- Disentangling the Causes of Plasticity Loss in Neural Networks (29 Feb 2024) [AI4CE]
  Clare Lyle, Zeyu Zheng, Khimya Khetarpal, H. V. Hasselt, Razvan Pascanu, James Martens, Will Dabney
- LoRA+: Efficient Low Rank Adaptation of Large Models (19 Feb 2024) [AI4CE]
  Soufiane Hayou, Nikhil Ghosh, Bin Yu
- SAMformer: Unlocking the Potential of Transformers in Time Series Forecasting with Sharpness-Aware Minimization and Channel-Wise Attention (15 Feb 2024) [AI4TS]
  Romain Ilbert, Ambroise Odonnat, Vasilii Feofanov, Aladin Virmaux, Giuseppe Paolo, Themis Palpanas, I. Redko
- BiPFT: Binary Pre-trained Foundation Transformer with Low-rank Estimation of Binarization Residual Polynomials (14 Dec 2023)
  Xingrun Xing, Li Du, Xinyuan Wang, Xianlin Zeng, Yequan Wang, Zheng Zhang, Jiajun Zhang
- Why "classic" Transformers are shallow and how to make them go deep (11 Dec 2023) [ViT]
  Yueyao Yu, Yin Zhang
- Simplifying Transformer Blocks (03 Nov 2023)
  Bobby He, Thomas Hofmann
- ScaleLong: Towards More Stable Training of Diffusion Model via Scaling Network Long Skip Connection (20 Oct 2023)
  Zhongzhan Huang, Pan Zhou, Shuicheng Yan, Liang Lin
- LEMON: Lossless model expansion (12 Oct 2023)
  Yite Wang, Jiahao Su, Hanlin Lu, Cong Xie, Tianyi Liu, Jianbo Yuan, Yanghua Peng, Ruoyu Sun, Hongxia Yang
- Towards Training Without Depth Limits: Batch Normalization Without Gradient Explosion (03 Oct 2023)
  Alexandru Meterez, Amir Joudaki, Francesco Orabona, Alexander Immer, Gunnar Rätsch, Hadi Daneshmand
- The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit (30 Jun 2023)
  Lorenzo Noci, Chuning Li, Mufan Li, Bobby He, Thomas Hofmann, Chris J. Maddison, Daniel M. Roy
- On the impact of activation and normalization in obtaining isometric embeddings at initialization (28 May 2023)
  Amir Joudaki, Hadi Daneshmand, Francis R. Bach
- Mimetic Initialization of Self-Attention Layers (16 May 2023)
  Asher Trockman, J. Zico Kolter