Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation
arXiv:2302.10322 · 20 February 2023
Bobby He, James Martens, Guodong Zhang, Aleksandar Botev, Andy Brock, Samuel L. Smith, Yee Whye Teh

Papers citing "Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation" (30 of 30 papers shown)
- ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving (09 Jun 2025) [VLM]
  Yongkang Li, Kaixin Xiong, Xiangyu Guo, Fang Li, Sixu Yan, ..., Bing Wang, Guang Chen, Hangjun Ye, Wenyu Liu, Xinggang Wang
- Always Skip Attention (04 May 2025)
  Yiping Ji, Hemanth Saratchandran, Peyman Moghaddam, Simon Lucey
- Revisiting Transformers through the Lens of Low Entropy and Dynamic Sparsity (26 Apr 2025)
  Ruifeng Ren, Yong Liu
- Bridging the Dimensional Chasm: Uncover Layer-wise Dimensional Reduction in Transformers through Token Correlation (28 Mar 2025)
  Zhuo-Yang Song, Zeyu Li, Qing-Hong Cao, Ming-xing Luo, Hua Xing Zhu
- The Geometry of Tokens in Internal Representations of Large Language Models (17 Jan 2025) [AIFin]
  Karthik Viswanathan, Yuri Gardinazzi, Giada Panerai, Alberto Cazzaniga, Matteo Biagetti
- Generalized Probabilistic Attention Mechanism in Transformers (21 Oct 2024)
  DongNyeong Heo, Heeyoul Choi
- AERO: Softmax-Only LLMs for Efficient Private Inference (16 Oct 2024)
  N. Jha, Brandon Reagen
- Lambda-Skip Connections: the architectural component that prevents Rank Collapse (14 Oct 2024)
  Federico Arangath Joseph, Jerome Sieber, Melanie Zeilinger, Carmen Amo Alonso
- ReLU's Revival: On the Entropic Overload in Normalization-Free Large Language Models (12 Oct 2024) [OffRL, AI4CE]
  N. Jha, Brandon Reagen
- Theoretical Insights into Fine-Tuning Attention Mechanism: Generalization and Optimization (03 Oct 2024)
  Xinhao Yao, Hongjin Qian, Xiaolin Hu, Gengze Xu, Wei Liu, Jian Luan, Bin Wang, Teli Ma
- Attention layers provably solve single-location regression (02 Oct 2024)
  Pierre Marion, Raphael Berthier, Gérard Biau, Claire Boyer
- Attention is a smoothed cubic spline (19 Aug 2024)
  Zehua Lai, Lek-Heng Lim, Yucong Liu
- The Impact of Initialization on LoRA Finetuning Dynamics (12 Jun 2024) [AI4CE]
  Soufiane Hayou, Nikhil Ghosh, Bin Yu
- Understanding and Minimising Outlier Features in Neural Network Training (29 May 2024)
  Bobby He, Lorenzo Noci, Daniele Paliotta, Imanol Schlag, Thomas Hofmann
- On the Role of Attention Masks and LayerNorm in Transformers (29 May 2024)
  Xinyi Wu, A. Ajorlou, Yifei Wang, Stefanie Jegelka, Ali Jadbabaie
- Transformer tricks: Removing weights for skipless transformers (18 Apr 2024)
  Nils Graef
- Learning Unified Reference Representation for Unsupervised Multi-class Anomaly Detection (18 Mar 2024)
  Liren He, Zhengkai Jiang, Jinlong Peng, Liang Liu, Qiangang Du, Xiaobin Hu, Wenbing Zhu, Mingmin Chi, Yabiao Wang, Chengjie Wang
- Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models (29 Feb 2024)
  Frederik Kunstner, Robin Yadav, Alan Milligan, Mark Schmidt, Alberto Bietti
- Disentangling the Causes of Plasticity Loss in Neural Networks (29 Feb 2024) [AI4CE]
  Clare Lyle, Zeyu Zheng, Khimya Khetarpal, H. V. Hasselt, Razvan Pascanu, James Martens, Will Dabney
- LoRA+: Efficient Low Rank Adaptation of Large Models (19 Feb 2024) [AI4CE]
  Soufiane Hayou, Nikhil Ghosh, Bin Yu
- SAMformer: Unlocking the Potential of Transformers in Time Series Forecasting with Sharpness-Aware Minimization and Channel-Wise Attention (15 Feb 2024) [AI4TS]
  Romain Ilbert, Ambroise Odonnat, Vasilii Feofanov, Aladin Virmaux, Giuseppe Paolo, Themis Palpanas, I. Redko
- BiPFT: Binary Pre-trained Foundation Transformer with Low-rank Estimation of Binarization Residual Polynomials (14 Dec 2023)
  Xingrun Xing, Li Du, Xinyuan Wang, Xianlin Zeng, Yequan Wang, Zheng Zhang, Jiajun Zhang
- Why "classic" Transformers are shallow and how to make them go deep (11 Dec 2023) [ViT]
  Yueyao Yu, Yin Zhang
- Simplifying Transformer Blocks (03 Nov 2023)
  Bobby He, Thomas Hofmann
- ScaleLong: Towards More Stable Training of Diffusion Model via Scaling Network Long Skip Connection (20 Oct 2023)
  Zhongzhan Huang, Pan Zhou, Shuicheng Yan, Liang Lin
- LEMON: Lossless model expansion (12 Oct 2023)
  Yite Wang, Jiahao Su, Hanlin Lu, Cong Xie, Tianyi Liu, Jianbo Yuan, Yanghua Peng, Ruoyu Sun, Hongxia Yang
- Towards Training Without Depth Limits: Batch Normalization Without Gradient Explosion (03 Oct 2023)
  Alexandru Meterez, Amir Joudaki, Francesco Orabona, Alexander Immer, Gunnar Rätsch, Hadi Daneshmand
- The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit (30 Jun 2023)
  Lorenzo Noci, Chuning Li, Mufan Li, Bobby He, Thomas Hofmann, Chris J. Maddison, Daniel M. Roy
- On the impact of activation and normalization in obtaining isometric embeddings at initialization (28 May 2023)
  Amir Joudaki, Hadi Daneshmand, Francis R. Bach
- Mimetic Initialization of Self-Attention Layers (16 May 2023)
  Asher Trockman, J. Zico Kolter