Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation

20 February 2023
Bobby He, James Martens, Guodong Zhang, Aleksandar Botev, Andy Brock, Samuel L. Smith, Yee Whye Teh
arXiv: 2302.10322 (abs / PDF / HTML)

Papers citing "Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation"

30 / 30 papers shown
• ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving
  Yongkang Li, Kaixin Xiong, Xiangyu Guo, Fang Li, Sixu Yan, ..., Bing Wang, Guang Chen, Hangjun Ye, Wenyu Liu, Xinggang Wang
  VLM · 09 Jun 2025
• Always Skip Attention
  Yiping Ji, Hemanth Saratchandran, Peyman Moghaddam, Simon Lucey
  04 May 2025
• Revisiting Transformers through the Lens of Low Entropy and Dynamic Sparsity
  Ruifeng Ren, Yong Liu
  26 Apr 2025
• Bridging the Dimensional Chasm: Uncover Layer-wise Dimensional Reduction in Transformers through Token Correlation
  Zhuo-Yang Song, Zeyu Li, Qing-Hong Cao, Ming-xing Luo, Hua Xing Zhu
  28 Mar 2025
• The Geometry of Tokens in Internal Representations of Large Language Models
  Karthik Viswanathan, Yuri Gardinazzi, Giada Panerai, Alberto Cazzaniga, Matteo Biagetti
  AIFin · 17 Jan 2025
• Generalized Probabilistic Attention Mechanism in Transformers
  DongNyeong Heo, Heeyoul Choi
  21 Oct 2024
• AERO: Softmax-Only LLMs for Efficient Private Inference
  N. Jha, Brandon Reagen
  16 Oct 2024
• Lambda-Skip Connections: the architectural component that prevents Rank Collapse
  Federico Arangath Joseph, Jerome Sieber, Melanie Zeilinger, Carmen Amo Alonso
  14 Oct 2024
• ReLU's Revival: On the Entropic Overload in Normalization-Free Large Language Models
  N. Jha, Brandon Reagen
  OffRL, AI4CE · 12 Oct 2024
• Theoretical Insights into Fine-Tuning Attention Mechanism: Generalization and Optimization
  Xinhao Yao, Hongjin Qian, Xiaolin Hu, Gengze Xu, Wei Liu, Jian Luan, Bin Wang, Teli Ma
  03 Oct 2024
• Attention layers provably solve single-location regression
  Pierre Marion, Raphael Berthier, Gérard Biau, Claire Boyer
  02 Oct 2024
• Attention is a smoothed cubic spline
  Zehua Lai, Lek-Heng Lim, Yucong Liu
  19 Aug 2024
• The Impact of Initialization on LoRA Finetuning Dynamics
  Soufiane Hayou, Nikhil Ghosh, Bin Yu
  AI4CE · 12 Jun 2024
• Understanding and Minimising Outlier Features in Neural Network Training
  Bobby He, Lorenzo Noci, Daniele Paliotta, Imanol Schlag, Thomas Hofmann
  29 May 2024
• On the Role of Attention Masks and LayerNorm in Transformers
  Xinyi Wu, A. Ajorlou, Yifei Wang, Stefanie Jegelka, Ali Jadbabaie
  29 May 2024
• Transformer tricks: Removing weights for skipless transformers
  Nils Graef
  18 Apr 2024
• Learning Unified Reference Representation for Unsupervised Multi-class Anomaly Detection
  Liren He, Zhengkai Jiang, Jinlong Peng, Liang Liu, Qiangang Du, Xiaobin Hu, Wenbing Zhu, Mingmin Chi, Yabiao Wang, Chengjie Wang
  18 Mar 2024
• Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models
  Frederik Kunstner, Robin Yadav, Alan Milligan, Mark Schmidt, Alberto Bietti
  29 Feb 2024
• Disentangling the Causes of Plasticity Loss in Neural Networks
  Clare Lyle, Zeyu Zheng, Khimya Khetarpal, H. V. Hasselt, Razvan Pascanu, James Martens, Will Dabney
  AI4CE · 29 Feb 2024
• LoRA+: Efficient Low Rank Adaptation of Large Models
  Soufiane Hayou, Nikhil Ghosh, Bin Yu
  AI4CE · 19 Feb 2024
• SAMformer: Unlocking the Potential of Transformers in Time Series Forecasting with Sharpness-Aware Minimization and Channel-Wise Attention
  Romain Ilbert, Ambroise Odonnat, Vasilii Feofanov, Aladin Virmaux, Giuseppe Paolo, Themis Palpanas, I. Redko
  AI4TS · 15 Feb 2024
• BiPFT: Binary Pre-trained Foundation Transformer with Low-rank Estimation of Binarization Residual Polynomials
  Xingrun Xing, Li Du, Xinyuan Wang, Xianlin Zeng, Yequan Wang, Zheng Zhang, Jiajun Zhang
  14 Dec 2023
Why "classic" Transformers are shallow and how to make them go deep
Why "classic" Transformers are shallow and how to make them go deep
Yueyao Yu
Yin Zhang
ViT
163
0
0
11 Dec 2023
• Simplifying Transformer Blocks
  Bobby He, Thomas Hofmann
  03 Nov 2023
• ScaleLong: Towards More Stable Training of Diffusion Model via Scaling Network Long Skip Connection
  Zhongzhan Huang, Pan Zhou, Shuicheng Yan, Liang Lin
  20 Oct 2023
• LEMON: Lossless model expansion
  Yite Wang, Jiahao Su, Hanlin Lu, Cong Xie, Tianyi Liu, Jianbo Yuan, Yanghua Peng, Ruoyu Sun, Hongxia Yang
  12 Oct 2023
• Towards Training Without Depth Limits: Batch Normalization Without Gradient Explosion
  Alexandru Meterez, Amir Joudaki, Francesco Orabona, Alexander Immer, Gunnar Rätsch, Hadi Daneshmand
  03 Oct 2023
• The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit
  Lorenzo Noci, Chuning Li, Mufan Li, Bobby He, Thomas Hofmann, Chris J. Maddison, Daniel M. Roy
  30 Jun 2023
• On the impact of activation and normalization in obtaining isometric embeddings at initialization
  Amir Joudaki, Hadi Daneshmand, Francis R. Bach
  28 May 2023
• Mimetic Initialization of Self-Attention Layers
  Asher Trockman, J. Zico Kolter
  16 May 2023