arXiv:1908.11365
Improving Deep Transformer with Depth-Scaled Initialization and Merged Attention
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019
29 August 2019
Biao Zhang, Ivan Titov, Rico Sennrich
Papers citing "Improving Deep Transformer with Depth-Scaled Initialization and Merged Attention" (50 of 71 papers shown)
Frequency-Aware Token Reduction for Efficient Vision Transformer
Dong-Jae Lee, Jiwan Hur, Jaehyun Choi, Jaemyung Yu, Junmo Kim
26 Nov 2025
From Condensation to Rank Collapse: A Two-Stage Analysis of Transformer Training Dynamics
Zheng-an Chen, Tao Luo
08 Oct 2025
Scalable Complexity Control Facilitates Reasoning Ability of LLMs
Liangkai Hang, Junjie Yao, Zhiwei Bai, Jiahao Huo, Yang Chen, ..., Feiyu Xiong, Y. Zhang, Weinan E, Hongkang Yang, Zhi-hai Xu
29 May 2025
Variance Control via Weight Rescaling in LLM Pre-training
Louis Owen, Abhay Kumar, Nilabhra Roy Chowdhury, Fabian Güra
21 Mar 2025
HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization
Zhijian Zhuo, Yutao Zeng, Ya Wang, Sijun Zhang, Jian Yang, Xiaoqing Li, Xun Zhou, Jinwen Ma
06 Mar 2025
The Curse of Depth in Large Language Models
Wenfang Sun, Xinyuan Song, Pengxiang Li, Lu Yin, Yefeng Zheng, Shiwei Liu
09 Feb 2025
Merino: Entropy-driven Design for Generative Language Models on IoT Devices
AAAI Conference on Artificial Intelligence (AAAI), 2024
Youpeng Zhao, Ming Lin, Huadong Tang, Qiang Wu, Jun Wang
28 Jan 2025
Complexity Control Facilitates Reasoning-Based Compositional Generalization in Transformers
Zhongwang Zhang, Pengxiao Lin, Zhiwei Wang, Yaoyu Zhang, Zhi-Qin John Xu
15 Jan 2025
Generalized Probabilistic Attention Mechanism in Transformers
DongNyeong Heo, Heeyoul Choi
21 Oct 2024
Initialization of Large Language Models via Reparameterization to Mitigate Loss Spikes
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Kosuke Nishida, Kyosuke Nishida, Kuniko Saito
07 Oct 2024
Language-Informed Beam Search Decoding for Multilingual Machine Translation
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Yilin Yang, Stefan Lee, Prasad Tadepalli
11 Aug 2024
Advancing Neural Network Performance through Emergence-Promoting Initialization Scheme
Johnny Jingze Li, V. George, Gabriel A. Silva
26 Jul 2024
Resolving Discrepancies in Compute-Optimal Scaling of Language Models
Tomer Porian, Mitchell Wortsman, J. Jitsev, Ludwig Schmidt, Y. Carmon
27 Jun 2024
Delving into Differentially Private Transformer
Youlong Ding, Xueyang Wu, Yining Meng, Yonggang Luo, Hao Wang, Weike Pan
28 May 2024
Initialization is Critical to Whether Transformers Fit Composite Functions by Reasoning or Memorizing
Neural Information Processing Systems (NeurIPS), 2024
Zhongwang Zhang, Pengxiao Lin, Zhiwei Wang, Yaoyu Zhang, Z. Xu
08 May 2024
Enhancing Length Extrapolation in Sequential Models with Pointer-Augmented Neural Memory
Hung Le, D. Nguyen, Kien Do, Svetha Venkatesh, T. Tran
18 Apr 2024
Language models scale reliably with over-training and on downstream tasks
International Conference on Learning Representations (ICLR), 2024
S. Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, ..., Y. Carmon, Achal Dave, Reinhard Heckel, Niklas Muennighoff, Ludwig Schmidt
13 Mar 2024
Why Transformers Need Adam: A Hessian Perspective
Yushun Zhang, Congliang Chen, Tian Ding, Ziniu Li, Zhimin Luo
26 Feb 2024
Spike No More: Stabilizing the Pre-training of Large Language Models
Sho Takase, Shun Kiyono, Sosuke Kobayashi, Jun Suzuki
28 Dec 2023
Simplifying Transformer Blocks
International Conference on Learning Representations (ICLR), 2023
Bobby He, Thomas Hofmann
03 Nov 2023
Understanding Optimization of Deep Learning via Jacobian Matrix and Lipschitz Constant
Xianbiao Qi, Jianan Wang, Lei Zhang
15 Jun 2023
DPFormer: Learning Differentially Private Transformer on Long-Tailed Data
Youlong Ding, Xueyang Wu, Hongya Wang, Weike Pan
28 May 2023
BranchNorm: Robustly Scaling Extremely Deep Transformers
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Yanjun Liu, Xianfeng Zeng, Fandong Meng, Jie Zhou
04 May 2023
Are More Layers Beneficial to Graph Transformers?
International Conference on Learning Representations (ICLR), 2023
Haiteng Zhao, Shuming Ma, Dongdong Zhang, Zhi-Hong Deng, Furu Wei
01 Mar 2023
Efficient CTC Regularization via Coarse Labels for End-to-End Speech Translation
Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2023
Biao Zhang, Barry Haddow, Rico Sennrich
21 Feb 2023
Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation
International Conference on Learning Representations (ICLR), 2023
Bobby He, James Martens, Guodong Zhang, Aleksandar Botev, Andy Brock, Samuel L. Smith, Yee Whye Teh
20 Feb 2023
Optimizing Deep Transformers for Chinese-Thai Low-Resource Translation
Wenjie Hao, Hongfei Xu, Lingling Mu, Hongying Zan
24 Dec 2022
CUNI Submission in WMT22 General Task
Conference on Machine Translation (WMT), 2022
Josef Jon, Martin Popel, Ondrej Bojar
29 Nov 2022
GTrans: Grouping and Fusing Transformer Layers for Neural Machine Translation
IEEE/ACM Transactions on Audio Speech and Language Processing (TASLP), 2022
Jian Yang, Yuwei Yin, Liqun Yang, Shuming Ma, Haoyang Huang, Dongdong Zhang, Furu Wei, Zhoujun Li
29 Jul 2022
Insights into Pre-training via Simpler Synthetic Tasks
Neural Information Processing Systems (NeurIPS), 2022
Yuhuai Wu, Felix Li, Abigail Z. Jacobs
21 Jun 2022
Revisiting End-to-End Speech-to-Text Translation From Scratch
International Conference on Machine Learning (ICML), 2022
Biao Zhang, Barry Haddow, Rico Sennrich
09 Jun 2022
Multilingual Neural Machine Translation with Deep Encoder and Multiple Shallow Decoders
Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2022
Xiang Kong, Adithya Renduchintala, James Cross, Yuqing Tang, Jiatao Gu, Xian Li
05 Jun 2022
B2T Connection: Serving Stability and Performance in Deep Transformers
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Sho Takase, Shun Kiyono, Sosuke Kobayashi, Jun Suzuki
01 Jun 2022
ODE Transformer: An Ordinary Differential Equation-Inspired Model for Sequence Generation
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Bei Li, Quan Du, Tao Zhou, Yi Jing, Shuhan Zhou, Xin Zeng, Tong Xiao, JingBo Zhu, Xuebo Liu, Min Zhang
17 Mar 2022
Look Backward and Forward: Self-Knowledge Distillation with Bidirectional Decoder for Neural Machine Translation
Xuan Zhang, Libin Shen, Disheng Pan, Liangguo Wang, Yanjun Miao
10 Mar 2022
Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain Analysis: From Theory to Practice
International Conference on Learning Representations (ICLR), 2022
Peihao Wang, Wenqing Zheng, Tianlong Chen, Zinan Lin
09 Mar 2022
DeepNet: Scaling Transformers to 1,000 Layers
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, Furu Wei
01 Mar 2022
Examining Scaling and Transfer of Language Model Architectures for Machine Translation
International Conference on Machine Learning (ICML), 2022
Biao Zhang, Behrooz Ghorbani, Ankur Bapna, Yong Cheng, Xavier Garcia, Jonathan Shen, Orhan Firat
01 Feb 2022
CUNI systems for WMT21: Multilingual Low-Resource Translation for Indo-European Languages Shared Task
Josef Jon, Michal Novák, João Paulo Aires, Dušan Variš, Ondrej Bojar
20 Sep 2021
The NiuTrans System for WNGT 2020 Efficiency Task
Chi Hu, Bei Li, Ye Lin, Yinqiao Li, Yanyang Li, Chenglong Wang, Tong Xiao, Jingbo Zhu
16 Sep 2021
16 Sep 2021
The Devil is in the Detail: Simple Tricks Improve Systematic Generalization of Transformers
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber
26 Aug 2021
Recurrent multiple shared layers in Depth for Neural Machine Translation
Guoliang Li, Yiyang Li
23 Aug 2021
Tiny Neural Models for Seq2Seq
A. Kandoor
07 Aug 2021
ODE Transformer: An Ordinary Differential Equation-Inspired Model for Neural Machine Translation
Bei Li, Quan Du, Tao Zhou, Shuhan Zhou, Xin Zeng, Tong Xiao, Jingbo Zhu
06 Apr 2021
An Efficient Transformer Decoder with Compressed Sub-layers
AAAI Conference on Artificial Intelligence (AAAI), 2021
Yanyang Li, Ye Lin, Tong Xiao, Jingbo Zhu
03 Jan 2021
Optimizing Deeper Transformers on Small Datasets
Annual Meeting of the Association for Computational Linguistics (ACL), 2020
Peng Xu, Dhruv Kumar, Wei Yang, Wenjie Zi, Keyi Tang, Chenyang Huang, Jackie C.K. Cheung, S. Prince, Yanshuai Cao
30 Dec 2020
Learning Light-Weight Translation Models from Deep Transformer
AAAI Conference on Artificial Intelligence (AAAI), 2020
Bei Li, Ziyang Wang, Hui Liu, Quan Du, Tong Xiao, Chunliang Zhang, Jingbo Zhu
27 Dec 2020
RealFormer: Transformer Likes Residual Attention
Findings of the Association for Computational Linguistics, 2020
Ruining He, Anirudh Ravula, Bhargav Kanagal, Joshua Ainslie
21 Dec 2020
Improving Gradient Flow with Unrolled Highway Expectation Maximization
C. Song, Eunseok Kim, Inwook Shim
09 Dec 2020
Document Graph for Neural Machine Translation
Mingzhou Xu, Liangyou Li, Derek F. Wong, Qun Liu, Lidia S. Chao
07 Dec 2020
Page 1 of 2