arXiv:1908.11365
Improving Deep Transformer with Depth-Scaled Initialization and Merged Attention
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019
29 August 2019
Biao Zhang, Ivan Titov, Rico Sennrich
Papers citing "Improving Deep Transformer with Depth-Scaled Initialization and Merged Attention" (50 of 71 papers shown)
Frequency-Aware Token Reduction for Efficient Vision Transformer
Dong-Jae Lee, Jiwan Hur, Jaehyun Choi, Jaemyung Yu, Junmo Kim
26 Nov 2025
From Condensation to Rank Collapse: A Two-Stage Analysis of Transformer Training Dynamics
Zheng-an Chen, Tao Luo
08 Oct 2025
Scalable Complexity Control Facilitates Reasoning Ability of LLMs
Liangkai Hang, Junjie Yao, Zhiwei Bai, Jiahao Huo, Yang Chen, ..., Feiyu Xiong, Y. Zhang, Weinan E, Hongkang Yang, Zhi-hai Xu
29 May 2025
Variance Control via Weight Rescaling in LLM Pre-training
Louis Owen, Abhay Kumar, Nilabhra Roy Chowdhury, Fabian Güra
21 Mar 2025
HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization
Zhijian Zhuo, Yutao Zeng, Ya Wang, Sijun Zhang, Jian Yang, Xiaoqing Li, Xun Zhou, Jinwen Ma
06 Mar 2025
The Curse of Depth in Large Language Models
Wenfang Sun, Xinyuan Song, Pengxiang Li, Lu Yin, Yefeng Zheng, Shiwei Liu
09 Feb 2025
Merino: Entropy-driven Design for Generative Language Models on IoT Devices
AAAI Conference on Artificial Intelligence (AAAI), 2024
Youpeng Zhao, Ming Lin, Huadong Tang, Qiang Wu, Jun Wang
28 Jan 2025
Complexity Control Facilitates Reasoning-Based Compositional Generalization in Transformers
Zhongwang Zhang, Pengxiao Lin, Zhiwei Wang, Yaoyu Zhang, Zhi-Qin John Xu
15 Jan 2025
Generalized Probabilistic Attention Mechanism in Transformers
DongNyeong Heo, Heeyoul Choi
21 Oct 2024
Initialization of Large Language Models via Reparameterization to Mitigate Loss Spikes
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Kosuke Nishida, Kyosuke Nishida, Kuniko Saito
07 Oct 2024
Language-Informed Beam Search Decoding for Multilingual Machine Translation
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Yilin Yang, Stefan Lee, Prasad Tadepalli
11 Aug 2024
Advancing Neural Network Performance through Emergence-Promoting Initialization Scheme
Johnny Jingze Li, V. George, Gabriel A. Silva
26 Jul 2024
Resolving Discrepancies in Compute-Optimal Scaling of Language Models
Tomer Porian, Mitchell Wortsman, J. Jitsev, Ludwig Schmidt, Y. Carmon
27 Jun 2024
Delving into Differentially Private Transformer
Youlong Ding, Xueyang Wu, Yining Meng, Yonggang Luo, Hao Wang, Weike Pan
28 May 2024
Initialization is Critical to Whether Transformers Fit Composite Functions by Reasoning or Memorizing
Neural Information Processing Systems (NeurIPS), 2024
Zhongwang Zhang, Pengxiao Lin, Zhiwei Wang, Yaoyu Zhang, Z. Xu
08 May 2024
Enhancing Length Extrapolation in Sequential Models with Pointer-Augmented Neural Memory
Hung Le, D. Nguyen, Kien Do, Svetha Venkatesh, T. Tran
18 Apr 2024
Language models scale reliably with over-training and on downstream tasks
International Conference on Learning Representations (ICLR), 2024
S. Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, ..., Y. Carmon, Achal Dave, Reinhard Heckel, Niklas Muennighoff, Ludwig Schmidt
13 Mar 2024
Why Transformers Need Adam: A Hessian Perspective
Yushun Zhang, Congliang Chen, Tian Ding, Ziniu Li, Zhimin Luo
26 Feb 2024
Spike No More: Stabilizing the Pre-training of Large Language Models
Sho Takase, Shun Kiyono, Sosuke Kobayashi, Jun Suzuki
28 Dec 2023
Simplifying Transformer Blocks
International Conference on Learning Representations (ICLR), 2023
Bobby He, Thomas Hofmann
03 Nov 2023
Understanding Optimization of Deep Learning via Jacobian Matrix and Lipschitz Constant
Xianbiao Qi, Jianan Wang, Lei Zhang
15 Jun 2023
DPFormer: Learning Differentially Private Transformer on Long-Tailed Data
Youlong Ding, Xueyang Wu, Hongya Wang, Weike Pan
28 May 2023
BranchNorm: Robustly Scaling Extremely Deep Transformers
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Yanjun Liu, Xianfeng Zeng, Fandong Meng, Jie Zhou
04 May 2023
Are More Layers Beneficial to Graph Transformers?
International Conference on Learning Representations (ICLR), 2023
Haiteng Zhao, Shuming Ma, Dongdong Zhang, Zhi-Hong Deng, Furu Wei
01 Mar 2023
Efficient CTC Regularization via Coarse Labels for End-to-End Speech Translation
Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2023
Biao Zhang, Barry Haddow, Rico Sennrich
21 Feb 2023
Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation
International Conference on Learning Representations (ICLR), 2023
Bobby He, James Martens, Guodong Zhang, Aleksandar Botev, Andy Brock, Samuel L. Smith, Yee Whye Teh
20 Feb 2023
Optimizing Deep Transformers for Chinese-Thai Low-Resource Translation
Wenjie Hao, Hongfei Xu, Lingling Mu, Hongying Zan
24 Dec 2022
CUNI Submission in WMT22 General Task
Conference on Machine Translation (WMT), 2022
Josef Jon, Martin Popel, Ondrej Bojar
29 Nov 2022
GTrans: Grouping and Fusing Transformer Layers for Neural Machine Translation
IEEE/ACM Transactions on Audio Speech and Language Processing (TASLP), 2022
Jian Yang, Yuwei Yin, Liqun Yang, Shuming Ma, Haoyang Huang, Dongdong Zhang, Furu Wei, Zhoujun Li
29 Jul 2022
Insights into Pre-training via Simpler Synthetic Tasks
Neural Information Processing Systems (NeurIPS), 2022
Yuhuai Wu, Felix Li, Abigail Z. Jacobs
21 Jun 2022
Revisiting End-to-End Speech-to-Text Translation From Scratch
International Conference on Machine Learning (ICML), 2022
Biao Zhang, Barry Haddow, Rico Sennrich
09 Jun 2022
Multilingual Neural Machine Translation with Deep Encoder and Multiple Shallow Decoders
Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2022
Xiang Kong, Adithya Renduchintala, James Cross, Yuqing Tang, Jiatao Gu, Xian Li
05 Jun 2022
B2T Connection: Serving Stability and Performance in Deep Transformers
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Sho Takase, Shun Kiyono, Sosuke Kobayashi, Jun Suzuki
01 Jun 2022
ODE Transformer: An Ordinary Differential Equation-Inspired Model for Sequence Generation
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Bei Li, Quan Du, Tao Zhou, Yi Jing, Shuhan Zhou, Xin Zeng, Tong Xiao, JingBo Zhu, Xuebo Liu, Min Zhang
17 Mar 2022
Look Backward and Forward: Self-Knowledge Distillation with Bidirectional Decoder for Neural Machine Translation
Xuan Zhang, Libin Shen, Disheng Pan, Liangguo Wang, Yanjun Miao
10 Mar 2022
Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain Analysis: From Theory to Practice
International Conference on Learning Representations (ICLR), 2022
Peihao Wang, Wenqing Zheng, Tianlong Chen, Zinan Lin
09 Mar 2022
DeepNet: Scaling Transformers to 1,000 Layers
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, Furu Wei
01 Mar 2022
Examining Scaling and Transfer of Language Model Architectures for Machine Translation
International Conference on Machine Learning (ICML), 2022
Biao Zhang, Behrooz Ghorbani, Ankur Bapna, Yong Cheng, Xavier Garcia, Jonathan Shen, Orhan Firat
01 Feb 2022
CUNI systems for WMT21: Multilingual Low-Resource Translation for Indo-European Languages Shared Task
Josef Jon, Michal Novák, João Paulo Aires, Dušan Variš, Ondrej Bojar
20 Sep 2021
The NiuTrans System for WNGT 2020 Efficiency Task
Chi Hu, Bei Li, Ye Lin, Yinqiao Li, Yanyang Li, Chenglong Wang, Tong Xiao, Jingbo Zhu
16 Sep 2021
16 Sep 2021
The Devil is in the Detail: Simple Tricks Improve Systematic Generalization of Transformers
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber
26 Aug 2021
Recurrent multiple shared layers in Depth for Neural Machine Translation
Guoliang Li, Yiyang Li
23 Aug 2021
Tiny Neural Models for Seq2Seq
A. Kandoor
07 Aug 2021
ODE Transformer: An Ordinary Differential Equation-Inspired Model for Neural Machine Translation
Bei Li, Quan Du, Tao Zhou, Shuhan Zhou, Xin Zeng, Tong Xiao, Jingbo Zhu
06 Apr 2021
An Efficient Transformer Decoder with Compressed Sub-layers
AAAI Conference on Artificial Intelligence (AAAI), 2021
Yanyang Li, Ye Lin, Tong Xiao, Jingbo Zhu
03 Jan 2021
Optimizing Deeper Transformers on Small Datasets
Annual Meeting of the Association for Computational Linguistics (ACL), 2020
Peng Xu, Dhruv Kumar, Wei Yang, Wenjie Zi, Keyi Tang, Chenyang Huang, Jackie C.K. Cheung, S. Prince, Yanshuai Cao
30 Dec 2020
Learning Light-Weight Translation Models from Deep Transformer
AAAI Conference on Artificial Intelligence (AAAI), 2020
Bei Li, Ziyang Wang, Hui Liu, Quan Du, Tong Xiao, Chunliang Zhang, Jingbo Zhu
27 Dec 2020
RealFormer: Transformer Likes Residual Attention
Findings of the Association for Computational Linguistics, 2020
Ruining He, Anirudh Ravula, Bhargav Kanagal, Joshua Ainslie
21 Dec 2020
Improving Gradient Flow with Unrolled Highway Expectation Maximization
C. Song, Eunseok Kim, Inwook Shim
09 Dec 2020
Document Graph for Neural Machine Translation
Mingzhou Xu, Liangyou Li, Derek F. Wong, Qun Liu, Lidia S. Chao
07 Dec 2020
Page 1 of 2