Transformers without Tears: Improving the Normalization of Self-Attention

14 October 2019
Toan Q. Nguyen
Julian Salazar

Papers citing "Transformers without Tears: Improving the Normalization of Self-Attention"

49 / 149 papers shown
A Survey of Transformers
Tianyang Lin
Yuxin Wang
Xiangyang Liu
Xipeng Qiu
ViT
25
1,077
0
08 Jun 2021
Choose a Transformer: Fourier or Galerkin
Shuhao Cao
14
219
0
31 May 2021
Fast Nearest Neighbor Machine Translation
Yuxian Meng
Xiaoya Li
Xiayu Zheng
Fei Wu
Xiaofei Sun
Tianwei Zhang
Jiwei Li
LRM
11
49
0
30 May 2021
Rethinking Skip Connection with Layer Normalization in Transformers and ResNets
Fenglin Liu
Xuancheng Ren
Zhiyuan Zhang
Xu Sun
Yuexian Zou
AI4CE
14
67
0
15 May 2021
BERT Busters: Outlier Dimensions that Disrupt Transformers
Olga Kovaleva
Saurabh Kulshreshtha
Anna Rogers
Anna Rumshisky
13
85
0
14 May 2021
Global Structure-Aware Drum Transcription Based on Self-Attention Mechanisms
Ryoto Ishizuka
Ryo Nishikimi
Kazuyoshi Yoshii
19
6
0
12 May 2021
Data Augmentation by Concatenation for Low-Resource Translation: A Mystery and a Solution
Toan Q. Nguyen
Kenton W. Murray
David Chiang
16
15
0
04 May 2021
Lessons on Parameter Sharing across Layers in Transformers
Sho Takase
Shun Kiyono
6
84
0
13 Apr 2021
Joint Universal Syntactic and Semantic Parsing
Elias Stengel-Eskin
Kenton W. Murray
Sheng Zhang
Aaron Steven White
Benjamin Van Durme
22
9
0
12 Apr 2021
Non-autoregressive Transformer-based End-to-end ASR using BERT
Fu-Hao Yu
Kuan-Yu Chen
25
22
0
10 Apr 2021
Keyword Transformer: A Self-Attention Model for Keyword Spotting
Axel Berg
Mark O'Connor
M. T. Cruz
11
129
0
01 Apr 2021
Pretraining the Noisy Channel Model for Task-Oriented Dialogue
Qi Liu
Lei Yu
Laura Rimell
Phil Blunsom
31
26
0
18 Mar 2021
Visual Cues and Error Correction for Translation Robustness
Zhenhao Li
Marek Rei
Lucia Specia
8
3
0
12 Mar 2021
Remote Sensing Image Change Detection with Transformers
Hao Chen
Zipeng Qi
Zhenwei Shi
ViT
30
922
0
27 Feb 2021
TransMask: A Compact and Fast Speech Separation Model Based on Transformer
Zining Zhang
Bingsheng He
Zhenjie Zhang
19
20
0
19 Feb 2021
Fast End-to-End Speech Recognition via Non-Autoregressive Models and Cross-Modal Knowledge Transferring from BERT
Ye Bai
Jiangyan Yi
J. Tao
Zhengkun Tian
Zhengqi Wen
Shuai Zhang
RALM
28
50
0
15 Feb 2021
Optimizing Deeper Transformers on Small Datasets
Peng-Tao Xu
Dhruv Kumar
Wei Yang
Wenjie Zi
Keyi Tang
Chenyang Huang
Jackie C.K. Cheung
S. Prince
Yanshuai Cao
AI4CE
10
68
0
30 Dec 2020
Spatial Temporal Transformer Network for Skeleton-based Action Recognition
Chiara Plizzari
Marco Cannici
Matteo Matteucci
ViT
22
191
0
11 Dec 2020
ÚFAL at MRP 2020: Permutation-invariant Semantic Parsing in PERIN
David Samuel
Milan Straka
LRM
11
31
0
02 Nov 2020
Dual-decoder Transformer for Joint Automatic Speech Recognition and Multilingual Speech Translation
Hang Le
J. Pino
Changhan Wang
Jiatao Gu
D. Schwab
Laurent Besacier
39
82
0
02 Nov 2020
Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping
Minjia Zhang
Yuxiong He
AI4CE
8
100
0
26 Oct 2020
Align-Refine: Non-Autoregressive Speech Recognition via Iterative Realignment
Ethan A. Chi
Julian Salazar
Katrin Kirchhoff
AI4TS
11
51
0
24 Oct 2020
Beyond English-Centric Multilingual Machine Translation
Angela Fan
Shruti Bhosale
Holger Schwenk
Zhiyi Ma
Ahmed El-Kishky
...
Vitaliy Liptchinsky
Sergey Edunov
Edouard Grave
Michael Auli
Armand Joulin
LRM
27
822
0
21 Oct 2020
Unsupervised Bitext Mining and Translation via Self-trained Contextual Embeddings
Phillip Keung
Julian Salazar
Y. Lu
Noah A. Smith
SSL
25
25
0
15 Oct 2020
Query-Key Normalization for Transformers
Alex Henry
Prudhvi Raj Dachapally
S. Pawar
Yuxuan Chen
12
75
0
08 Oct 2020
Normalization Techniques in Training DNNs: Methodology, Analysis and Application
Lei Huang
Jie Qin
Yi Zhou
Fan Zhu
Li Liu
Ling Shao
AI4CE
8
254
0
27 Sep 2020
Review: Deep Learning in Electron Microscopy
Jeffrey M. Ede
15
79
0
17 Sep 2020
Very Deep Transformers for Neural Machine Translation
Xiaodong Liu
Kevin Duh
Liyuan Liu
Jianfeng Gao
6
102
0
18 Aug 2020
Skeleton-based Action Recognition via Spatial and Temporal Transformer Networks
Chiara Plizzari
Marco Cannici
Matteo Matteucci
ViT
MedIm
17
297
0
17 Aug 2020
Towards Understanding Label Smoothing
Yi Tian Xu
Yuanhong Xu
Qi Qian
Hao Li
R. L. Jin
UQCV
16
40
0
20 Jun 2020
Normalized Attention Without Probability Cage
Oliver Richter
Roger Wattenhofer
6
21
0
19 May 2020
Conformer: Convolution-augmented Transformer for Speech Recognition
Anmol Gulati
James Qin
Chung-Cheng Chiu
Niki Parmar
Yu Zhang
...
Wei Han
Shibo Wang
Zhengdong Zhang
Yonghui Wu
Ruoming Pang
8
3,012
0
16 May 2020
Listen Attentively, and Spell Once: Whole Sentence Generation via a Non-Autoregressive Architecture for Low-Latency Speech Recognition
Ye Bai
Jiangyan Yi
J. Tao
Zhengkun Tian
Zhengqi Wen
Shuai Zhang
RALM
15
41
0
11 May 2020
Language Model Prior for Low-Resource Neural Machine Translation
Christos Baziotis
Barry Haddow
Alexandra Birch
13
53
0
30 Apr 2020
Understanding the Difficulty of Training Transformers
Liyuan Liu
Xiaodong Liu
Jianfeng Gao
Weizhu Chen
Jiawei Han
AI4CE
6
243
0
17 Apr 2020
Balancing Training for Multilingual Neural Machine Translation
Xinyi Wang
Yulia Tsvetkov
Graham Neubig
14
98
0
14 Apr 2020
On Optimal Transformer Depth for Low-Resource Language Translation
Elan Van Biljon
Arnu Pretorius
Julia Kreutzer
MoE
11
27
0
09 Apr 2020
PowerNorm: Rethinking Batch Normalization in Transformers
Sheng Shen
Z. Yao
A. Gholami
Michael W. Mahoney
Kurt Keutzer
BDL
11
16
0
17 Mar 2020
ReZero is All You Need: Fast Convergence at Large Depth
Thomas C. Bachlechner
Bodhisattwa Prasad Majumder
H. H. Mao
G. Cottrell
Julian McAuley
AI4CE
8
275
0
10 Mar 2020
On Layer Normalization in the Transformer Architecture
Ruibin Xiong
Yunchang Yang
Di He
Kai Zheng
Shuxin Zheng
Chen Xing
Huishuai Zhang
Yanyan Lan
Liwei Wang
Tie-Yan Liu
AI4CE
6
938
0
12 Feb 2020
Normalization of Input-output Shared Embeddings in Text Generation Models
Jinyang Liu
Yujia Zhai
Zizhong Chen
12
0
0
22 Jan 2020
FlauBERT: Unsupervised Language Model Pre-training for French
Hang Le
Loïc Vial
Jibril Frej
Vincent Segonne
Maximin Coavoux
Benjamin Lecouteux
A. Allauzen
Benoît Crabbé
Laurent Besacier
D. Schwab
AI4CE
18
395
0
11 Dec 2019
A Resource for Computational Experiments on Mapudungun
M. Duan
Carlos Fasola
Sai Krishna Rallabandi
R. Vega
Antonios Anastasopoulos
Lori S. Levin
A. Black
4
8
0
04 Dec 2019
Improving Transformer Models by Reordering their Sublayers
Ofir Press
Noah A. Smith
Omer Levy
11
87
0
10 Nov 2019
Masked Language Model Scoring
Julian Salazar
Davis Liang
Toan Q. Nguyen
Katrin Kirchhoff
8
13
0
31 Oct 2019
Stabilizing Transformers for Reinforcement Learning
Emilio Parisotto
H. F. Song
Jack W. Rae
Razvan Pascanu
Çağlar Gülçehre
...
Aidan Clark
Seb Noury
M. Botvinick
N. Heess
R. Hadsell
OffRL
11
359
0
13 Oct 2019
On the adequacy of untuned warmup for adaptive optimization
Jerry Ma
Denis Yarats
44
70
0
09 Oct 2019
Set Functions for Time Series
Max Horn
Michael Moor
Christian Bock
Bastian Alexander Rieck
Karsten M. Borgwardt
AI4TS
16
143
0
26 Sep 2019
Effective Approaches to Attention-based Neural Machine Translation
Thang Luong
Hieu H. Pham
Christopher D. Manning
214
7,687
0
17 Aug 2015