Transformers without Tears: Improving the Normalization of Self-Attention

14 October 2019
Toan Q. Nguyen, Julian Salazar
arXiv: 1910.05895
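
For context on what the citing papers below are building on: the paper's central proposal is ScaleNorm, which replaces LayerNorm with a single learned scalar g applied to the l2-normalized activation, ScaleNorm(x) = g * x / ||x||_2, used in a pre-norm residual arrangement (alongside FixNorm for word embeddings). Below is a minimal PyTorch sketch of that formula, assuming g is initialized to sqrt(d_model) as described in the paper; the module structure and eps guard are implementation choices of this sketch, not the authors' reference code.

```python
import torch
import torch.nn as nn

class ScaleNorm(nn.Module):
    """Sketch of ScaleNorm (Nguyen & Salazar, 2019): a single learned
    scalar g rescales the l2-normalized hidden vector, replacing
    LayerNorm's per-feature gains and biases."""

    def __init__(self, d_model: int, eps: float = 1e-5):
        super().__init__()
        # The paper initializes g to sqrt(d_model).
        self.g = nn.Parameter(torch.tensor(float(d_model) ** 0.5))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # l2 norm over the feature dimension, clamped to avoid division by zero.
        norm = x.norm(dim=-1, keepdim=True).clamp(min=self.eps)
        return self.g * x / norm
```

With only one scalar parameter per normalization site, ScaleNorm is cheaper than LayerNorm's 2*d parameters, which is part of the paper's argument for it in low-resource translation settings.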

Papers citing "Transformers without Tears: Improving the Normalization of Self-Attention"

Showing 50 of 149 citing papers.
A Comparative Study on E-Branchformer vs Conformer in Speech Recognition, Translation, and Understanding Tasks
Yifan Peng, Kwangyoun Kim, Felix Wu, Brian Yan, Siddhant Arora, William Chen, Jiyang Tang, Suwon Shon, Prashant Sridhar, Shinji Watanabe
18 May 2023

Exploring the Impact of Layer Normalization for Zero-shot Neural Machine Translation
Zhuoyuan Mao, Raj Dabre, Qianying Liu, Haiyue Song, Chenhui Chu, Sadao Kurohashi
16 May 2023

Multi-Path Transformer is Better: A Case Study on Neural Machine Translation
Ye Lin, Shuhan Zhou, Yanyang Li, Anxiang Ma, Tong Xiao, Jingbo Zhu
10 May 2023

BranchNorm: Robustly Scaling Extremely Deep Transformers
Yanjun Liu, Xianfeng Zeng, Fandong Meng, Jie Zhou
04 May 2023

DuETT: Dual Event Time Transformer for Electronic Health Records
Alex Labach, Aslesha Pokhrel, Xiao Shi Huang, S. Zuberi, S. Yi, M. Volkovs, T. Poutanen, Rahul G. Krishnan
25 Apr 2023 · AI4TS, MedIm

Trained on 100 million words and still in shape: BERT meets British National Corpus
David Samuel, Andrey Kutuzov, Lilja Øvrelid, Erik Velldal
17 Mar 2023

Stabilizing Transformer Training by Preventing Attention Entropy Collapse
Shuangfei Zhai, Tatiana Likhomanenko, Etai Littwin, Dan Busbridge, Jason Ramapuram, Yizhe Zhang, Jiatao Gu, J. Susskind
11 Mar 2023 · AAML

How Do Transformers Learn Topic Structure: Towards a Mechanistic Understanding
Yuchen Li, Yuan-Fang Li, Andrej Risteski
07 Mar 2023

Gradient Adjusting Networks for Domain Inversion
Erez Sheffi, Michael Rotman, Lior Wolf
22 Feb 2023

Tighter Bounds on the Expressivity of Transformer Encoders
David Chiang, Peter A. Cholak, A. Pillay
25 Jan 2023

On Transforming Reinforcement Learning by Transformer: The Development Trajectory
Shengchao Hu, Li Shen, Ya-Qin Zhang, Yixin Chen, Dacheng Tao
29 Dec 2022 · OffRL

Inductive Attention for Video Action Anticipation
Tsung-Ming Tai, G. Fiameni, Cheng-Kuang Lee, Simon See, O. Lanz
17 Dec 2022

Leveraging commonsense for object localisation in partial scenes
Francesco Giuliari, Geri Skenderi, Marco Cristani, Alessio Del Bue, Yiming Wang
01 Nov 2022

A Continuum of Generation Tasks for Investigating Length Bias and Degenerate Repetition
Darcey Riley, David Chiang
19 Oct 2022

MTet: Multi-domain Translation for English and Vietnamese
C. Ngo, Trieu H. Trinh, Long Phan, H. Tran, Tai Dang, Hieu Duy Nguyen, Minh Le Nguyen, Minh-Thang Luong
11 Oct 2022 · VLM

Transformer Meets Boundary Value Inverse Problems
Ruchi Guo, Shuhao Cao, Long Chen
29 Sep 2022 · MedIm

Rethinking Personalized Ranking at Pinterest: An End-to-End Approach
Jiajing Xu, Andrew Zhai, Charles R. Rosenberg
18 Sep 2022

u-HuBERT: Unified Mixed-Modal Speech Pretraining And Zero-Shot Transfer to Unlabeled Modality
Wei-Ning Hsu, Bowen Shi
14 Jul 2022 · SSL, VLM

$L_2$BN: Enhancing Batch Normalization by Equalizing the $L_2$ Norms of Features
Zhennan Wang, Kehan Li, Runyi Yu, Yian Zhao, Pengchong Qiao, Chang-rui Liu, Fan Xu, Xiangyang Ji, Guoli Song, Jie Chen
06 Jul 2022

Semantic Labeling of High Resolution Images Using EfficientUNets and Transformers
Hasan Almarzouqi, L. Saad Saoud
20 Jun 2022 · ViT

Pretrained Models for Multilingual Federated Learning
Orion Weller, Marc Marone, Vladimir Braverman, Dawn J Lawrie, Benjamin Van Durme
06 Jun 2022 · VLM, FedML, AI4CE

3D-C2FT: Coarse-to-fine Transformer for Multi-view 3D Reconstruction
Leslie Ching Ow Tiong, Dick Sigmund, Andrew Beng Jin Teoh
29 May 2022 · 3DV, ViT

When More Data Hurts: A Troubling Quirk in Developing Broad-Coverage Natural Language Understanding Systems
Elias Stengel-Eskin, Emmanouil Antonios Platanios, Adam Pauls, Sam Thomson, Hao Fang, Benjamin Van Durme, J. Eisner, Yu-Chuan Su
24 May 2022

Semi-Parametric Inducing Point Networks and Neural Processes
R. Rastogi, Yair Schiff, Alon Hacohen, Zhaozhi Li, I-Hsiang Lee, Yuntian Deng, M. Sabuncu, Volodymyr Kuleshov
24 May 2022 · 3DPC

PinnerFormer: Sequence Modeling for User Representation at Pinterest
Nikil Pancha, Andrew Zhai, J. Leskovec, Charles R. Rosenberg
09 May 2022 · AI4TS

Investigating Neural Architectures by Synthetic Dataset Design
Adrien Courtois, Jean-Michel Morel, Pablo Arias
23 Apr 2022

Cross-stitched Multi-modal Encoders
Karan Singla, Daniel Pressel, Ryan Price, Bhargav Srinivas Chinnari, Yeon-Jun Kim, S. Bangalore
20 Apr 2022

GPT-NeoX-20B: An Open-Source Autoregressive Language Model
Sid Black, Stella Biderman, Eric Hallahan, Quentin G. Anthony, Leo Gao, ..., Shivanshu Purohit, Laria Reynolds, J. Tow, Benqi Wang, Samuel Weinbach
14 Apr 2022

FoundationLayerNorm: Scaling BERT and GPT to 1,000 Layers
Dezhou Shen
09 Apr 2022 · AI4CE

Tampered VAE for Improved Satellite Image Time Series Classification
Xin Cai, Y. Bi, Peter Nicholl
30 Mar 2022 · AI4TS

Seq-2-Seq based Refinement of ASR Output for Spoken Name Capture
Karan Singla, S. Jalalvand, Yeon-Jun Kim, Ryan Price, Daniel Pressel, S. Bangalore
29 Mar 2022

DeepNet: Scaling Transformers to 1,000 Layers
Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, Furu Wei
01 Mar 2022 · MoE, AI4CE

Overcoming a Theoretical Limitation of Self-Attention
David Chiang, Peter A. Cholak
24 Feb 2022

Transformer Quality in Linear Time
Weizhe Hua, Zihang Dai, Hanxiao Liu, Quoc V. Le
21 Feb 2022

Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model
Shaden Smith, M. Patwary, Brandon Norick, P. LeGresley, Samyam Rajbhandari, ..., M. Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, Bryan Catanzaro
28 Jan 2022 · MoE

Convolutional Xformers for Vision
Pranav Jeevan, Amit Sethi
25 Jan 2022 · ViT

Faster Nearest Neighbor Machine Translation
Shuhe Wang, Jiwei Li, Yuxian Meng, Rongbin Ouyang, Guoyin Wang, Xiaoya Li, Tianwei Zhang, Shi Zong
15 Dec 2021

Dynamic Token Normalization Improves Vision Transformers
Wenqi Shao, Yixiao Ge, Zhaoyang Zhang, Xuyuan Xu, Xiaogang Wang, Ying Shan, Ping Luo
05 Dec 2021 · ViT

KNAS: Green Neural Architecture Search
Jingjing Xu, Liang Zhao, Junyang Lin, Rundong Gao, Xu Sun, Hongxia Yang
26 Nov 2021

Efficient Self-Ensemble for Semantic Segmentation
Walid Bousselham, Guillaume Thibault, Lucas Pagano, Archana Machireddy, Joe W. Gray, Y. Chang, Xubo B. Song
26 Nov 2021 · ViT

Attention Approximates Sparse Distributed Memory
Trenton Bricken, C. Pehlevan
10 Nov 2021

Long-Range Transformers for Dynamic Spatiotemporal Forecasting
J. E. Grigsby, Zhe Wang, Nam Nguyen, Yanjun Qi
24 Sep 2021 · AI4TS

Can the Transformer Be Used as a Drop-in Replacement for RNNs in Text-Generating GANs?
Kevin Blin, Andrei Kucharavy
26 Aug 2021

Variational Graph Normalized Auto-Encoders
S. Ahn, Myoung-Ho Kim
18 Aug 2021

Knowledge Transfer by Discriminative Pre-training for Academic Performance Prediction
Byungsoo Kim, Hangyeol Yu, Dongmin Shin, Youngduck Choi
28 Jun 2021

High-probability Bounds for Non-Convex Stochastic Optimization with Heavy Tails
Ashok Cutkosky, Harsh Mehta
28 Jun 2021

LegoFormer: Transformers for Block-by-Block Multi-view 3D Reconstruction
Farid Yagubbayli, Yida Wang, A. Tonioni, Federico Tombari
23 Jun 2021 · ViT

Revisiting Deep Learning Models for Tabular Data
Yu. V. Gorishniy, Ivan Rubachev, Valentin Khrulkov, Artem Babenko
22 Jun 2021 · LMTD

Multi-head or Single-head? An Empirical Comparison for Transformer Training
Liyuan Liu, Jialu Liu, Jiawei Han
17 Jun 2021

Interpretable Self-supervised Multi-task Learning for COVID-19 Information Retrieval and Extraction
Nima Ebadi, Peyman Najafirad
15 Jun 2021