Transformers without Tears: Improving the Normalization of Self-Attention

14 October 2019
Toan Q. Nguyen, Julian Salazar
arXiv: 1910.05895
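
For context on what the citing papers below are building on: the paper's central proposal is ScaleNorm, which replaces LayerNorm with a single learned scalar g applied to the l2-normalized activation, ScaleNorm(x) = g * x / ||x||_2, used in a pre-norm residual arrangement (alongside FixNorm for word embeddings). Below is a minimal PyTorch sketch of that formula, assuming g is initialized to sqrt(d_model) as described in the paper; the module structure and eps guard are implementation choices of this sketch, not the authors' reference code.

```python
import torch
import torch.nn as nn

class ScaleNorm(nn.Module):
    """Sketch of ScaleNorm (Nguyen & Salazar, 2019): a single learned
    scalar g rescales the l2-normalized hidden vector, replacing
    LayerNorm's per-feature gains and biases."""

    def __init__(self, d_model: int, eps: float = 1e-5):
        super().__init__()
        # The paper initializes g to sqrt(d_model).
        self.g = nn.Parameter(torch.tensor(float(d_model) ** 0.5))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # l2 norm over the feature dimension, clamped to avoid division by zero.
        norm = x.norm(dim=-1, keepdim=True).clamp(min=self.eps)
        return self.g * x / norm
```

With only one scalar parameter per normalization site, ScaleNorm is cheaper than LayerNorm's 2*d parameters, which is part of the paper's argument for it in low-resource translation settings.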

Papers citing "Transformers without Tears: Improving the Normalization of Self-Attention"

Showing 50 of 149 citing papers.
A Comparative Study on E-Branchformer vs Conformer in Speech Recognition, Translation, and Understanding Tasks
Yifan Peng, Kwangyoun Kim, Felix Wu, Brian Yan, Siddhant Arora, William Chen, Jiyang Tang, Suwon Shon, Prashant Sridhar, Shinji Watanabe
18 May 2023

Exploring the Impact of Layer Normalization for Zero-shot Neural Machine Translation
Zhuoyuan Mao, Raj Dabre, Qianying Liu, Haiyue Song, Chenhui Chu, Sadao Kurohashi
16 May 2023

Multi-Path Transformer is Better: A Case Study on Neural Machine Translation
Ye Lin, Shuhan Zhou, Yanyang Li, Anxiang Ma, Tong Xiao, Jingbo Zhu
10 May 2023

BranchNorm: Robustly Scaling Extremely Deep Transformers
Yanjun Liu, Xianfeng Zeng, Fandong Meng, Jie Zhou
04 May 2023

DuETT: Dual Event Time Transformer for Electronic Health Records
Alex Labach, Aslesha Pokhrel, Xiao Shi Huang, S. Zuberi, S. Yi, M. Volkovs, T. Poutanen, Rahul G. Krishnan
25 Apr 2023 · AI4TS, MedIm

Trained on 100 million words and still in shape: BERT meets British National Corpus
David Samuel, Andrey Kutuzov, Lilja Øvrelid, Erik Velldal
17 Mar 2023

Stabilizing Transformer Training by Preventing Attention Entropy Collapse
Shuangfei Zhai, Tatiana Likhomanenko, Etai Littwin, Dan Busbridge, Jason Ramapuram, Yizhe Zhang, Jiatao Gu, J. Susskind
11 Mar 2023 · AAML

How Do Transformers Learn Topic Structure: Towards a Mechanistic Understanding
Yuchen Li, Yuan-Fang Li, Andrej Risteski
07 Mar 2023

Gradient Adjusting Networks for Domain Inversion
Erez Sheffi, Michael Rotman, Lior Wolf
22 Feb 2023

Tighter Bounds on the Expressivity of Transformer Encoders
David Chiang, Peter A. Cholak, A. Pillay
25 Jan 2023

On Transforming Reinforcement Learning by Transformer: The Development Trajectory
Shengchao Hu, Li Shen, Ya-Qin Zhang, Yixin Chen, Dacheng Tao
29 Dec 2022 · OffRL

Inductive Attention for Video Action Anticipation
Tsung-Ming Tai, G. Fiameni, Cheng-Kuang Lee, Simon See, O. Lanz
17 Dec 2022

Leveraging commonsense for object localisation in partial scenes
Francesco Giuliari, Geri Skenderi, Marco Cristani, Alessio Del Bue, Yiming Wang
01 Nov 2022

A Continuum of Generation Tasks for Investigating Length Bias and Degenerate Repetition
Darcey Riley, David Chiang
19 Oct 2022

MTet: Multi-domain Translation for English and Vietnamese
C. Ngo, Trieu H. Trinh, Long Phan, H. Tran, Tai Dang, Hieu Duy Nguyen, Minh Le Nguyen, Minh-Thang Luong
11 Oct 2022 · VLM

Transformer Meets Boundary Value Inverse Problems
Ruchi Guo, Shuhao Cao, Long Chen
29 Sep 2022 · MedIm

Rethinking Personalized Ranking at Pinterest: An End-to-End Approach
Jiajing Xu, Andrew Zhai, Charles R. Rosenberg
18 Sep 2022

u-HuBERT: Unified Mixed-Modal Speech Pretraining And Zero-Shot Transfer to Unlabeled Modality
Wei-Ning Hsu, Bowen Shi
14 Jul 2022 · SSL, VLM

$L_2$BN: Enhancing Batch Normalization by Equalizing the $L_2$ Norms of Features
Zhennan Wang, Kehan Li, Runyi Yu, Yian Zhao, Pengchong Qiao, Chang-rui Liu, Fan Xu, Xiangyang Ji, Guoli Song, Jie Chen
06 Jul 2022

Semantic Labeling of High Resolution Images Using EfficientUNets and Transformers
Hasan Almarzouqi, L. Saad Saoud
20 Jun 2022 · ViT

Pretrained Models for Multilingual Federated Learning
Orion Weller, Marc Marone, Vladimir Braverman, Dawn J Lawrie, Benjamin Van Durme
06 Jun 2022 · VLM, FedML, AI4CE

3D-C2FT: Coarse-to-fine Transformer for Multi-view 3D Reconstruction
Leslie Ching Ow Tiong, Dick Sigmund, Andrew Beng Jin Teoh
29 May 2022 · 3DV, ViT

When More Data Hurts: A Troubling Quirk in Developing Broad-Coverage Natural Language Understanding Systems
Elias Stengel-Eskin, Emmanouil Antonios Platanios, Adam Pauls, Sam Thomson, Hao Fang, Benjamin Van Durme, J. Eisner, Yu-Chuan Su
24 May 2022

Semi-Parametric Inducing Point Networks and Neural Processes
R. Rastogi, Yair Schiff, Alon Hacohen, Zhaozhi Li, I-Hsiang Lee, Yuntian Deng, M. Sabuncu, Volodymyr Kuleshov
24 May 2022 · 3DPC

PinnerFormer: Sequence Modeling for User Representation at Pinterest
Nikil Pancha, Andrew Zhai, J. Leskovec, Charles R. Rosenberg
09 May 2022 · AI4TS

Investigating Neural Architectures by Synthetic Dataset Design
Adrien Courtois, Jean-Michel Morel, Pablo Arias
23 Apr 2022

Cross-stitched Multi-modal Encoders
Karan Singla, Daniel Pressel, Ryan Price, Bhargav Srinivas Chinnari, Yeon-Jun Kim, S. Bangalore
20 Apr 2022

GPT-NeoX-20B: An Open-Source Autoregressive Language Model
Sid Black, Stella Biderman, Eric Hallahan, Quentin G. Anthony, Leo Gao, ..., Shivanshu Purohit, Laria Reynolds, J. Tow, Benqi Wang, Samuel Weinbach
14 Apr 2022

FoundationLayerNorm: Scaling BERT and GPT to 1,000 Layers
Dezhou Shen
09 Apr 2022 · AI4CE

Tampered VAE for Improved Satellite Image Time Series Classification
Xin Cai, Y. Bi, Peter Nicholl
30 Mar 2022 · AI4TS

Seq-2-Seq based Refinement of ASR Output for Spoken Name Capture
Karan Singla, S. Jalalvand, Yeon-Jun Kim, Ryan Price, Daniel Pressel, S. Bangalore
29 Mar 2022

DeepNet: Scaling Transformers to 1,000 Layers
Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, Furu Wei
01 Mar 2022 · MoE, AI4CE

Overcoming a Theoretical Limitation of Self-Attention
David Chiang, Peter A. Cholak
24 Feb 2022

Transformer Quality in Linear Time
Weizhe Hua, Zihang Dai, Hanxiao Liu, Quoc V. Le
21 Feb 2022

Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model
Shaden Smith, M. Patwary, Brandon Norick, P. LeGresley, Samyam Rajbhandari, ..., M. Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, Bryan Catanzaro
28 Jan 2022 · MoE

Convolutional Xformers for Vision
Pranav Jeevan, Amit Sethi
25 Jan 2022 · ViT

Faster Nearest Neighbor Machine Translation
Shuhe Wang, Jiwei Li, Yuxian Meng, Rongbin Ouyang, Guoyin Wang, Xiaoya Li, Tianwei Zhang, Shi Zong
15 Dec 2021

Dynamic Token Normalization Improves Vision Transformers
Wenqi Shao, Yixiao Ge, Zhaoyang Zhang, Xuyuan Xu, Xiaogang Wang, Ying Shan, Ping Luo
05 Dec 2021 · ViT

KNAS: Green Neural Architecture Search
Jingjing Xu, Liang Zhao, Junyang Lin, Rundong Gao, Xu Sun, Hongxia Yang
26 Nov 2021

Efficient Self-Ensemble for Semantic Segmentation
Walid Bousselham, Guillaume Thibault, Lucas Pagano, Archana Machireddy, Joe W. Gray, Y. Chang, Xubo B. Song
26 Nov 2021 · ViT

Attention Approximates Sparse Distributed Memory
Trenton Bricken, C. Pehlevan
10 Nov 2021

Long-Range Transformers for Dynamic Spatiotemporal Forecasting
J. E. Grigsby, Zhe Wang, Nam Nguyen, Yanjun Qi
24 Sep 2021 · AI4TS

Can the Transformer Be Used as a Drop-in Replacement for RNNs in Text-Generating GANs?
Kevin Blin, Andrei Kucharavy
26 Aug 2021

Variational Graph Normalized Auto-Encoders
S. Ahn, Myoung-Ho Kim
18 Aug 2021

Knowledge Transfer by Discriminative Pre-training for Academic Performance Prediction
Byungsoo Kim, Hangyeol Yu, Dongmin Shin, Youngduck Choi
28 Jun 2021

High-probability Bounds for Non-Convex Stochastic Optimization with Heavy Tails
Ashok Cutkosky, Harsh Mehta
28 Jun 2021

LegoFormer: Transformers for Block-by-Block Multi-view 3D Reconstruction
Farid Yagubbayli, Yida Wang, A. Tonioni, Federico Tombari
23 Jun 2021 · ViT

Revisiting Deep Learning Models for Tabular Data
Yu. V. Gorishniy, Ivan Rubachev, Valentin Khrulkov, Artem Babenko
22 Jun 2021 · LMTD

Multi-head or Single-head? An Empirical Comparison for Transformer Training
Liyuan Liu, Jialu Liu, Jiawei Han
17 Jun 2021

Interpretable Self-supervised Multi-task Learning for COVID-19 Information Retrieval and Extraction
Nima Ebadi, Peyman Najafirad
15 Jun 2021