DeepNet: Scaling Transformers to 1,000 Layers [MoE, AI4CE]
Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, Furu Wei · 1 March 2022 · arXiv: 2203.00555
Papers citing "DeepNet: Scaling Transformers to 1,000 Layers" (47 of 97 papers shown)
BranchNorm: Robustly Scaling Extremely Deep Transformers
Yanjun Liu, Xianfeng Zeng, Fandong Meng, Jie Zhou · 04 May 2023
ResiDual: Transformer with Dual Residual Connections
Shufang Xie, Huishuai Zhang, Junliang Guo, Xu Tan, Jiang Bian, Hany Awadalla, Arul Menezes, Tao Qin, Rui Yan · 28 Apr 2023
LipsFormer: Introducing Lipschitz Continuity to Vision Transformers
Xianbiao Qi, Jianan Wang, Yihao Chen, Yukai Shi, Lei Zhang · 19 Apr 2023
StageInteractor: Query-based Object Detector with Cross-stage Interaction [ObjD]
Yao Teng, Haisong Liu, Sheng Guo, Limin Wang · 11 Apr 2023
Scaling Pre-trained Language Models to Deeper via Parameter-efficient Architecture [MoE]
Peiyu Liu, Ze-Feng Gao, Yushuo Chen, Wayne Xin Zhao, Ji-Rong Wen · 27 Mar 2023
Stabilizing Transformer Training by Preventing Attention Entropy Collapse [AAML]
Shuangfei Zhai, Tatiana Likhomanenko, Etai Littwin, Dan Busbridge, Jason Ramapuram, Yizhe Zhang, Jiatao Gu, J. Susskind · 11 Mar 2023
Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers [MoE]
Tianlong Chen, Zhenyu (Allen) Zhang, Ajay Jaiswal, Shiwei Liu, Zhangyang Wang · 02 Mar 2023
Are More Layers Beneficial to Graph Transformers?
Haiteng Zhao, Shuming Ma, Dongdong Zhang, Zhi-Hong Deng, Furu Wei · 01 Mar 2023
Language Is Not All You Need: Aligning Perception with Language Models [VLM, LRM, MLLM]
Shaohan Huang, Li Dong, Wenhui Wang, Y. Hao, Saksham Singhal, ..., Johan Bjorck, Vishrav Chaudhary, Subhojit Som, Xia Song, Furu Wei · 27 Feb 2023
Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation
Bobby He, James Martens, Guodong Zhang, Aleksandar Botev, Andy Brock, Samuel L. Smith, Yee Whye Teh · 20 Feb 2023
Image Super-Resolution using Efficient Striped Window Transformer
Jinpeng Shi, Hui Li, Tian Yu Liu, Yulong Liu, M. Zhang, Jinchen Zhu, Ling Zheng, Shizhuang Weng · 24 Jan 2023
Memory-efficient NLLB-200: Language-specific Expert Pruning of a Massively Multilingual Machine Translation Model [MoE]
Yeskendir Koishekenov, Alexandre Berard, Vassilina Nikoulina · 19 Dec 2022
BEATs: Audio Pre-Training with Acoustic Tokenizers
Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Daniel C. Tompkins, Zhuo Chen, Furu Wei · 18 Dec 2022
Improving Generalization of Pre-trained Language Models via Stochastic Weight Averaging [MoMe]
Peng Lu, I. Kobyzev, Mehdi Rezagholizadeh, Ahmad Rashid, A. Ghodsi, Philippe Langlais · 12 Dec 2022
Beyond Mahalanobis-Based Scores for Textual OOD Detection [OODD]
Pierre Colombo, Eduardo Dadalto Camara Gomes, Guillaume Staerman, Nathan Noiry, Pablo Piantanida · 24 Nov 2022
TorchScale: Transformers at Scale [AI4CE]
Shuming Ma, Hongyu Wang, Shaohan Huang, Wenhui Wang, Zewen Chi, ..., Alon Benhaim, Barun Patra, Vishrav Chaudhary, Xia Song, Furu Wei · 23 Nov 2022
Learning from partially labeled data for multi-organ and tumor segmentation [MedIm, ViT]
Yutong Xie, Jianpeng Zhang, Yong Xia, Chunhua Shen · 13 Nov 2022
SpeechMatrix: A Large-Scale Mined Corpus of Multilingual Speech-to-Speech Translations
Paul-Ambroise Duquenne, Hongyu Gong, Ning Dong, Jingfei Du, Ann Lee, Vedanuj Goswami, Changhan Wang, J. Pino, Benoît Sagot, Holger Schwenk · 08 Nov 2022
Foundation Transformers [AI4CE]
Hongyu Wang, Shuming Ma, Shaohan Huang, Li Dong, Wenhui Wang, ..., Barun Patra, Zhun Liu, Vishrav Chaudhary, Xia Song, Furu Wei · 12 Oct 2022
Mixture of Attention Heads: Selecting Attention Heads Per Token [MoE]
Xiaofeng Zhang, Yikang Shen, Zeyu Huang, Jie Zhou, Wenge Rong, Zhang Xiong · 11 Oct 2022
GLM-130B: An Open Bilingual Pre-trained Model [BDL, LRM]
Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, ..., Jidong Zhai, Wenguang Chen, Peng Zhang, Yuxiao Dong, Jie Tang · 05 Oct 2022
E-Branchformer: Branchformer with Enhanced Merging for Speech Recognition
Kwangyoun Kim, Felix Wu, Yifan Peng, Jing Pan, Prashant Sridhar, Kyu Jeong Han, Shinji Watanabe · 30 Sep 2022
Informative Language Representation Learning for Massively Multilingual Neural Machine Translation
Renren Jin, Deyi Xiong · 04 Sep 2022
Deep Sparse Conformer for Speech Recognition
Xianchao Wu · 01 Sep 2022
Transformers with Learnable Activation Functions [AI4CE]
Haishuo Fang, Ji-Ung Lee, N. Moosavi, Iryna Gurevych · 30 Aug 2022
ANAct: Adaptive Normalization for Activation Functions
Peiwen Yuan, Henan Liu, Changsheng Zhu, Yuyi Wang · 29 Aug 2022
Pure Transformers are Powerful Graph Learners
Jinwoo Kim, Tien Dat Nguyen, Seonwoo Min, Sungjun Cho, Moontae Lee, Honglak Lee, Seunghoon Hong · 06 Jul 2022
Insights into Pre-training via Simpler Synthetic Tasks [AIMat]
Yuhuai Wu, Felix Li, Percy Liang · 21 Jun 2022
Scaling ResNets in the Large-depth Regime
P. Marion, Adeline Fermanian, Gérard Biau, Jean-Philippe Vert · 14 Jun 2022
Language Models are General-Purpose Interfaces [MLLM]
Y. Hao, Haoyu Song, Li Dong, Shaohan Huang, Zewen Chi, Wenhui Wang, Shuming Ma, Furu Wei · 13 Jun 2022
Signal Propagation in Transformers: Theoretical Perspectives and the Role of Rank Collapse
Lorenzo Noci, Sotiris Anagnostidis, Luca Biggio, Antonio Orvieto, Sidak Pal Singh, Aurelien Lucchi · 07 Jun 2022
Entangled Residual Mappings
Mathias Lechner, Ramin Hasani, Z. Babaiee, Radu Grosu, Daniela Rus, T. Henzinger, Sepp Hochreiter · 02 Jun 2022
VL-BEiT: Generative Vision-Language Pretraining [VLM]
Hangbo Bao, Wenhui Wang, Li Dong, Furu Wei · 02 Jun 2022
Squeezeformer: An Efficient Transformer for Automatic Speech Recognition
Sehoon Kim, A. Gholami, Albert Eaton Shaw, Nicholas Lee, K. Mangalam, Jitendra Malik, Michael W. Mahoney, Kurt Keutzer · 02 Jun 2022
B2T Connection: Serving Stability and Performance in Deep Transformers
Sho Takase, Shun Kiyono, Sosuke Kobayashi, Jun Suzuki · 01 Jun 2022
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness [VLM]
Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré · 27 May 2022
Bitext Mining Using Distilled Sentence Representations for Low-Resource Languages
Kevin Heffernan, Onur Çelebi, Holger Schwenk · 25 May 2022
T-Modules: Translation Modules for Zero-Shot Cross-Modal Machine Translation
Paul-Ambroise Duquenne, Hongyu Gong, Benoît Sagot, Holger Schwenk · 24 May 2022
What Do Compressed Multilingual Machine Translation Models Forget? [AI4CE]
Alireza Mohammadshahi, Vassilina Nikoulina, Alexandre Berard, Caroline Brun, James Henderson, Laurent Besacier · 22 May 2022
A Study on Transformer Configuration and Training Objective
Fuzhao Xue, Jianghai Chen, Aixin Sun, Xiaozhe Ren, Zangwei Zheng, Xiaoxin He, Yongming Chen, Xin Jiang, Yang You · 21 May 2022
Supplementary Material: Implementation and Experiments for GAU-based Model
Zhenjie Liu · 12 May 2022
FoundationLayerNorm: Scaling BERT and GPT to 1,000 Layers [AI4CE]
Dezhou Shen · 09 Apr 2022
LiteTransformerSearch: Training-free Neural Architecture Search for Efficient Language Models
Mojan Javaheripi, Gustavo de Rosa, Subhabrata Mukherjee, S. Shah, Tomasz Religa, C. C. T. Mendes, Sébastien Bubeck, F. Koushanfar, Debadeepta Dey · 04 Mar 2022
On the Implicit Bias Towards Minimal Depth of Deep Neural Networks
Tomer Galanti, Liane Galanti, Ido Ben-Shaul · 18 Feb 2022
data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language [SSL, VLM, ViT]
Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli · 07 Feb 2022
Domain Prompt Learning for Efficiently Adapting CLIP to Unseen Domains [VLM]
X. Zhang, S. Gu, Yutaka Matsuo, Yusuke Iwasawa · 25 Nov 2021
Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, T. Henighan, Tom B. Brown, B. Chess, R. Child, Scott Gray, Alec Radford, Jeff Wu, Dario Amodei · 23 Jan 2020