DeepNet: Scaling Transformers to 1,000 Layers
1 March 2022
Hongyu Wang
Shuming Ma
Li Dong
Shaohan Huang
Dongdong Zhang
Furu Wei
MoE
AI4CE
Papers citing
"DeepNet: Scaling Transformers to 1,000 Layers"
50 / 97 papers shown
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
Zihan Qiu
Zekun Wang
Bo Zheng
Zeyu Huang
Kaiyue Wen
...
Fei Huang
Suozhi Huang
Dayiheng Liu
Jingren Zhou
Junyang Lin
MoE
28
0
0
10 May 2025
Don't be lazy: CompleteP enables compute-efficient deep transformers
Nolan Dey
Bin Claire Zhang
Lorenzo Noci
Mufan Bill Li
Blake Bordelon
Shane Bergsma
C. Pehlevan
Boris Hanin
Joel Hestness
39
0
0
02 May 2025
Dense Backpropagation Improves Training for Sparse Mixture-of-Experts
Ashwinee Panda
Vatsal Baherwani
Zain Sarwar
Benjamin Thérien
Supriyo Chakraborty
Tom Goldstein
MoE
39
0
0
16 Apr 2025
Model Hemorrhage and the Robustness Limits of Large Language Models
Ziyang Ma
Zehan Li
L. Zhang
Gui-Song Xia
Bo Du
Liangpei Zhang
Dacheng Tao
59
0
0
31 Mar 2025
Exploring the Roles of Large Language Models in Reshaping Transportation Systems: A Survey, Framework, and Roadmap
Tong Nie
Jian Sun
Wei Ma
72
1
0
27 Mar 2025
HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization
Zhijian Zhuo
Yutao Zeng
Ya Wang
Sijun Zhang
Jian Yang
Xiaoqing Li
Xun Zhou
Jinwen Ma
51
0
0
06 Mar 2025
Stable-SPAM: How to Train in 4-Bit More Stably than 16-Bit Adam
Tianjin Huang
Haotian Hu
Zhenyu (Allen) Zhang
Gaojie Jin
Xianrui Li
...
Tianlong Chen
Lu Liu
Qingsong Wen
Zhangyang Wang
Shiwei Liu
MQ
39
0
0
24 Feb 2025
The Curse of Depth in Large Language Models
Wenfang Sun
Xinyuan Song
Pengxiang Li
Lu Yin
Yefeng Zheng
Shiwei Liu
72
4
0
09 Feb 2025
SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training
Tianjin Huang
Ziquan Zhu
Gaojie Jin
Lu Liu
Zhangyang Wang
Shiwei Liu
44
1
0
12 Jan 2025
Foundations of GenIR
Qingyao Ai
Jingtao Zhan
Yong-Jin Liu
51
0
0
06 Jan 2025
Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN
Pengxiang Li
Lu Yin
Shiwei Liu
70
4
0
18 Dec 2024
Predictor-Corrector Enhanced Transformers with Exponential Moving Average Coefficient Learning
Yangqiu Song
Tong Zheng
R. Wang
Jiahao Liu
Qingyan Guo
...
Xu Tan
Tong Xiao
Jingbo Zhu
J. Wang
Xunliang Cai
58
1
0
05 Nov 2024
Training Compute-Optimal Protein Language Models
Xingyi Cheng
Bo Chen
Pan Li
Jing Gong
Jie Tang
Le Song
84
13
0
04 Nov 2024
Initialization of Large Language Models via Reparameterization to Mitigate Loss Spikes
Kosuke Nishida
Kyosuke Nishida
Kuniko Saito
33
1
0
07 Oct 2024
Mastering Chess with a Transformer Model
Daniel Monroe
The Leela Chess Zero Team
29
3
0
18 Sep 2024
LCS: A Language Converter Strategy for Zero-Shot Neural Machine Translation
Zengkui Sun
Yijin Liu
Fandong Meng
Jinan Xu
Yufeng Chen
Jie Zhou
45
2
0
05 Jun 2024
Transformers Can Do Arithmetic with the Right Embeddings
Sean McLeish
Arpit Bansal
Alex Stein
Neel Jain
John Kirchenbauer
...
B. Kailkhura
A. Bhatele
Jonas Geiping
Avi Schwarzschild
Tom Goldstein
53
28
0
27 May 2024
Initialization is Critical to Whether Transformers Fit Composite Functions by Reasoning or Memorizing
Zhongwang Zhang
Pengxiao Lin
Zhiwei Wang
Yaoyu Zhang
Z. Xu
39
3
0
08 May 2024
ShortGPT: Layers in Large Language Models are More Redundant Than You Expect
Xin Men
Mingyu Xu
Qingyu Zhang
Bingning Wang
Hongyu Lin
Yaojie Lu
Xianpei Han
Weipeng Chen
25
103
0
06 Mar 2024
Why Transformers Need Adam: A Hessian Perspective
Yushun Zhang
Congliang Chen
Tian Ding
Ziniu Li
Ruoyu Sun
Zhimin Luo
37
43
0
26 Feb 2024
xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein
Bo Chen
Xingyi Cheng
Pan Li
Yangli-ao Geng
Jing Gong
...
Chiming Liu
Aohan Zeng
Yuxiao Dong
Jie Tang
Leo T. Song
42
101
0
11 Jan 2024
FourCastNeXt: Optimizing FourCastNet Training for Limited Compute
Edison Guo
Maruf Ahmed
Yue Sun
Rui Yang
Harrison Cook
Tennessee Leeuwenburg
Ben Evans
10
1
0
10 Jan 2024
SpecFormer: Guarding Vision Transformer Robustness via Maximum Singular Value Penalization
Xixu Hu
Runkai Zheng
Jindong Wang
Cheuk Hang Leung
Qi Wu
Xing Xie
35
1
0
02 Jan 2024
Heterogeneous Encoders Scaling In The Transformer For Neural Machine Translation
J. Hu
Roberto Cavicchioli
Giulia Berardinelli
Alessandro Capotondi
41
2
0
26 Dec 2023
Efficient LLM inference solution on Intel GPU
Hui Wu
Yi Gan
Feng Yuan
Jing Ma
Wei Zhu
...
Hong Zhu
Yuhua Zhu
Xiaoli Liu
Jinghui Gu
Peng Zhao
24
3
0
19 Dec 2023
Simplifying Transformer Blocks
Bobby He
Thomas Hofmann
25
30
0
03 Nov 2023
Unpacking the Ethical Value Alignment in Big Models
Xiaoyuan Yi
Jing Yao
Xiting Wang
Xing Xie
24
11
0
26 Oct 2023
MindLLM: Pre-training Lightweight Large Language Model from Scratch, Evaluations and Domain Applications
Yizhe Yang
Huashan Sun
Jiawei Li
Runheng Liu
Yinghao Li
Yuhang Liu
Heyan Huang
Yang Gao
ALM
LRM
10
8
0
24 Oct 2023
Modular Speech-to-Text Translation for Zero-Shot Cross-Modal Transfer
Paul-Ambroise Duquenne
Holger Schwenk
Benoît Sagot
42
3
0
05 Oct 2023
Astroconformer: The Prospects of Analyzing Stellar Light Curves with Transformer-Based Deep Learning Models
Kishankumar Bhimani
Yuan-Sen Ting
Jie Yu
16
4
0
28 Sep 2023
Masked Image Residual Learning for Scaling Deeper Vision Transformers
Guoxi Huang
Hongtao Fu
A. Bors
34
7
0
25 Sep 2023
CoMFLP: Correlation Measure based Fast Search on ASR Layer Pruning
W. Liu
Zhiyuan Peng
Tan Lee
11
1
0
21 Sep 2023
Implicit regularization of deep residual networks towards neural ODEs
P. Marion
Yu-Han Wu
Michael E. Sander
Gérard Biau
34
14
0
03 Sep 2023
Retentive Network: A Successor to Transformer for Large Language Models
Yutao Sun
Li Dong
Shaohan Huang
Shuming Ma
Yuqing Xia
Jilong Xue
Jianyong Wang
Furu Wei
LRM
75
301
0
17 Jul 2023
A Comprehensive Overview of Large Language Models
Humza Naveed
Asad Ullah Khan
Shi Qiu
Muhammad Saqib
Saeed Anwar
Muhammad Usman
Naveed Akhtar
Nick Barnes
Ajmal Saeed Mian
OffRL
70
525
0
12 Jul 2023
LongNet: Scaling Transformers to 1,000,000,000 Tokens
Jiayu Ding
Shuming Ma
Li Dong
Xingxing Zhang
Shaohan Huang
Wenhui Wang
Nanning Zheng
Furu Wei
CLL
41
151
0
05 Jul 2023
Spike-driven Transformer
Man Yao
Jiakui Hu
Zhaokun Zhou
Liuliang Yuan
Yonghong Tian
Boxing Xu
Guoqi Li
34
114
0
04 Jul 2023
Nonparametric Classification on Low Dimensional Manifolds using Overparameterized Convolutional Residual Networks
Kaiqi Zhang
Zixuan Zhang
Minshuo Chen
Yuma Takeda
Mengdi Wang
Tuo Zhao
Yu-Xiang Wang
32
0
0
04 Jul 2023
H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
Zhenyu (Allen) Zhang
Ying Sheng
Dinesh Manocha
Tianlong Chen
Lianmin Zheng
...
Yuandong Tian
Christopher Ré
Clark W. Barrett
Zhangyang Wang
Beidi Chen
VLM
49
254
0
24 Jun 2023
Understanding Optimization of Deep Learning via Jacobian Matrix and Lipschitz Constant
Xianbiao Qi
Jianan Wang
Lei Zhang
15
0
0
15 Jun 2023
Understanding Parameter Sharing in Transformers
Ye Lin
Mingxuan Wang
Zhexi Zhang
Xiaohui Wang
Tong Xiao
Jingbo Zhu
MoE
21
2
0
15 Jun 2023
MobileNMT: Enabling Translation in 15MB and 30ms
Ye Lin
Xiaohui Wang
Zhexi Zhang
Mingxuan Wang
Tong Xiao
Jingbo Zhu
MQ
30
1
0
07 Jun 2023
Unlocking the Potential of Federated Learning for Deeper Models
Hao Wang
Xuefeng Liu
Jianwei Niu
Shaojie Tang
Jiaxing Shen
FedML
AI4CE
9
1
0
05 Jun 2023
MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training
Yizhi Li
Ruibin Yuan
Ge Zhang
Yi Ma
Xingran Chen
...
Yemin Shi
Wen-Fen Huang
Zili Wang
Yi-Ting Guo
Jie Fu
25
108
0
31 May 2023
Neural Machine Translation with Dynamic Graph Convolutional Decoder
Lei Li
Kai Fan
Ling Yang
Hongjian Li
Chun Yuan
40
4
0
28 May 2023
VanillaKD: Revisit the Power of Vanilla Knowledge Distillation from Small Scale to Large Scale
Zhiwei Hao
Jianyuan Guo
Kai Han
Han Hu
Chang Xu
Yunhe Wang
35
16
0
25 May 2023
VanillaNet: the Power of Minimalism in Deep Learning
Hanting Chen
Yunhe Wang
Jianyuan Guo
Dacheng Tao
VLM
34
85
0
22 May 2023
Less is More! A slim architecture for optimal language translation
Luca Herranz-Celotti
E. Rrapaj
28
0
0
18 May 2023
Towards Understanding and Improving Knowledge Distillation for Neural Machine Translation
Songming Zhang
Yunlong Liang
Shuaibo Wang
Wenjuan Han
Jian Liu
Jinan Xu
Yufeng Chen
23
8
0
14 May 2023
Multi-Path Transformer is Better: A Case Study on Neural Machine Translation
Ye Lin
Shuhan Zhou
Yanyang Li
Anxiang Ma
Tong Xiao
Jingbo Zhu
32
0
0
10 May 2023