Staged Training for Transformer Language Models
International Conference on Machine Learning (ICML), 2022
11 Mar 2022
Sheng Shen, Pete Walsh, Kurt Keutzer, Jesse Dodge, Matthew E. Peters, Iz Beltagy
arXiv: 2203.06211 (abs · PDF · HTML) · GitHub (32★)
Papers citing "Staged Training for Transformer Language Models" (showing 37 of 37)
Efficient-Husformer: Efficient Multimodal Transformer Hyperparameter Optimization for Stress and Cognitive Loads
Merey Orazaly, Fariza Temirkhanova, Jurn-Gyu Park
27 Nov 2025 · 68 / 0 / 0
Deep Progressive Training: scaling up depth capacity of zero/one-layer models
Zhiqi Bu
Tags: AI4CE
07 Nov 2025 · 133 / 0 / 0
SCALE: Upscaled Continual Learning of Large Language Models
Jin-woo Lee, Junhwa Choi, Bongkyu Hwang, Jinho Choo, Bogun Kim, ..., Joonseok Lee, DongYoung Jung, Jaeseon Park, Kyoungwon Park, Suk-hoon Jung
Tags: CLL, LRM
05 Nov 2025 · 514 / 0 / 0
ScaleNet: Scaling up Pretrained Neural Networks with Incremental Parameters
Zhiwei Hao, Jianyuan Guo, Li Shen, Kai Han, Yehui Tang, Han Hu, Yunhe Wang
21 Oct 2025 · 234 / 0 / 0
Recycling Pretrained Checkpoints: Orthogonal Growth of Mixture-of-Experts for Efficient Large Language Model Pre-Training
Ruizhe Wang, Yucheng Ding, Xiao Liu, Yaoxiang Wang, Peng Cheng, Baining Guo, Zhengjun Zha, Yeyun Gong
09 Oct 2025 · 145 / 0 / 0
Mid-Training of Large Language Models: A Survey
Kaixiang Mo, Yuxin Shi, Weiwei Weng, Zhiqiang Zhou, Shuman Liu, Haibo Zhang, Anxiang Zeng
Tags: LRM
08 Oct 2025 · 152 / 0 / 0
Sparse Training Scheme for Multimodal LLM
Kean Shi, Liang Chen, Haozhe Zhao, Baobao Chang
16 Sep 2025 · 118 / 0 / 0
Expanding Foundational Language Capabilities in Open-Source LLMs through a Korean Case Study
Junghwan Lim, Gangwon Jo, S. W. Lee, Jiyoung Park, Dongseok Kim, ..., Haesol Lee, Jeesoo Lee, Dongpin Oh, Changseok Song, Daewon Suh
04 Sep 2025 · 96 / 1 / 0
LongCat-Flash Technical Report
M-A-P Team, Bayan, Bei Li, Bingye Lei, Bo Wang, ..., Rongxiang Weng, Ruichen Shao, Rumei Li, Shizhe Wu, Shuai Liang
Tags: MLLM, MoE, VLM
01 Sep 2025 · 425 / 16 / 0
Progressive Depth Up-scaling via Optimal Transport
Mingzi Cao, Xi Wang, Nikolaos Aletras
11 Aug 2025 · 83 / 1 / 0
Frozen Layers: Memory-efficient Many-fidelity Hyperparameter Optimization
Timur Carstensen, Neeratyoy Mallik, Katharina Eggensperger, Martin Rapp
Tags: AI4CE
14 Apr 2025 · 335 / 1 / 0
STEP: Staged Parameter-Efficient Pre-training for Large Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Kazuki Yano, Takumi Ito, Jun Suzuki
Tags: LRM
05 Apr 2025 · 298 / 3 / 0
Stacking as Accelerated Gradient Descent
Naman Agarwal, Pranjal Awasthi, Satyen Kale, Eric Zhao
Tags: ODL
20 Feb 2025 · 280 / 5 / 0
Gap Preserving Distillation by Building Bidirectional Mappings with A Dynamic Teacher
International Conference on Learning Representations (ICLR), 2025
Yong Guo, Shulian Zhang, Haolin Pan, Jing Liu, Yulun Zhang, Jian Chen
05 Oct 2024 · 269 / 0 / 0
Scaling Smart: Accelerating Large Language Model Pre-training with Small Model Initialization
Mohammad Samragh, Iman Mirzadeh, Keivan Alizadeh Vahid, Fartash Faghri, Minsik Cho, Moin Nabi, Devang Naik, Mehrdad Farajtabar
Tags: LRM, AI4CE
19 Sep 2024 · 322 / 20 / 0
DLO: Dynamic Layer Operation for Efficient Vertical Scaling of LLMs
Zhen Tan, Daize Dong, Xinyu Zhao, Jie Peng, Yu Cheng, Tianlong Chen
Tags: MoE
03 Jul 2024 · 207 / 10 / 0
LLaVolta: Efficient Multi-modal Models via Stage-wise Visual Context Compression
Jieneng Chen, Luoxin Ye, Ju He, Zhao-Yang Wang, Daniel Khashabi, Alan Yuille
Tags: VLM
28 Jun 2024 · 173 / 1 / 0
Landscape-Aware Growing: The Power of a Little LAG
Stefani Karp, Nikunj Saunshi, Sobhan Miryoosefi, Sashank J. Reddi, Sanjiv Kumar
04 Jun 2024 · 268 / 1 / 0
Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training
Wenyu Du, Tongxu Luo, Zihan Qiu, Zeyu Huang, Songlin Yang, Reynold Cheng, Wenhan Luo, Jie Fu
24 May 2024 · 220 / 34 / 0
A Multi-Level Framework for Accelerating Training Transformer Models
Longwei Zou, Han Zhang, Yangdong Deng
Tags: AI4CE
07 Apr 2024 · 284 / 3 / 0
Efficient Stagewise Pretraining via Progressive Subnetworks
Abhishek Panigrahi, Nikunj Saunshi, Kaifeng Lyu, Sobhan Miryoosefi, Sashank J. Reddi, Satyen Kale, Sanjiv Kumar
08 Feb 2024 · 189 / 8 / 0
HiFT: A Hierarchical Full Parameter Fine-Tuning Strategy
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Yongkang Liu, Yiqun Zhang, Qian Li, Tong Liu, Shi Feng, Daling Wang, Yifei Zhang, Hinrich Schütze
26 Jan 2024 · 305 / 15 / 0
Preparing Lessons for Progressive Training on Language Models
AAAI Conference on Artificial Intelligence (AAAI), 2024
Yu Pan, Ye Yuan, Yichun Yin, Jiaxin Shi, Zenglin Xu, Ming Zhang, Lifeng Shang, Xin Jiang, Qun Liu
17 Jan 2024 · 269 / 13 / 0
LLaMA Pro: Progressive LLaMA with Block Expansion
Chengyue Wu, Yukang Gan, Yixiao Ge, Zeyu Lu, Jiahao Wang, Ye Feng, Ying Shan, Ping Luo
Tags: CLL
04 Jan 2024 · 241 / 99 / 0
Navigating Scaling Laws: Compute Optimality in Adaptive Model Training
International Conference on Machine Learning (ICML), 2024
Sotiris Anagnostidis, Gregor Bachmann, Imanol Schlag, Thomas Hofmann
06 Nov 2023 · 354 / 2 / 0
Seeking Neural Nuggets: Knowledge Transfer in Large Language Models from a Parametric Perspective
International Conference on Learning Representations (ICLR), 2024
Ming Zhong, Chenxin An, Weizhu Chen, Jiawei Han, Pengcheng He
17 Oct 2023 · 367 / 16 / 0
Reusing Pretrained Models by Multi-linear Operators for Efficient Training
Yu Pan, Ye Yuan, Yichun Yin, Zenglin Xu, Lifeng Shang, Xin Jiang, Qun Liu
16 Oct 2023 · 257 / 17 / 0
LEMON: Lossless model expansion
International Conference on Learning Representations (ICLR), 2024
Yite Wang, Jiahao Su, Hanlin Lu, Cong Xie, Tianyi Liu, Jianbo Yuan, Yanghua Peng, Tian Ding, Hongxia Yang
12 Oct 2023 · 224 / 20 / 0
FLM-101B: An Open LLM and How to Train It with $100K Budget
Xiang Li, Yiqun Yao, Xin Jiang, Xuezhi Fang, Xuying Meng, ..., Li Du, Bowen Qin, Zheng Zhang, Aixin Sun, Yequan Wang
07 Sep 2023 · 463 / 27 / 0
Composable Function-preserving Expansions for Transformer Architectures
Andrea Gesmundo, Kaitlin Maile
Tags: AI4CE
11 Aug 2023 · 257 / 10 / 0
No Train No Gain: Revisiting Efficient Training Algorithms For Transformer-based Language Models
Neural Information Processing Systems (NeurIPS), 2023
Jean Kaddour, Oscar Key, Piotr Nawrot, Pasquale Minervini, Matt J. Kusner
12 Jul 2023 · 439 / 58 / 0
Deep Fusion: Efficient Network Training via Pre-trained Initializations
International Conference on Machine Learning (ICML), 2023
Hanna Mazzawi, X. Gonzalvo, Michael Wunder, Sammy Jerome, Benoit Dherin
Tags: AI4CE
20 Jun 2023 · 526 / 3 / 0
Masked Structural Growth for 2x Faster Language Model Pre-training
International Conference on Learning Representations (ICLR), 2024
Yiqun Yao, Zheng Zhang, Jing Li, Yequan Wang
Tags: OffRL, AI4CE, LRM
04 May 2023 · 314 / 27 / 0
Learning to Grow Pretrained Models for Efficient Transformer Training
International Conference on Learning Representations (ICLR), 2023
Peihao Wang, Yikang Shen, Lucas Torroba Hennigen, P. Greengard, Leonid Karlinsky, Rogerio Feris, David D. Cox, Zinan Lin, Yoon Kim
02 Mar 2023 · 203 / 71 / 0
Cramming: Training a Language Model on a Single GPU in One Day
International Conference on Machine Learning (ICML), 2023
Jonas Geiping, Tom Goldstein
Tags: MoE
28 Dec 2022 · 276 / 103 / 0
Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints
International Conference on Learning Representations (ICLR), 2023
Aran Komatsuzaki, J. Puigcerver, James Lee-Thorp, Carlos Riquelme Ruiz, Basil Mustafa, Joshua Ainslie, Yi Tay, Mostafa Dehghani, N. Houlsby
Tags: MoMe, MoE
09 Dec 2022 · 239 / 170 / 0
Random-LTD: Random and Layerwise Token Dropping Brings Efficient Training for Large-scale Transformers
Z. Yao, Xiaoxia Wu, Conglong Li, Connor Holmes, Minjia Zhang, Cheng-rong Li, Yuxiong He
17 Nov 2022 · 192 / 13 / 0