arXiv:2405.15319

Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training

24 May 2024
Wenyu Du
Tongxu Luo
Zihan Qiu
Zeyu Huang
Yikang Shen
Reynold Cheng
Yike Guo
Jie Fu
Abstract

LLMs are computationally expensive to pre-train due to their large scale. Model growth emerges as a promising approach by leveraging smaller models to accelerate the training of larger ones. However, the viability of these model growth methods in efficient LLM pre-training remains underexplored. This work identifies three critical obstacles: (O1) lack of comprehensive evaluation, (O2) untested viability for scaling, and (O3) lack of empirical guidelines. To tackle O1, we summarize existing approaches into four atomic growth operators and systematically evaluate them in a standardized LLM pre-training setting. Our findings reveal that a depthwise stacking operator, called $G_{\text{stack}}$, exhibits remarkable acceleration in training, leading to decreased loss and improved overall performance on eight standard NLP benchmarks compared to strong baselines. Motivated by these promising results, we conduct extensive experiments to delve deeper into $G_{\text{stack}}$ to address O2 and O3. For O2 (untested scalability), our study shows that $G_{\text{stack}}$ is scalable and consistently performs well, with experiments up to 7B LLMs after growth and pre-training LLMs with 750B tokens. For example, compared to a conventionally trained 7B model using 300B tokens, our $G_{\text{stack}}$ model converges to the same loss with 194B tokens, resulting in a 54.6% speedup. We further address O3 (lack of empirical guidelines) by formalizing guidelines to determine growth timing and growth factor for $G_{\text{stack}}$, making it practical in general LLM pre-training. We also provide in-depth discussions and comprehensive ablation studies of $G_{\text{stack}}$. Our code and pre-trained model are available at https://llm-stacking.github.io/.
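The core idea behind a depthwise stacking operator is to grow a trained small model into a deeper one by repeating its layer stack, then continue pre-training the grown model. Below is a minimal PyTorch-style sketch of that general idea; it assumes the operator simply duplicates the full trained layer block `growth_factor` times, and the function and parameter names are illustrative rather than the paper's actual implementation (the exact $G_{\text{stack}}$ recipe, e.g., how copied layers are ordered or re-initialized, is specified in the paper itself).

```python
import copy
import torch.nn as nn

def depthwise_stack(small_layers: nn.ModuleList, growth_factor: int) -> nn.ModuleList:
    """Grow an L-layer transformer into a (growth_factor * L)-layer one
    by repeating the trained layer block. Illustrative sketch only; not
    the paper's exact G_stack procedure."""
    grown_layers = []
    for _ in range(growth_factor):
        for layer in small_layers:
            # Copy trained weights so the grown model starts from the
            # small model's knowledge rather than random initialization.
            grown_layers.append(copy.deepcopy(layer))
    return nn.ModuleList(grown_layers)

# Hypothetical usage: grow a 6-layer model into a 24-layer one (growth factor 4),
# then resume standard pre-training on the grown model.
# big_model.layers = depthwise_stack(small_model.layers, growth_factor=4)
```

For reference, the reported 54.6% speedup corresponds to reaching the 300B-token baseline's loss with only 194B tokens: (300 - 194) / 194 ≈ 0.546.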
