ResearchTrend.AI

Efficient Training of Large Language Models on Distributed Infrastructures: A Survey

29 July 2024
Jiangfei Duan, Shuo Zhang, Zerui Wang, Lijuan Jiang, Wenwen Qu, Qi Hu, Guoteng Wang, Qizhen Weng, Hang Yan, Xingcheng Zhang, Xipeng Qiu, Dahua Lin, Yonggang Wen, Xin Jin, Tianwei Zhang, Peng Sun

Papers citing "Efficient Training of Large Language Models on Distributed Infrastructures: A Survey"

15 papers shown
A Codesign of Scheduling and Parallelization for Large Model Training in Heterogeneous Clusters
Chunyu Xue, Weihao Cui, Han Zhao, Quan Chen, Shulai Zhang, Peng Yang, Jing Yang, Shaobo Li, Minyi Guo
24 Mar 2024

Adding NVMe SSDs to Enable and Accelerate 100B Model Fine-tuning on a Single GPU
Changyue Liao, Mo Sun, Zihan Yang, Kaiqi Chen, Binhang Yuan, Fei Wu, Zeke Wang
11 Mar 2024

SuperBench: Improving Cloud AI Infrastructure Reliability with Proactive Validation
Yifan Xiong, Yuting Jiang, Ziyue Yang, L. Qu, Guoshuai Zhao, ..., Luke Melton, Joe Chau, Peng Cheng, Yongqiang Xiong, Lidong Zhou
09 Feb 2024

GMLake: Efficient and Transparent GPU Memory Defragmentation for Large-scale DNN Training with Virtual Memory Stitching
Cong Guo, Rui Zhang, Jiale Xu, Jingwen Leng, Zihan Liu, ..., Minyi Guo, Hao Wu, Shouren Zhao, Junping Zhao, Ke Zhang
16 Jan 2024

A Survey of Resource-efficient LLM and Multimodal Foundation Models
Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, ..., Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu
16 Jan 2024

Optimizing Distributed Training on Frontier for Large Language Models
Sajal Dash, Isaac Lyngaas, Junqi Yin, Xiao Wang, Romain Egele, Guojing Cong, Feiyi Wang, Prasanna Balaprakash
20 Dec 2023

ALCOP: Automatic Load-Compute Pipelining in Deep Learning Compiler for AI-GPUs
Guyue Huang, Yang Bai, L. Liu, Yuke Wang, Bei Yu, Yufei Ding, Yuan Xie
29 Oct 2022

ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs
Yujia Zhai, Chengquan Jiang, Leyuan Wang, Xiaoying Jia, Shang Zhang, Zizhong Chen, Xin Liu, Yibo Zhu
06 Oct 2022

Tutel: Adaptive Mixture-of-Experts at Scale
Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang, Ze Liu, ..., Joe Chau, Peng Cheng, Fan Yang, Mao Yang, Y. Xiong
07 Jun 2022

Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe
04 Mar 2022

Varuna: Scalable, Low-cost Training of Massive Deep Learning Models
Sanjith Athlur, Nitika Saran, Muthian Sivathanu, R. Ramjee, Nipun Kwatra
07 Nov 2021

Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines
Shigang Li, Torsten Hoefler
14 Jul 2021

ZeRO-Offload: Democratizing Billion-Scale Model Training
Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyang Yang, Minjia Zhang, Dong Li, Yuxiong He
18 Jan 2021

Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, T. Henighan, Tom B. Brown, B. Chess, R. Child, Scott Gray, Alec Radford, Jeff Wu, Dario Amodei
23 Jan 2020

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
M. Shoeybi, M. Patwary, Raul Puri, P. LeGresley, Jared Casper, Bryan Catanzaro
17 Sep 2019