arXiv: 2105.14450
Maximizing Parallelism in Distributed Training for Huge Neural Networks
30 May 2021
Zhengda Bian
Qifan Xu
Boxiang Wang
Yang You
MoE
Papers citing
"Maximizing Parallelism in Distributed Training for Huge Neural Networks"
28 / 28 papers shown
TrainVerify: Equivalence-Based Verification for Distributed LLM Training
Yunchi Lu
Youshan Miao
Cheng Tan
Peng Huang
Yi Zhu
Xian Zhang
Fan Yang
LRM
26
0
0
19 Jun 2025
You Don't Need All Attentions: Distributed Dynamic Fine-Tuning for Foundation Models
Shiwei Ding
Lan Zhang
Zhenlin Wang
Giuseppe Ateniese
Xiaoyong Yuan
68
0
0
16 Apr 2025
A Survey on Memory-Efficient Large-Scale Model Training in AI for Science
Kaiyuan Tian
Linbo Qiao
Baihui Liu
Gongqingjian Jiang
Dongsheng Li
106
0
0
21 Jan 2025
NeutronTP: Load-Balanced Distributed Full-Graph GNN Training with Tensor Parallelism
Xin Ai
Hao Yuan
Zeyu Ling
Qiange Wang
Yanfeng Zhang
Zhenbo Fu
Chaoyi Chen
Yu Gu
Ge Yu
GNN
156
1
0
29 Dec 2024
Comprehensive Performance Modeling and System Design Insights for Foundation Models
Shashank Subramanian
Ermal Rrapaj
Peter Harrington
Smeet Chheda
S. Farrell
Brian Austin
Samuel Williams
N. Wright
W. Bhimji
81
0
0
30 Sep 2024
LuWu: An End-to-End In-Network Out-of-Core Optimizer for 100B-Scale Model-in-Network Data-Parallel Training on Distributed GPUs
Mo Sun
Zihan Yang
Changyue Liao
Yingtao Li
Leilei Gan
Zeke Wang
108
1
0
02 Sep 2024
Mixed Sparsity Training: Achieving 4× FLOP Reduction for Transformer Pretraining
Pihe Hu
Shaolong Li
Longbo Huang
55
0
0
21 Aug 2024
Efficient Training of Large Language Models on Distributed Infrastructures: A Survey
Jiangfei Duan
Shuo Zhang
Zerui Wang
Lijuan Jiang
Wenwen Qu
...
Dahua Lin
Yonggang Wen
Xin Jin
Tianwei Zhang
Peng Sun
154
13
0
29 Jul 2024
LLMem: Estimating GPU Memory Usage for Fine-Tuning Pre-Trained LLMs
Taeho Kim
Yanming Wang
Vatshank Chaturvedi
Lokesh Gupta
Seyeon Kim
Yongin Kwon
Sangtae Ha
53
5
0
16 Apr 2024
Optimizing Malware Detection in IoT Networks: Leveraging Resource-Aware Distributed Computing for Enhanced Security
Sreenitha Kasarapu
Sanket Shukla
Sai Manoj P D
37
0
0
12 Apr 2024
Enhancing IoT Malware Detection through Adaptive Model Parallelism and Resource Optimization
Sreenitha Kasarapu
Sanket Shukla
Sai Manoj P D
42
1
0
12 Apr 2024
ZeroPP: Unleashing Exceptional Parallelism Efficiency through Tensor-Parallelism-Free Methodology
Ding Tang
Lijuan Jiang
Jiecheng Zhou
Minxi Jin
Hengjie Li
Xingcheng Zhang
Zhiling Pei
Jidong Zhai
96
3
0
06 Feb 2024
Training and Serving System of Foundation Models: A Comprehensive Survey
Jiahang Zhou
Yanyu Chen
Zicong Hong
Wuhui Chen
Yue Yu
Tao Zhang
Hui Wang
Chuan-fu Zhang
Zibin Zheng
ALM
94
11
0
05 Jan 2024
The Efficiency Spectrum of Large Language Models: An Algorithmic Survey
Tianyu Ding
Tianyi Chen
Haidong Zhu
Jiachen Jiang
Yiqi Zhong
Jinxin Zhou
Guangzhi Wang
Zhihui Zhu
Ilya Zharkov
Luming Liang
121
24
0
01 Dec 2023
Rethinking Memory and Communication Cost for Efficient Large Language Model Training
Chan Wu
Hanxiao Zhang
Lin Ju
Jinjing Huang
Youshao Xiao
...
Siyuan Li
Fanzhuang Meng
Lei Liang
Xiaolu Zhang
Jun Zhou
45
4
0
09 Oct 2023
DistSim: A performance model of large-scale hybrid distributed DNN training
Guandong Lu
Run Chen
Yakai Wang
Yangjie Zhou
Rui Zhang
...
Yanming Miao
Zhifang Cai
Li-Wei Li
Jingwen Leng
Minyi Guo
86
12
0
14 Jun 2023
A 4D Hybrid Algorithm to Scale Parallel Training to Thousands of GPUs
Siddharth Singh
Prajwal Singhania
Aditya K. Ranjan
Zack Sating
A. Bhatele
79
4
0
22 May 2023
On Efficient Training of Large-Scale Deep Learning Models: A Literature Review
Li Shen
Yan Sun
Zhiyuan Yu
Liang Ding
Xinmei Tian
Dacheng Tao
VLM
105
43
0
07 Apr 2023
OCCL: a Deadlock-free Library for GPU Collective Communication
Lichen Pan
Juncheng Liu
Jinhui Yuan
Rongkai Zhang
Pengze Li
Zhen Xiao
35
1
0
11 Mar 2023
A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training
Siddharth Singh
Olatunji Ruwase
A. A. Awan
Samyam Rajbhandari
Yuxiong He
A. Bhatele
MoE
103
36
0
11 Mar 2023
Colossal-Auto: Unified Automation of Parallelization and Activation Checkpoint for Large-scale Models
Yuliang Liu
Shenggui Li
Jiarui Fang
Yan Shao
Boyuan Yao
Yang You
OffRL
83
7
0
06 Feb 2023
AutoDDL: Automatic Distributed Deep Learning with Near-Optimal Bandwidth Cost
Jinfan Chen
Shigang Li
Ran Guo
Jinhui Yuan
Torsten Hoefler
57
2
0
17 Jan 2023
TAPS: Topology-Aware Intra-Operator Parallelism Strategy Searching Algorithm for Deep Neural Networks
Peng Liang
Hao Zheng
Teng Su
Linbo Qiao
Dongsheng Li
48
0
0
11 Jan 2023
Dive into Big Model Training
Qinghua Liu
Yuxiang Jiang
MoMe
AI4CE
LRM
33
3
0
25 Jul 2022
Merak: An Efficient Distributed DNN Training Framework with Automated 3D Parallelism for Giant Foundation Models
Zhiquan Lai
Shengwei Li
Xudong Tang
Ke-shi Ge
Weijie Liu
Yabo Duan
Linbo Qiao
Dongsheng Li
91
46
0
10 Jun 2022
FuncPipe: A Pipelined Serverless Framework for Fast and Cost-efficient Training of Deep Learning Models
Yunzhuo Liu
Bo Jiang
Tian Guo
Zimeng Huang
Wen-ping Ma
Xinbing Wang
Chenghu Zhou
73
9
0
28 Apr 2022
Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training
Yongbin Li
Hongxin Liu
Zhengda Bian
Boxiang Wang
Haichen Huang
Fan Cui
Chuan-Qing Wang
Yang You
GNN
111
149
0
28 Oct 2021
PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Memory Management
Jiarui Fang
Zilin Zhu
Shenggui Li
Hui Su
Yang Yu
Jie Zhou
Yang You
VLM
105
25
0
12 Aug 2021