Memory-Efficient Pipeline-Parallel DNN Training
Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, Matei A. Zaharia [MoE] · 16 June 2020
arXiv: 2006.09503

Papers citing "Memory-Efficient Pipeline-Parallel DNN Training" (50 of 109 papers shown)

Large Language Model Partitioning for Low-Latency Inference at the Edge
Dimitrios Kafetzis, Ramin Khalili, Iordanis Koutsopoulos · 24 / 0 / 0 · 05 May 2025

Nesterov Method for Asynchronous Pipeline Parallel Optimization
Thalaiyasingam Ajanthan, Sameera Ramasinghe, Yan Zuo, Gil Avraham, Alexander Long · 24 / 0 / 0 · 02 May 2025

Galvatron: An Automatic Distributed System for Efficient Foundation Model Training
Xinyi Liu, Y. Wang, Shenhan Zhu, Fangcheng Fu, Qingshuo Liu, Guangming Lin, Bin Cui [GNN] · 140 / 0 / 0 · 30 Apr 2025

MoE Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training with Megatron Core
Dennis Liu, Zijie Yan, Xin Yao, Tong Liu, V. Korthikanti, ..., Jiajie Yao, Chandler Zhou, David Wu, Xipeng Li, J. Yang [MoE] · 65 / 0 / 0 · 21 Apr 2025

Mist: Efficient Distributed Training of Large Language Models via Memory-Parallelism Co-Optimization
Zhanda Zhu, Christina Giannoula, Muralidhar Andoorveedu, Qidong Su, Karttikeya Mangalam, Bojian Zheng, Gennady Pekhimenko [VLM, MoE] · 51 / 0 / 0 · 24 Mar 2025

ATTENTION2D: Communication Efficient Distributed Self-Attention Mechanism
Venmugil Elango · 48 / 0 / 0 · 20 Mar 2025

Ferret: An Efficient Online Continual Learning Framework under Varying Memory Constraints
Yuhao Zhou, Yuxin Tian, Jindi Lv, Mingjia Shi, Yuanxi Li, Qing Ye, Shuhao Zhang, Jiancheng Lv [CLL] · 72 / 0 / 0 · 15 Mar 2025

Automatic Operator-level Parallelism Planning for Distributed Deep Learning -- A Mixed-Integer Programming Approach
Ruifeng She, Bowen Pang, Kai Li, Zehua Liu, Tao Zhong · 61 / 0 / 0 · 12 Mar 2025

MoETuner: Optimized Mixture of Expert Serving with Balanced Expert Placement and Token Routing
Seokjin Go, Divya Mahajan [MoE] · 67 / 0 / 0 · 10 Feb 2025

A Survey on Memory-Efficient Large-Scale Model Training in AI for Science
Kaiyuan Tian, Linbo Qiao, Baihui Liu, Gongqingjian Jiang, Dongsheng Li · 33 / 0 / 0 · 21 Jan 2025

Prediction-Assisted Online Distributed Deep Learning Workload Scheduling in GPU Clusters
Ziyue Luo, Jia-Wei Liu, Myungjin Lee, Ness B. Shroff · 41 / 0 / 0 · 09 Jan 2025

FlexSP: Accelerating Large Language Model Training via Flexible Sequence Parallelism
Y. Wang, Shiju Wang, Shenhan Zhu, Fangcheng Fu, Xinyi Liu, Xuefeng Xiao, Huixia Li, Jiashi Li, Faming Wu, Bin Cui · 93 / 3 / 0 · 02 Dec 2024

Accelerating Large Language Model Training with 4D Parallelism and Memory Consumption Estimator
Kazuki Fujii, Kohei Watanabe, Rio Yokota · 32 / 0 / 0 · 10 Nov 2024

Acceleration for Deep Reinforcement Learning using Parallel and Distributed Computing: A Survey
Zhihong Liu, Xin Xu, Peng Qiao, Dongsheng Li [OffRL] · 22 / 2 / 0 · 08 Nov 2024

Cephalo: Harnessing Heterogeneous GPU Clusters for Training Transformer Models
Runsheng Benson Guo, Utkarsh Anand, Arthur Chen, Khuzaima Daudjee · 42 / 1 / 0 · 01 Nov 2024

BitPipe: Bidirectional Interleaved Pipeline Parallelism for Accelerating Large Models Training
Houming Wu, Ling Chen, Wenjie Yu [AI4CE] · 17 / 0 / 0 · 25 Oct 2024

TiMePReSt: Time and Memory Efficient Pipeline Parallel DNN Training with Removed Staleness
Ankita Dutta, Nabendu Chaki, Rajat K. De · 27 / 0 / 0 · 18 Oct 2024

Malleus: Straggler-Resilient Hybrid Parallel Training of Large-scale Models via Malleable Data and Model Parallelization
Haoyang Li, Fangcheng Fu, Hao Ge, Sheng Lin, Xuanyu Wang, Jiawen Niu, Y. Wang, Hailin Zhang, Xiaonan Nie, Bin Cui [MoMe] · 33 / 2 / 0 · 17 Oct 2024

FreeRide: Harvesting Bubbles in Pipeline Parallelism
Jiashu Zhang, Zihan Pan, Molly Xu, Khuzaima S. Daudjee · 90 / 0 / 0 · 11 Sep 2024

LuWu: An End-to-End In-Network Out-of-Core Optimizer for 100B-Scale Model-in-Network Data-Parallel Training on Distributed GPUs
Mo Sun, Zihan Yang, Changyue Liao, Yingtao Li, Fei Wu, Zeke Wang · 54 / 1 / 0 · 02 Sep 2024

Performance Modeling and Workload Analysis of Distributed Large Language Model Training and Inference
Joyjit Kundu, Wenzhe Guo, Ali BanaGozar, Udari De Alwis, Sourav Sengupta, Puneet Gupta, Arindam Mallik · 37 / 3 / 0 · 19 Jul 2024

Integrated Hardware Architecture and Device Placement Search
Irene Wang, Jakub Tarnawski, Amar Phanishayee, Divya Mahajan · 33 / 1 / 0 · 18 Jul 2024

GraphPipe: Improving Performance and Scalability of DNN Training with Graph Pipeline Parallelism
Byungsoo Jeon, Mengdi Wu, Shiyi Cao, Sunghyun Kim, Sunghyun Park, ..., Xupeng Miao, Mohammad Alizadeh, G. R. Ganger, Tianqi Chen, Zhihao Jia [GNN, AI4CE] · 61 / 5 / 0 · 24 Jun 2024

Optimizing Large Model Training through Overlapped Activation Recomputation
Ping Chen, Wenjie Zhang, Shuibing He, Yingjie Gu, Zhuwei Peng, ..., Yi Zheng, Zhefeng Wang, Yanlong Yin, Gang Chen, Gang Chen · 35 / 5 / 0 · 13 Jun 2024

Seq1F1B: Efficient Sequence-Level Pipeline Parallelism for Large Language Model Training
Ao Sun, Weilin Zhao, Xu Han, Cheng Yang, Zhiyuan Liu, Chuan Shi, Maosong Sun · 31 / 7 / 0 · 05 Jun 2024

PETRA: Parallel End-to-end Training with Reversible Architectures
Stephane Rivaud, Louis Fournier, Thomas Pumir, Eugene Belilovsky, Michael Eickenberg, Edouard Oyallon · 23 / 0 / 0 · 04 Jun 2024

ACCO: Accumulate While You Communicate for Communication-Overlapped Sharded LLM Training
Adel Nabli, Louis Fournier, Pierre Erbacher, Louis Serrano, Eugene Belilovsky, Edouard Oyallon [FedML] · 46 / 1 / 0 · 03 Jun 2024

KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation
Minsik Cho, Mohammad Rastegari, Devang Naik · 32 / 4 / 0 · 08 May 2024

Continual Pre-Training for Cross-Lingual LLM Adaptation: Enhancing Japanese Language Capabilities
Kazuki Fujii, Taishi Nakamura, Mengsay Loem, Hiroki Iida, Masanari Ohi, Kakeru Hattori, Hirai Shota, Sakae Mizuki, Rio Yokota, Naoaki Okazaki [CLL] · 41 / 53 / 0 · 27 Apr 2024

Workload-Aware Hardware Accelerator Mining for Distributed Deep Learning Training
Muhammad Adnan, Amar Phanishayee, Janardhan Kulkarni, Prashant J. Nair, Divyat Mahajan · 39 / 0 / 0 · 23 Apr 2024

Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances
Jiangfei Duan, Ziang Song, Xupeng Miao, Xiaoli Xi, Dahua Lin, Harry Xu, Minjia Zhang, Zhihao Jia · 46 / 10 / 0 · 21 Mar 2024

DiPaCo: Distributed Path Composition
Arthur Douillard, Qixuang Feng, Andrei A. Rusu, A. Kuncoro, Yani Donchev, Rachita Chhaparia, Ionel Gog, Marc'Aurelio Ranzato, Jiajun Shen, Arthur Szlam [MoE] · 48 / 2 / 0 · 15 Mar 2024

Cyclic Data Parallelism for Efficient Parallelism of Deep Neural Networks
Louis Fournier, Edouard Oyallon · 40 / 0 / 0 · 13 Mar 2024

PartIR: Composing SPMD Partitioning Strategies for Machine Learning
Sami Alabed, Daniel Belov, Bart Chrzaszcz, Juliana Franco, Dominik Grewe, ..., Michael Schaarschmidt, Timur Sitdikov, Agnieszka Swietlik, Dimitrios Vytiniotis, Joel Wee · 28 / 3 / 0 · 20 Jan 2024

xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein
Bo Chen, Xingyi Cheng, Pan Li, Yangli-ao Geng, Jing Gong, ..., Chiming Liu, Aohan Zeng, Yuxiao Dong, Jie Tang, Leo T. Song · 42 / 101 / 0 · 11 Jan 2024

Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
Bin Lin, Chen Zhang, Tao Peng, Hanyu Zhao, Wencong Xiao, ..., Shen Li, Zhigang Ji, Tao Xie, Yong Li, Wei Lin · 44 / 46 / 0 · 05 Jan 2024

Training and Serving System of Foundation Models: A Comprehensive Survey
Jiahang Zhou, Yanyu Chen, Zicong Hong, Wuhui Chen, Yue Yu, Tao Zhang, Hui Wang, Chuan-fu Zhang, Zibin Zheng [ALM] · 32 / 5 / 0 · 05 Jan 2024

Re-evaluating the Memory-balanced Pipeline Parallelism: BPipe
Mincong Huang, Chao Wang, Chi Ma, Yineng Zhang, Peng Zhang, Lei Yu · 25 / 1 / 0 · 04 Jan 2024

Ravnest: Decentralized Asynchronous Training on Heterogeneous Devices
A. Menon, Unnikrishnan Menon, Kailash Ahirwar · 21 / 1 / 0 · 03 Jan 2024

Unicron: Economizing Self-Healing LLM Training at Scale
Tao He, Xue Li, Zhibin Wang, Kun Qian, Jingbo Xu, Wenyuan Yu, Jingren Zhou · 19 / 14 / 0 · 30 Dec 2023

Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems
Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia · 65 / 76 / 0 · 23 Dec 2023

Understanding the Potential of FPGA-Based Spatial Acceleration for Large Language Model Inference
Hongzheng Chen, Jiahao Zhang, Yixiao Du, Shaojie Xiang, Zichao Yue, Niansong Zhang, Yaohui Cai, Zhiru Zhang · 53 / 34 / 0 · 23 Dec 2023

EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism
Yanxi Chen, Xuchen Pan, Yaliang Li, Bolin Ding, Jingren Zhou [LRM] · 41 / 31 / 0 · 08 Dec 2023

Moirai: Towards Optimal Placement for Distributed Inference on Heterogeneous Devices
Beibei Zhang, Hongwei Zhu, Feng Gao, Zhihui Yang, Xiaoyang Sean Wang · 29 / 1 / 0 · 07 Dec 2023

Holmes: Towards Distributed Training Across Clusters with Heterogeneous NIC Environment
Fei Yang, Shuang Peng, Ning Sun, Fangyu Wang, Ke Tan, Fu Wu, Jiezhong Qiu, Aimin Pan · 22 / 4 / 0 · 06 Dec 2023

The Efficiency Spectrum of Large Language Models: An Algorithmic Survey
Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang · 27 / 22 / 0 · 01 Dec 2023

PipeOptim: Ensuring Effective 1F1B Schedule with Optimizer-Dependent Weight Prediction
Lei Guan, Dongsheng Li, Jiye Liang, Wenjian Wang, Xicheng Lu · 30 / 1 / 0 · 01 Dec 2023

vTrain: A Simulation Framework for Evaluating Cost-effective and Compute-optimal Large Language Model Training
Jehyeon Bang, Yujeong Choi, Myeongwoo Kim, Yongdeok Kim, Minsoo Rhu · 27 / 15 / 0 · 27 Nov 2023

HongTu: Scalable Full-Graph GNN Training on Multiple GPUs (via communication-optimized CPU data offloading)
Qiange Wang, Yao Chen, Weng-Fai Wong, Bingsheng He [GNN] · 23 / 9 / 0 · 25 Nov 2023

AccEPT: An Acceleration Scheme for Speeding Up Edge Pipeline-parallel Training
Yuhao Chen, Yuxuan Yan, Qianqian Yang, Yuanchao Shu, Shibo He, Zhiguo Shi, Jiming Chen · 35 / 0 / 0 · 10 Nov 2023