ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning

International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2021
16 April 2021
Samyam Rajbhandari
Olatunji Ruwase
Jeff Rasley
Shaden Smith
Yuxiong He
GNN
arXiv:2104.07857

Papers citing "ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning"

50 / 235 papers shown
ZeRO++: Extremely Efficient Collective Communication for Giant Model Training
Guanhua Wang
Heyang Qin
S. A. Jacobs
Connor Holmes
Samyam Rajbhandari
Olatunji Ruwase
Feng Yan
Lei Yang
Yuxiong He
VLM
222
78
0
16 Jun 2023
Full Parameter Fine-tuning for Large Language Models with Limited Resources
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Kai Lv
Yuqing Yang
Tengxiao Liu
Qi-jie Gao
Qipeng Guo
Xipeng Qiu
330
186
0
16 Jun 2023
Proteus: Simulating the Performance of Distributed DNN Training
IEEE Transactions on Parallel and Distributed Systems (TPDS), 2023
Jiangfei Duan
Xiuhong Li
Ping Xu
Xingcheng Zhang
Shengen Yan
Yun Liang
Dahua Lin
223
13
0
04 Jun 2023
Adam Accumulation to Reduce Memory Footprints of both Activations and Gradients for Large-scale DNN Training
European Conference on Artificial Intelligence (ECAI), 2023
Yijia Zhang
Yibo Han
Shijie Cao
Guohao Dai
Youshan Miao
Ting Cao
Fan Yang
Ningyi Xu
118
5
0
31 May 2023
Automated Tensor Model Parallelism with Overlapped Communication for Efficient Foundation Model Training
IEEE Transactions on Parallel and Distributed Systems (TPDS), 2023
Shengwei Li
Zhiquan Lai
Yanqi Hao
Weijie Liu
Ke-shi Ge
Xiaoge Deng
Dongsheng Li
KaiCheng Lu
175
11
0
25 May 2023
Scaling Speech Technology to 1,000+ Languages
Journal of Machine Learning Research (JMLR), 2023
Vineel Pratap
Andros Tjandra
Bowen Shi
Paden Tomasello
Arun Babu
...
Yossi Adi
Xiaohui Zhang
Wei-Ning Hsu
Alexis Conneau
Michael Auli
VLM
391
522
0
22 May 2023
OSDP: Optimal Sharded Data Parallel for Distributed Deep Learning
International Joint Conference on Artificial Intelligence (IJCAI), 2022
Youhe Jiang
Fangcheng Fu
Xupeng Miao
Xiaonan Nie
Tengjiao Wang
217
14
0
17 May 2023
DisCo-CLIP: A Distributed Contrastive Loss for Memory Efficient CLIP Training
Computer Vision and Pattern Recognition (CVPR), 2023
Yihao Chen
Xianbiao Qi
Jianan Wang
Lei Zhang
175
24
0
17 Apr 2023
On Efficient Training of Large-Scale Deep Learning Models: A Literature Review
Li Shen
Yan Sun
Zhiyuan Yu
Liang Ding
Xinmei Tian
Dacheng Tao
VLM
296
51
0
07 Apr 2023
The Online Pause and Resume Problem: Optimal Algorithms and An Application to Carbon-Aware Load Shifting
Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS), 2023
Adam Lechowicz
Nicolas H. Christianson
Jinhang Zuo
Noman Bashir
Mohammad Hajiesmaili
Adam Wierman
Prashant J. Shenoy
175
28
0
30 Mar 2023
ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale
IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2023
William Won
Taekyung Heo
Saeed Rashidi
Srinivas Sridharan
Sudarshan Srinivasan
T. Krishna
142
83
0
24 Mar 2023
FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU
International Conference on Machine Learning (ICML), 2023
Ying Sheng
Lianmin Zheng
Binhang Yuan
Zhuohan Li
Max Ryabinin
...
Joseph E. Gonzalez
Abigail Z. Jacobs
Christopher Ré
Ion Stoica
Ce Zhang
451
575
0
13 Mar 2023
A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training
International Conference on Supercomputing (ICS), 2023
Siddharth Singh
Olatunji Ruwase
A. A. Awan
Samyam Rajbhandari
Yuxiong He
A. Bhatele
MoE
218
64
0
11 Mar 2023
Angel-PTM: A Scalable and Economical Large-scale Pre-training System in Tencent
Proceedings of the VLDB Endowment (PVLDB), 2023
Xiaonan Nie
Yi Liu
Fangcheng Fu
Jinbao Xue
Dian Jiao
Xupeng Miao
Yangyu Tao
Tengjiao Wang
MoE
205
24
0
06 Mar 2023
SWIFT: Expedited Failure Recovery for Large-scale DNN Training
IEEE Transactions on Parallel and Distributed Systems (TPDS), 2023
Keon Jang
Hassan M. G. Wassel
Behnam Montazeri
Michael Ryan
David Wetherall
162
17
0
13 Feb 2023
Exploiting Sparsity in Pruned Neural Networks to Optimize Large Model Training
IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2023
Siddharth Singh
A. Bhatele
275
10
0
10 Feb 2023
Computation vs. Communication Scaling for Future Transformers on Future Hardware
Suchita Pati
Shaizeen Aga
Mahzabeen Islam
Nuwan Jayasena
Matthew D. Sinclair
262
14
0
06 Feb 2023
Colossal-Auto: Unified Automation of Parallelization and Activation Checkpoint for Large-scale Models
Yuliang Liu
Shenggui Li
Jiarui Fang
Yan Shao
Boyuan Yao
Yang You
OffRL
216
11
0
06 Feb 2023
SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient
International Conference on Machine Learning (ICML), 2023
Max Ryabinin
Tim Dettmers
Michael Diskin
Alexander Borzunov
MoE
365
55
0
27 Jan 2023
Optimus-CC: Efficient Large NLP Model Training with 3D Parallelism Aware Communication Compression
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2023
Jaeyong Song
Jinkyu Yim
Jaewon Jung
Hongsun Jang
H. Kim
Youngsok Kim
Jinho Lee
GNN
274
39
0
24 Jan 2023
ATP: Adaptive Tensor Parallelism for Foundation Models
Shenggan Cheng
Ziming Liu
Jiangsu Du
Yang You
138
11
0
20 Jan 2023
Systems for Parallel and Distributed Large-Model Deep Learning Training
Kabir Nagrecha
GNN, VLM, MoE
154
9
0
06 Jan 2023
Elixir: Train a Large Language Model on a Small GPU Cluster
Haichen Huang
Jiarui Fang
Hongxin Liu
Shenggui Li
Yang You
VLM
250
10
0
10 Dec 2022
Deep Incubation: Training Large Models by Divide-and-Conquering
IEEE International Conference on Computer Vision (ICCV), 2022
Zanlin Ni
Yulin Wang
Jiangwei Yu
Haojun Jiang
Yu Cao
Gao Huang
VLM
243
13
0
08 Dec 2022
COMET: A Comprehensive Cluster Design Methodology for Distributed Deep Learning Training
D. Kadiyala
Saeed Rashidi
Taekyung Heo
Abhimanyu Bambhaniya
T. Krishna
Alexandros Daglis
VLM
172
11
0
30 Nov 2022
PipeFisher: Efficient Training of Large Language Models Using Pipelining and Fisher Information Matrices
Conference on Machine Learning and Systems (MLSys), 2022
Kazuki Osawa
Shigang Li
Torsten Hoefler
AI4CE
206
36
0
25 Nov 2022
Breadth-First Pipeline Parallelism
J. Lamy-Poirier
GNN, MoE, AI4CE
121
1
0
11 Nov 2022
On Optimizing the Communication of Model Parallelism
Conference on Machine Learning and Systems (MLSys), 2022
Yonghao Zhuang
Hexu Zhao
Lianmin Zheng
Zhuohan Li
Eric P. Xing
Qirong Ho
Joseph E. Gonzalez
Ion Stoica
Haotong Zhang
197
40
0
10 Nov 2022
Tempo: Accelerating Transformer-Based Model Training through Memory Footprint Reduction
Neural Information Processing Systems (NeurIPS), 2022
Muralidhar Andoorveedu
Zhanda Zhu
Bojian Zheng
Gennady Pekhimenko
185
8
0
19 Oct 2022
Mimose: An Input-Aware Checkpointing Planner for Efficient Training on GPU
Jian-He Liao
Mingzhen Li
Qingxiao Sun
Jiwei Hao
F. Yu
...
Ye Tao
Zicheng Zhang
Hailong Yang
Zhongzhi Luan
D. Qian
146
4
0
06 Sep 2022
Petals: Collaborative Inference and Fine-tuning of Large Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Alexander Borzunov
Dmitry Baranchuk
Tim Dettmers
Max Ryabinin
Younes Belkada
Artem Chumachenko
Pavel Samygin
Colin Raffel
VLM
224
95
0
02 Sep 2022
Training a T5 Using Lab-sized Resources
Manuel R. Ciosici
Leon Derczynski
VLM
178
8
0
25 Aug 2022
PromptFL: Let Federated Participants Cooperatively Learn Prompts Instead of Models -- Federated Learning in Age of Foundation Model
IEEE Transactions on Mobile Computing (IEEE TMC), 2022
Tao Guo
Song Guo
Junxiao Wang
Wenchao Xu
FedML, VLM, LRM
197
189
0
24 Aug 2022
Multimodal foundation models are better simulators of the human brain
Haoyu Lu
Qiongyi Zhou
Nanyi Fei
Zhiwu Lu
Mingyu Ding
...
Changde Du
Xin Zhao
Haoran Sun
Huiguang He
J. Wen
AI4CE
172
19
0
17 Aug 2022
PolarFly: A Cost-Effective and Flexible Low-Diameter Topology
International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2022
Kartik Lakhotia
Maciej Besta
Laura Monroe
K. Isham
Patrick Iff
Torsten Hoefler
Fabrizio Petrini
351
28
0
02 Aug 2022
Dive into Big Model Training
Qinghua Liu
Yuxiang Jiang
MoMe, AI4CE, LRM
108
3
0
25 Jul 2022
Machine Learning Model Sizes and the Parameter Gap
Pablo Villalobos
J. Sevilla
T. Besiroglu
Lennart Heim
A. Ho
Marius Hobbhahn
ALM, ELM, AI4CE
189
78
0
05 Jul 2022
DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale
International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2022
Reza Yazdani Aminabadi
Samyam Rajbhandari
Minjia Zhang
A. A. Awan
Cheng-rong Li
...
Elton Zheng
Jeff Rasley
Shaden Smith
Olatunji Ruwase
Yuxiong He
408
506
0
30 Jun 2022
RevBiFPN: The Fully Reversible Bidirectional Feature Pyramid Network
Conference on Machine Learning and Systems (MLSys), 2022
Vitaliy Chiley
Vithursan Thangarasa
Abhay Gupta
Anshul Samar
Joel Hestness
D. DeCoste
193
13
0
28 Jun 2022
LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models
International Conference on Learning Representations (ICLR), 2022
Gunho Park
Baeseong Park
Minsub Kim
Sungjae Lee
Jeonghoon Kim
Beomseok Kwon
S. Kwon
Byeongwook Kim
Youngjoo Lee
Dongsoo Lee
MQ
447
113
0
20 Jun 2022
Merak: An Efficient Distributed DNN Training Framework with Automated 3D Parallelism for Giant Foundation Models
IEEE Transactions on Parallel and Distributed Systems (TPDS), 2022
Zhiquan Lai
Shengwei Li
Xudong Tang
Ke-shi Ge
Weijie Liu
Yabo Duan
Linbo Qiao
Dongsheng Li
316
61
0
10 Jun 2022
A New Frontier of AI: On-Device AI Training and Personalization
Jijoong Moon
Parichay Kapoor
Ji Hoon Lee
Donghak Park
Seungbaek Hong
Hyungyu Lee
Donghyeon Jeong
Sungsik Kong
MyungJoo Ham
171
4
0
09 Jun 2022
Can Foundation Models Help Us Achieve Perfect Secrecy?
Simran Arora
Christopher Ré
FedML
245
12
0
27 May 2022
Reducing Activation Recomputation in Large Transformer Models
Conference on Machine Learning and Systems (MLSys), 2022
V. Korthikanti
Jared Casper
Sangkug Lym
Lawrence C. McAfee
M. Andersch
Mohammad Shoeybi
Bryan Catanzaro
AI4CE
300
385
0
10 May 2022
Training Personalized Recommendation Systems from (GPU) Scratch: Look Forward not Backwards
International Symposium on Computer Architecture (ISCA), 2022
Youngeun Kwon
Minsoo Rhu
143
30
0
10 May 2022
MiCS: Near-linear Scaling for Training Gigantic Model on Public Cloud
Proceedings of the VLDB Endowment (PVLDB), 2022
Zhen Zhang
Shuai Zheng
Yida Wang
Justin Chiu
George Karypis
Trishul Chilimbi
Mu Li
Xin Jin
451
47
0
30 Apr 2022
PaLM: Scaling Language Modeling with Pathways
Journal of Machine Learning Research (JMLR), 2022
Aakanksha Chowdhery
Sharan Narang
Jacob Devlin
Maarten Bosma
Gaurav Mishra
...
Kathy Meier-Hellstern
Douglas Eck
J. Dean
Slav Petrov
Noah Fiedel
PILM, LRM
1.2K
7,457
0
05 Apr 2022
DELTA: Dynamically Optimizing GPU Memory beyond Tensor Recomputation
Yu Tang
Chenyu Wang
Yufan Zhang
Yuliang Liu
Xingcheng Zhang
Linbo Qiao
Zhiquan Lai
Dongsheng Li
224
6
0
30 Mar 2022
Pathways: Asynchronous Distributed Dataflow for ML
Conference on Machine Learning and Systems (MLSys), 2022
P. Barham
Aakanksha Chowdhery
J. Dean
Sanjay Ghemawat
Steven Hand
...
Parker Schuh
Ryan Sepassi
Laurent El Shafey
C. A. Thekkath
Yonghui Wu
GNN, MoE
296
145
0
23 Mar 2022
DHEN: A Deep and Hierarchical Ensemble Network for Large-Scale Click-Through Rate Prediction
Buyun Zhang
Liangchen Luo
Xi Liu
Jay Li
Zeliang Chen
...
Yasmine Badr
Jongsoo Park
Jiyan Yang
Dheevatsa Mudigere
Ellie Wen
3DV
149
12
0
11 Mar 2022