ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning (arXiv:2104.07857)
International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2021
16 April 2021
Samyam Rajbhandari
Olatunji Ruwase
Jeff Rasley
Shaden Smith
Yuxiong He
GNN
Papers citing
"ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning"
50 / 235 papers shown
ZeRO++: Extremely Efficient Collective Communication for Giant Model Training
Guanhua Wang
Heyang Qin
S. A. Jacobs
Connor Holmes
Samyam Rajbhandari
Olatunji Ruwase
Feng Yan
Lei Yang
Yuxiong He
VLM
222
78
0
16 Jun 2023
Full Parameter Fine-tuning for Large Language Models with Limited Resources
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Kai Lv
Yuqing Yang
Tengxiao Liu
Qi-jie Gao
Qipeng Guo
Xipeng Qiu
330
186
0
16 Jun 2023
Proteus: Simulating the Performance of Distributed DNN Training
IEEE Transactions on Parallel and Distributed Systems (TPDS), 2023
Jiangfei Duan
Xiuhong Li
Ping Xu
Xingcheng Zhang
Shengen Yan
Yun Liang
Dahua Lin
223
13
0
04 Jun 2023
Adam Accumulation to Reduce Memory Footprints of both Activations and Gradients for Large-scale DNN Training
European Conference on Artificial Intelligence (ECAI), 2023
Yijia Zhang
Yibo Han
Shijie Cao
Guohao Dai
Youshan Miao
Ting Cao
Fan Yang
Ningyi Xu
118
5
0
31 May 2023
Automated Tensor Model Parallelism with Overlapped Communication for Efficient Foundation Model Training
IEEE Transactions on Parallel and Distributed Systems (TPDS), 2023
Shengwei Li
Zhiquan Lai
Yanqi Hao
Weijie Liu
Ke-shi Ge
Xiaoge Deng
Dongsheng Li
KaiCheng Lu
175
11
0
25 May 2023
Scaling Speech Technology to 1,000+ Languages
Journal of Machine Learning Research (JMLR), 2023
Vineel Pratap
Andros Tjandra
Bowen Shi
Paden Tomasello
Arun Babu
...
Yossi Adi
Xiaohui Zhang
Wei-Ning Hsu
Alexis Conneau
Michael Auli
VLM
391
522
0
22 May 2023
OSDP: Optimal Sharded Data Parallel for Distributed Deep Learning
International Joint Conference on Artificial Intelligence (IJCAI), 2022
Youhe Jiang
Fangcheng Fu
Xupeng Miao
Xiaonan Nie
Tengjiao Wang
217
14
0
17 May 2023
DisCo-CLIP: A Distributed Contrastive Loss for Memory Efficient CLIP Training
Computer Vision and Pattern Recognition (CVPR), 2023
Yihao Chen
Xianbiao Qi
Jianan Wang
Lei Zhang
175
24
0
17 Apr 2023
On Efficient Training of Large-Scale Deep Learning Models: A Literature Review
Li Shen
Yan Sun
Zhiyuan Yu
Liang Ding
Xinmei Tian
Dacheng Tao
VLM
296
51
0
07 Apr 2023
The Online Pause and Resume Problem: Optimal Algorithms and An Application to Carbon-Aware Load Shifting
Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS), 2023
Adam Lechowicz
Nicolas H. Christianson
Jinhang Zuo
Noman Bashir
Mohammad Hajiesmaili
Adam Wierman
Prashant J. Shenoy
175
28
0
30 Mar 2023
ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale
IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2023
William Won
Taekyung Heo
Saeed Rashidi
Srinivas Sridharan
Sudarshan Srinivasan
T. Krishna
142
83
0
24 Mar 2023
FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU
International Conference on Machine Learning (ICML), 2023
Ying Sheng
Lianmin Zheng
Binhang Yuan
Zhuohan Li
Max Ryabinin
...
Joseph E. Gonzalez
Abigail Z. Jacobs
Christopher Ré
Ion Stoica
Ce Zhang
451
575
0
13 Mar 2023
A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training
International Conference on Supercomputing (ICS), 2023
Siddharth Singh
Olatunji Ruwase
A. A. Awan
Samyam Rajbhandari
Yuxiong He
A. Bhatele
MoE
218
64
0
11 Mar 2023
Angel-PTM: A Scalable and Economical Large-scale Pre-training System in Tencent
Proceedings of the VLDB Endowment (PVLDB), 2023
Xiaonan Nie
Yi Liu
Fangcheng Fu
Jinbao Xue
Dian Jiao
Xupeng Miao
Yangyu Tao
Tengjiao Wang
MoE
205
24
0
06 Mar 2023
SWIFT: Expedited Failure Recovery for Large-scale DNN Training
IEEE Transactions on Parallel and Distributed Systems (TPDS), 2023
Keon Jang
Hassan M. G. Wassel
Behnam Montazeri
Michael Ryan
David Wetherall
162
17
0
13 Feb 2023
Exploiting Sparsity in Pruned Neural Networks to Optimize Large Model Training
IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2023
Siddharth Singh
A. Bhatele
275
10
0
10 Feb 2023
Computation vs. Communication Scaling for Future Transformers on Future Hardware
Suchita Pati
Shaizeen Aga
Mahzabeen Islam
Nuwan Jayasena
Matthew D. Sinclair
262
14
0
06 Feb 2023
Colossal-Auto: Unified Automation of Parallelization and Activation Checkpoint for Large-scale Models
Yuliang Liu
Shenggui Li
Jiarui Fang
Yan Shao
Boyuan Yao
Yang You
OffRL
216
11
0
06 Feb 2023
SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient
International Conference on Machine Learning (ICML), 2023
Max Ryabinin
Tim Dettmers
Michael Diskin
Alexander Borzunov
MoE
365
55
0
27 Jan 2023
Optimus-CC: Efficient Large NLP Model Training with 3D Parallelism Aware Communication Compression
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2023
Jaeyong Song
Jinkyu Yim
Jaewon Jung
Hongsun Jang
H. Kim
Youngsok Kim
Jinho Lee
GNN
274
39
0
24 Jan 2023
ATP: Adaptive Tensor Parallelism for Foundation Models
Shenggan Cheng
Ziming Liu
Jiangsu Du
Yang You
138
11
0
20 Jan 2023
Systems for Parallel and Distributed Large-Model Deep Learning Training
Kabir Nagrecha
GNN
VLM
MoE
154
9
0
06 Jan 2023
Elixir: Train a Large Language Model on a Small GPU Cluster
Haichen Huang
Jiarui Fang
Hongxin Liu
Shenggui Li
Yang You
VLM
250
10
0
10 Dec 2022
Deep Incubation: Training Large Models by Divide-and-Conquering
IEEE International Conference on Computer Vision (ICCV), 2022
Zanlin Ni
Yulin Wang
Jiangwei Yu
Haojun Jiang
Yu Cao
Gao Huang
VLM
243
13
0
08 Dec 2022
COMET: A Comprehensive Cluster Design Methodology for Distributed Deep Learning Training
D. Kadiyala
Saeed Rashidi
Taekyung Heo
Abhimanyu Bambhaniya
T. Krishna
Alexandros Daglis
VLM
172
11
0
30 Nov 2022
PipeFisher: Efficient Training of Large Language Models Using Pipelining and Fisher Information Matrices
Conference on Machine Learning and Systems (MLSys), 2022
Kazuki Osawa
Shigang Li
Torsten Hoefler
AI4CE
206
36
0
25 Nov 2022
Breadth-First Pipeline Parallelism
J. Lamy-Poirier
GNN
MoE
AI4CE
121
1
0
11 Nov 2022
On Optimizing the Communication of Model Parallelism
Conference on Machine Learning and Systems (MLSys), 2022
Yonghao Zhuang
Hexu Zhao
Lianmin Zheng
Zhuohan Li
Eric P. Xing
Qirong Ho
Joseph E. Gonzalez
Ion Stoica
Haotong Zhang
197
40
0
10 Nov 2022
Tempo: Accelerating Transformer-Based Model Training through Memory Footprint Reduction
Neural Information Processing Systems (NeurIPS), 2022
Muralidhar Andoorveedu
Zhanda Zhu
Bojian Zheng
Gennady Pekhimenko
185
8
0
19 Oct 2022
Mimose: An Input-Aware Checkpointing Planner for Efficient Training on GPU
Jian-He Liao
Mingzhen Li
Qingxiao Sun
Jiwei Hao
F. Yu
...
Ye Tao
Zicheng Zhang
Hailong Yang
Zhongzhi Luan
D. Qian
146
4
0
06 Sep 2022
Petals: Collaborative Inference and Fine-tuning of Large Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Alexander Borzunov
Dmitry Baranchuk
Tim Dettmers
Max Ryabinin
Younes Belkada
Artem Chumachenko
Pavel Samygin
Colin Raffel
VLM
224
95
0
02 Sep 2022
Training a T5 Using Lab-sized Resources
Manuel R. Ciosici
Leon Derczynski
VLM
178
8
0
25 Aug 2022
PromptFL: Let Federated Participants Cooperatively Learn Prompts Instead of Models -- Federated Learning in Age of Foundation Model
IEEE Transactions on Mobile Computing (IEEE TMC), 2022
Tao Guo
Song Guo
Junxiao Wang
Wenchao Xu
FedML
VLM
LRM
197
189
0
24 Aug 2022
Multimodal foundation models are better simulators of the human brain
Haoyu Lu
Qiongyi Zhou
Nanyi Fei
Zhiwu Lu
Mingyu Ding
...
Changde Du
Xin Zhao
Haoran Sun
Huiguang He
J. Wen
AI4CE
172
19
0
17 Aug 2022
PolarFly: A Cost-Effective and Flexible Low-Diameter Topology
International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2022
Kartik Lakhotia
Maciej Besta
Laura Monroe
K. Isham
Patrick Iff
Torsten Hoefler
Fabrizio Petrini
351
28
0
02 Aug 2022
Dive into Big Model Training
Qinghua Liu
Yuxiang Jiang
MoMe
AI4CE
LRM
108
3
0
25 Jul 2022
Machine Learning Model Sizes and the Parameter Gap
Pablo Villalobos
J. Sevilla
T. Besiroglu
Lennart Heim
A. Ho
Marius Hobbhahn
ALM
ELM
AI4CE
189
78
0
05 Jul 2022
DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale
International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2022
Reza Yazdani Aminabadi
Samyam Rajbhandari
Minjia Zhang
A. A. Awan
Cheng-rong Li
...
Elton Zheng
Jeff Rasley
Shaden Smith
Olatunji Ruwase
Yuxiong He
408
506
0
30 Jun 2022
RevBiFPN: The Fully Reversible Bidirectional Feature Pyramid Network
Conference on Machine Learning and Systems (MLSys), 2022
Vitaliy Chiley
Vithursan Thangarasa
Abhay Gupta
Anshul Samar
Joel Hestness
D. DeCoste
193
13
0
28 Jun 2022
LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models
International Conference on Learning Representations (ICLR), 2022
Gunho Park
Baeseong Park
Minsub Kim
Sungjae Lee
Jeonghoon Kim
Beomseok Kwon
S. Kwon
Byeongwook Kim
Youngjoo Lee
Dongsoo Lee
MQ
447
113
0
20 Jun 2022
Merak: An Efficient Distributed DNN Training Framework with Automated 3D Parallelism for Giant Foundation Models
IEEE Transactions on Parallel and Distributed Systems (TPDS), 2022
Zhiquan Lai
Shengwei Li
Xudong Tang
Ke-shi Ge
Weijie Liu
Yabo Duan
Linbo Qiao
Dongsheng Li
316
61
0
10 Jun 2022
A New Frontier of AI: On-Device AI Training and Personalization
Jijoong Moon
Parichay Kapoor
Ji Hoon Lee
Donghak Park
Seungbaek Hong
Hyungyu Lee
Donghyeon Jeong
Sungsik Kong
MyungJoo Ham
171
4
0
09 Jun 2022
Can Foundation Models Help Us Achieve Perfect Secrecy?
Simran Arora
Christopher Ré
FedML
245
12
0
27 May 2022
Reducing Activation Recomputation in Large Transformer Models
Conference on Machine Learning and Systems (MLSys), 2022
V. Korthikanti
Jared Casper
Sangkug Lym
Lawrence C. McAfee
M. Andersch
Mohammad Shoeybi
Bryan Catanzaro
AI4CE
300
385
0
10 May 2022
Training Personalized Recommendation Systems from (GPU) Scratch: Look Forward not Backwards
International Symposium on Computer Architecture (ISCA), 2022
Youngeun Kwon
Minsoo Rhu
143
30
0
10 May 2022
MiCS: Near-linear Scaling for Training Gigantic Model on Public Cloud
Proceedings of the VLDB Endowment (PVLDB), 2022
Zhen Zhang
Shuai Zheng
Yida Wang
Justin Chiu
George Karypis
Trishul Chilimbi
Mu Li
Xin Jin
451
47
0
30 Apr 2022
PaLM: Scaling Language Modeling with Pathways
Journal of Machine Learning Research (JMLR), 2022
Aakanksha Chowdhery
Sharan Narang
Jacob Devlin
Maarten Bosma
Gaurav Mishra
...
Kathy Meier-Hellstern
Douglas Eck
J. Dean
Slav Petrov
Noah Fiedel
PILM
LRM
1.2K
7,457
0
05 Apr 2022
DELTA: Dynamically Optimizing GPU Memory beyond Tensor Recomputation
Yu Tang
Chenyu Wang
Yufan Zhang
Yuliang Liu
Xingcheng Zhang
Linbo Qiao
Zhiquan Lai
Dongsheng Li
224
6
0
30 Mar 2022
Pathways: Asynchronous Distributed Dataflow for ML
Conference on Machine Learning and Systems (MLSys), 2022
P. Barham
Aakanksha Chowdhery
J. Dean
Sanjay Ghemawat
Steven Hand
...
Parker Schuh
Ryan Sepassi
Laurent El Shafey
C. A. Thekkath
Yonghui Wu
GNN
MoE
296
145
0
23 Mar 2022
DHEN: A Deep and Hierarchical Ensemble Network for Large-Scale Click-Through Rate Prediction
Buyun Zhang
Liangchen Luo
Xi Liu
Jay Li
Zeliang Chen
...
Yasmine Badr
Jongsoo Park
Jiyan Yang
Dheevatsa Mudigere
Ellie Wen
3DV
149
12
0
11 Mar 2022