Merak: An Efficient Distributed DNN Training Framework with Automated 3D Parallelism for Giant Foundation Models

IEEE Transactions on Parallel and Distributed Systems (TPDS), 2022
10 June 2022
Zhiquan Lai
Shengwei Li
Xudong Tang
Ke-shi Ge
Weijie Liu
Yabo Duan
Linbo Qiao
Dongsheng Li

Papers citing "Merak: An Efficient Distributed DNN Training Framework with Automated 3D Parallelism for Giant Foundation Models"

13 papers shown
AdaPtis: Reducing Pipeline Bubbles with Adaptive Pipeline Parallelism on Heterogeneous Models
Jihu Guo
Tenghui Ma
Wei Gao
Peng Sun
Jiaxing Li
Xun Chen
Yuyang Jin
Dahua Lin
28 Sep 2025
LeMix: Unified Scheduling for LLM Training and Inference on Multi-GPU Systems
Yufei Li
Zexin Li
Yinglun Zhu
Cong Liu
28 Jul 2025
Exploiting Block Coordinate Descent for Cost-Effective LLM Model Training
Zeyu Liu
Yunquan Zhang
Boyang Zhang
Guoyong Jiang
Xin Zhang
L. Xiao
Weifeng Zhang
Daning Cheng
23 May 2025
Mist: Efficient Distributed Training of Large Language Models via Memory-Parallelism Co-Optimization
European Conference on Computer Systems (EuroSys), 2025
Zhanda Zhu
Christina Giannoula
Muralidhar Andoorveedu
Qidong Su
Karttikeya Mangalam
Bojian Zheng
Gennady Pekhimenko
24 Mar 2025
Real-time and Downtime-tolerant Fault Diagnosis for Railway Turnout Machines (RTMs) Empowered with Cloud-Edge Pipeline Parallelism
Fan Wu
Muhammad Bilal
Haolong Xiang
Heng Wang
Jinjun Yu
Xiaolong Xu
04 Nov 2024
Poplar: Efficient Scaling of Distributed DNN Training on Heterogeneous GPU Clusters
AAAI Conference on Artificial Intelligence (AAAI), 2024
WenZheng Zhang
Yang Hu
Jing Shi
Xiaoying Bai
22 Aug 2024
Efficient Training of Large Language Models on Distributed Infrastructures: A Survey
Jiangfei Duan
Shuo Zhang
Zerui Wang
Lijuan Jiang
Wenwen Qu
...
Dahua Lin
Yonggang Wen
Xin Jin
Tianwei Zhang
Yang Liu
29 Jul 2024
Optimizing Large Model Training through Overlapped Activation Recomputation
Ping Chen
Wenjie Zhang
Shuibing He
Yingjie Gu
Zhuwei Peng
...
Yi Zheng
Zhefeng Wang
Yanlong Yin
Gang Chen
13 Jun 2024
InternEvo: Efficient Long-sequence Large Language Model Training via Hybrid Parallelism and Redundant Sharding
Qiaoling Chen
Diandian Gu
Guoteng Wang
Xun Chen
Yingtong Xiong
...
Qi Hu
Xin Jin
Yonggang Wen
Tianwei Zhang
Yang Liu
17 Jan 2024
Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates
Symposium on Operating Systems Principles (SOSP), 2023
Insu Jang
Zhenning Yang
Zhen Zhang
Xin Jin
Mosharaf Chowdhury
15 Sep 2023
Automated Tensor Model Parallelism with Overlapped Communication for Efficient Foundation Model Training
IEEE Transactions on Parallel and Distributed Systems (TPDS), 2023
Shengwei Li
Zhiquan Lai
Yanqi Hao
Weijie Liu
Ke-shi Ge
Xiaoge Deng
Dongsheng Li
KaiCheng Lu
25 May 2023
A 4D Hybrid Algorithm to Scale Parallel Training to Thousands of GPUs
Siddharth Singh
Prajwal Singhania
Aditya K. Ranjan
Zack Sating
A. Bhatele
22 May 2023
Colossal-Auto: Unified Automation of Parallelization and Activation Checkpoint for Large-scale Models
Yuliang Liu
Shenggui Li
Jiarui Fang
Yan Shao
Boyuan Yao
Yang You
06 Feb 2023